Data Mining: Practical Machine Learning Tools and Techniques — 06: Algorithm Implementation and Details
Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 6 of Data Mining by I. H. Witten, E. Frank and M. A. Hall
Implementation: Real machine learning schemes

● Decision trees: from ID3 to C4.5 (pruning, numeric attributes, ...)
● Classification rules: from PRISM to RIPPER and PART (pruning, numeric data, ...)
● Association rules: frequent-pattern trees
● Extending linear models: support vector machines and neural networks
● Instance-based learning: pruning examples, generalized exemplars, distance functions
● Numeric prediction: regression/model trees, locally weighted regression
● Bayesian networks: learning and prediction, fast data structures for learning
● Clustering: hierarchical, incremental, probabilistic, Bayesian
● Semisupervised learning: clustering for classification, co-training
● Multi-instance learning: converting to single-instance, upgrading learning algorithms, dedicated multi-instance methods

Industrial-strength algorithms

For an algorithm to be useful in a wide range of real-world applications it must:

● Permit numeric attributes
● Allow missing values
● Be robust in the presence of noise
● Be able to approximate arbitrary concept descriptions (at least in principle)

Basic schemes need to be extended to fulfill these requirements.
Numeric attributes

● Standard method: binary splits (e.g. temp < 45)
● Unlike nominal attributes, a numeric attribute has many possible split points
● Solution is a straightforward extension of the nominal case, but computationally more demanding
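The candidate split points for a binary split can be sketched as follows (a minimal illustration, not the book's or WEKA's implementation; the function name is made up). As described later for the temperature example, candidates are placed halfway between adjacent distinct sorted values.

```python
def candidate_split_points(values):
    """Candidate thresholds for a binary split on a numeric attribute:
    midpoints between adjacent distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# Temperature values from the weather data
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(candidate_split_points(temps))  # 71.5 is among the 11 candidates
```

Note that repeated values (72, 75) contribute only one distinct value each, so 14 instances yield just 11 candidate thresholds to evaluate.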
Weather data (again!)

Outlook     Temperature   Humidity   Windy   Play
Sunny       Hot           High       False   No
Sunny       Hot           High       True    No
Overcast    Hot           High       False   Yes
Rainy       Mild          High       False   Yes
…           …             …          …       …

With numeric attributes:

Outlook     Temperature   Humidity   Windy   Play
Sunny       85            85         False   No
Sunny       80            90         True    No
Overcast    83            86         False   Yes
Rainy       70            96         False   Yes
…           …             …          …       …

Rules for the nominal data:

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

Rules for the numeric data:

If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes

Example

● Split on temperature attribute:

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No

● E.g. temperature < 71.5: yes/4, no/2; temperature ≥ 71.5: yes/5, no/3
● Evaluate info gain (or other measure) for every possible split point of the attribute
● Choose the "best" split point; the info gain for the best split point is the info gain for the attribute

Info([4,2],[5,3]) = 6/14 × info([4,2]) + 8/14 × info([5,3]) = 0.939 bits

● Place split points halfway between values
● Can evaluate all split points in one pass!
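The split evaluation above can be checked with a short sketch (illustrative code, not the book's implementation; `info` and `split_info` are made-up names). `info` is entropy in bits of a class-count distribution, and `split_info` is the weighted average entropy of the two branches.

```python
from math import log2

def info(counts):
    """Entropy in bits of a class distribution, e.g. [4, 2] = 4 yes / 2 no."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def split_info(left, right):
    """Weighted average entropy of a binary split, i.e. Info([4,2],[5,3])."""
    n = sum(left) + sum(right)
    return sum(left) / n * info(left) + sum(right) / n * info(right)

# Temperature < 71.5 gives 4 yes / 2 no; temperature >= 71.5 gives 5 yes / 3 no
print(round(split_info([4, 2], [5, 3]), 3))  # 0.939 bits
```

Running this over every candidate split point and keeping the threshold with the lowest weighted entropy (highest info gain) is exactly the "straightforward extension" the slides describe.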
Binary vs. multiway splits

● Splitting (multi-way) on a nominal attribute exhausts all information in that attribute
  Nominal attribute is tested (at most) once on any path in the tree
● Not so for binary splits on numeric attributes!
  Numeric attribute may be tested several times along a path in the tree
● Disadvantage: tree is hard to read
● Remedy: pre-discretize numeric attributes, or use multi-way splits instead of binary ones

Can avoid repeated sorting

● Sort instances by the values of the numeric attribute
  Time complexity for sorting: O(n log n)
● Does this have to be repeated at each node of the tree?
● No! Sort order for children can be derived from sort order for parent
  Time complexity of derivation: O(n)
  Drawback: need to create and store an array of sorted indices for each numeric attribute
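The O(n) derivation of the children's sort order can be sketched like this (an illustrative fragment, not WEKA's code; names are made up): walk the parent's sorted index array once and append each index to the branch its instance goes down, so each child inherits the parent's relative order and never needs re-sorting.

```python
def derive_child_order(parent_order, goes_left):
    """Given the parent's sorted index array for some numeric attribute,
    derive both children's sorted index arrays in one O(n) pass:
    appending in parent order preserves sortedness in each child."""
    left, right = [], []
    for i in parent_order:
        (left if goes_left[i] else right).append(i)
    return left, right

# Toy data: sorted index array for 'temperature', split test 'temperature < 75'
temps = [85, 80, 83, 70, 68, 65]                                  # original instance order
parent_order = sorted(range(len(temps)), key=lambda i: temps[i])  # [5, 4, 3, 1, 2, 0]
goes_left = [t < 75 for t in temps]
left, right = derive_child_order(parent_order, goes_left)
print(left, right)  # both index lists remain sorted by temperature
```

In practice one such sorted index array is kept per numeric attribute (the drawback noted above), and all of them can be filtered this way when a node is split, regardless of which attribute the split tests.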
Decision trees

Extending ID3:
● To permit numeric attributes: straightforward
● To deal sensibly with missing values: trickier
● Stability for noisy data: requires a pruning mechanism

End result: C4.5 (Quinlan)
● Best-known and (probably) most widely-used learning algorithm
● Commercial successor: C5.0