Data Mining: Practical Machine Learning Tools and Techniques, Chapter 6: Algorithm Implementation and Details

Data Mining
Practical Machine Learning Tools and Techniques
Slides for Chapter 6 of Data Mining by I. H. Witten, E. Frank and M. A. Hall
Implementation: Real machine learning schemes



● Time complexity for sorting: O(n log n)
● Does this have to be repeated at each node of the tree?
● No! The sort order for the children can be derived from the sort order for the parent
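The derivation can be sketched as follows (a minimal Python illustration, not from the slides; `split_sorted`, `parent_order`, and the toy data are made up for the example). Because a single stable pass over the parent's sorted instances preserves relative order, each child's instance list comes out already sorted, in O(n):

```python
def split_sorted(parent_sorted, goes_left):
    """parent_sorted: instance indices in ascending attribute order.
    goes_left: predicate assigning each instance to the left/right child.
    Returns (left_sorted, right_sorted), each still in ascending order."""
    left, right = [], []
    for idx in parent_sorted:          # one O(n) stable pass, no re-sorting
        (left if goes_left(idx) else right).append(idx)
    return left, right

# Toy example: instances sorted by temperature; split on a different attribute.
temperature = {0: 64, 1: 65, 2: 68, 3: 69, 4: 70}
outlook     = {0: "sunny", 1: "rainy", 2: "sunny", 3: "overcast", 4: "rainy"}
parent_order = sorted(temperature, key=temperature.get)   # [0, 1, 2, 3, 4]

left, right = split_sorted(parent_order, lambda i: outlook[i] == "rainy")
# left and right remain sorted by temperature without calling sort again
```

The only cost, as the later slide notes, is storing one array of sorted indices per numeric attribute.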

Bayesian networks


Clustering: hierarchical, incremental, probabilistic


Semisupervised learning


Multi-instance learning

Converting to single-instance, upgrading learning algorithms, dedicated multi-instance methods

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 6)
Implementation: Real machine learning schemes

● Numeric prediction
Regression/model trees, locally weighted regression
● Bayesian networks
Learning and prediction, fast data structures for learning
● Clustering
Hierarchical, incremental, probabilistic, Bayesian
● Semisupervised learning
Clustering for classification, co-training
Industrial-strength algorithms
For an algorithm to be useful in a wide range of real-world applications it must:

Real machine learning schemes
Decision trees
From ID3 to C4.5 (pruning, numeric attributes, ...)


● Classification rules
From PRISM to RIPPER and PART (pruning, numeric data, …)
● Association rules
Numeric attributes

Standard method: binary splits
E.g. temp < 45


● Unlike nominal attributes, every numeric attribute has many possible split points
● Solution is a straightforward extension:


Computationally more demanding
Weather data (again!)
Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
…         …            …         …      …
Example

Split on temperature attribute:
64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No

E.g. temperature < 71.5: yes/4, no/2 and temperature ≥ 71.5: yes/5, no/3
Rules with nominal conditions:

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

With numeric splits on humidity instead of the nominal test:

If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
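The nominal rule set behaves like a first-match decision list. A minimal Python sketch (the `classify` function and the dictionary encoding of an instance are illustrative choices, not the book's code):

```python
def classify(inst):
    """Return the predicted play value for one weather instance;
    the first rule whose conditions hold wins."""
    if inst["outlook"] == "sunny" and inst["humidity"] == "high":
        return "no"
    if inst["outlook"] == "rainy" and inst["windy"]:
        return "no"
    if inst["outlook"] == "overcast":
        return "yes"
    if inst["humidity"] == "normal":
        return "yes"
    return "yes"  # "none of the above"

# First row of the weather table: Sunny, High humidity, not windy
print(classify({"outlook": "sunny", "humidity": "high", "windy": False}))  # no
```

Swapping the nominal humidity tests for the numeric thresholds (> 83, < 85) changes only the two `if` conditions; the first-match structure is identical.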

● Evaluate info gain (or another measure) for every possible split point of the attribute
● Choose the “best” split point
● Info gain for the best split point is the info gain for the attribute

Info([4,2],[5,3]) = 6/14 × info([4,2]) + 8/14 × info([5,3]) = 0.939 bits

● Place split points halfway between values
● Can evaluate all split points in one pass!
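The one-pass evaluation can be sketched in Python (an illustrative sketch, not Weka code; `info` and `all_split_infos` are assumed names). Sweeping the sorted instances while shifting class counts from the right partition to the left yields the expected information for every candidate midpoint:

```python
from math import log2

def info(*counts_lists):
    """Expected information for a split into the given class-count lists,
    e.g. info([4, 2], [5, 3]) for the temperature split at 71.5."""
    total = sum(sum(c) for c in counts_lists)
    def entropy(counts):
        s = sum(counts)
        return -sum(c / s * log2(c / s) for c in counts if c)
    return sum(sum(c) / total * entropy(c) for c in counts_lists)

def all_split_infos(values, labels):
    """One pass over the sorted values: expected info for every candidate
    split point placed halfway between adjacent distinct values."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = dict.fromkeys(classes, 0)
    right = dict.fromkeys(classes, 0)
    for _, y in pairs:
        right[y] += 1
    result = {}
    for i, (v, y) in enumerate(pairs[:-1]):   # shift one instance at a time
        left[y] += 1
        right[y] -= 1
        if v != pairs[i + 1][0]:              # no split between equal values
            mid = (v + pairs[i + 1][0]) / 2   # halfway between values
            result[mid] = info(list(left.values()), list(right.values()))
    return result

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Y","N","Y","Y","Y","N","N","Y","Y","Y","N","Y","Y","N"]
infos = all_split_infos(temps, play)
# infos[71.5] matches the slide's info([4,2],[5,3]) ≈ 0.939 bits
```

The attribute's info gain is then the class entropy minus the smallest value in `infos`, computed in one O(n) sweep after the initial sort.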
Binary vs multiway splits
Splitting (multi-way) on a nominal attribute exhausts all information in that attribute

Nominal attribute is tested (at most) once on any path in the tree

Frequent-pattern trees

Extending linear models
Support vector machines and neural networks


Instance-based learning
Pruning examples, generalized exemplars, distance functions
Can avoid repeated sorting

● Sort instances by the values of the numeric attribute
● Time complexity of deriving the children's sort order from the parent's: O(n)
● Drawback: need to create and store an array of sorted indices for each numeric attribute

Not so for binary splits on numeric attributes!

● A numeric attribute may be tested several times along a path in the tree
● Disadvantage: the tree is hard to read
● Remedy: pre-discretize the attribute, or use multi-way splits instead of binary ones


Extending ID3:

● To permit numeric attributes: straightforward
● To deal sensibly with missing values: trickier
● Stability for noisy data: requires a pruning mechanism

● Permit numeric attributes
● Allow missing values
● Be robust in the presence of noise
● Be able to approximate arbitrary concept descriptions (at least in principle)
Weather data (again!), with numeric temperature and humidity:

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
…         …            …         …      …


End result: C4.5 (Quinlan)

● Best-known and (probably) most widely used learning algorithm
● Commercial successor: C5.0

Basic schemes need to be extended to fulfill these requirements
