Machine Learning chap7 - Clustering
CHAPTER 7: Clustering
Outline
What is clustering
Application examples
Data in clustering
Partition methods: K-means
Model based methods: EM
Hierarchical methods: Agglomerative nesting
Model selection
Conclusion
Clustering
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Clustering:
Grouping a set of data objects into clusters
Clustering is performed in an unsupervised manner (no sample category labels are provided)
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
What is a natural grouping among these objects?
Clustering is subjective
(Figure: the same objects can be grouped as Simpson's family vs. school employees, or as females vs. males)
What is similarity?
Defining distance measures
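As a small illustrative sketch (the function names below are my own, not from the slides), two common distance measures between feature vectors can be computed as follows:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean (L2) distance between two feature vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    """Manhattan (L1) distance between two feature vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))

# Example: distance between two 3-dimensional objects
a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(euclidean(a, b))  # about 3.606
print(manhattan(a, b))  # 5.0
```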
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Data in clustering
Data matrix (n objects described by p features):

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity matrix (pairwise distances d(i, j) between objects):

$$
\begin{bmatrix}
0      &        &        &        \\
d(2,1) & 0      &        &        \\
d(3,1) & d(3,2) & 0      &        \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & 0
\end{bmatrix}
$$
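As a hedged sketch (the function name is illustrative, not from the slides), the dissimilarity matrix can be filled in from the data matrix by evaluating a distance, here Euclidean, for every pair of objects:

```python
import numpy as np

def dissimilarity_matrix(X):
    """Lower-triangular matrix of pairwise Euclidean distances d(i, j)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            D[i, j] = np.linalg.norm(X[i] - X[j])  # d(i, j)
    return D

# Example: 4 objects with 2 features each
X = [[1, 2], [2, 2], [8, 9], [9, 9]]
print(dissimilarity_matrix(X))
```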
Data example
Major clustering approaches
Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model (a brief sketch of all three families follows)
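As an illustrative sketch (assuming scikit-learn is available; the toy data is invented for demonstration), the three families map onto familiar estimators: KMeans (partitioning), AgglomerativeClustering (hierarchical), and GaussianMixture fitted with EM (model-based):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Toy data: two well-separated blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# Partitioning: K-means
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: agglomerative nesting
agg_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Model-based: Gaussian mixture fitted with EM
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit(X).predict(X)
```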
Squared error
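A standard formulation of this criterion (a sketch assuming k clusters $C_1,\dots,C_k$ with centroids $m_1,\dots,m_k$; K-means seeks a partition that locally minimizes this within-cluster sum of squared errors):

$$
E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^{2}
$$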
k-means Clustering
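A minimal sketch of the standard Lloyd iteration (assuming Euclidean distance; the function and parameter names are illustrative, not from the slides): each object is assigned to its nearest centroid, then each centroid is recomputed as the mean of its cluster, until the centroids stop changing.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means (Lloyd's algorithm): returns labels and centroids."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct objects as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each object joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # centroids stabilized
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on toy 2-D data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
labels, centroids = kmeans(X, k=2)
```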
Comments on the K-Means Method
Strength
Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms (a simple multi-restart sketch follows this list)
Weakness
Applicable only when a mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable for discovering clusters with non-convex shapes
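As noted above, a single K-means run may stop at a local optimum. One common practical mitigation, shown here only as a sketch that reuses the illustrative kmeans function defined earlier (it is not the deterministic annealing or genetic approach named above), is to restart from several random initializations and keep the run with the smallest squared error:

```python
import numpy as np

def kmeans_best_of(X, k, n_restarts=10):
    """Run the illustrative kmeans() several times; keep the lowest-SSE result."""
    X = np.asarray(X, dtype=float)
    best_sse, best = np.inf, None
    for seed in range(n_restarts):
        labels, centroids = kmeans(X, k, seed=seed)  # kmeans() from the sketch above
        sse = ((X - centroids[labels]) ** 2).sum()   # within-cluster squared error
        if sse < best_sse:
            best_sse, best = sse, (labels, centroids)
    return best
```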