数据挖掘之聚类分析
合集下载
相关主题
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
No labels
Data driven
3
Clusters
Inter-Cluster
Intra-Cluster
4
Clusters
5
Applications of Clustering
Marketing Finding groups of customers with similar behaviours. Biology Finding groups of animals or plants with similar features.
E[ zij ]
p( x xi | j ) j
p( x x | )
k 1 i k
ij
n
k
e
n k 1
1 2
2
( xi j ) 2
j k
e
1 2
2
( xi k ) 2
j
E[ z
i 1 m i 1
m
]xi
ij
No iteration
A single pass of the data No need to specify K in advance. Choose a cluster threshold value. For every new data point: Compute the distance between the new data point and every cluster's centre. If the minimum distance is smaller than the chosen threshold, assign the new data point to the corresponding cluster and re-compute cluster centre. Otherwise, create a new cluster with the new data point as its centre. Clustering results may be influenced by the sequence of data points.
1 2 3 4 5 6
2
1
0 0
X
11
Normalization or Not?
12
13
Evaluation
c
J e x mi ,
2 i 1 xDi
1 mi ni
xDi
x
VS.
14
Evaluation
15
Silhouette
A method of interpretation and validation of clusters of data. A succinct graphical representation of how well each data point lies within its cluster compared to other clusters. a(i): average dissimilarity of i with all other points in the same cluster b(i): the lowest average dissimilarity of i to other clusters
Finding groups of objects
Objects similar to each other are in the same group.
Objects are different from those in other groups. Unsupervised Learning
Clustering
LOGO
Overview
Partitioning Methods K-Means Sequential Leader Model Based Methods Density Based Methods Hierarchical Methods
2
What is cluster analysis?
Social Networks Discovering groups of individuals with close friendships internally.
6
Earthquakes
7
Image Segmentation
8
The Big Picture
9
Requirements
E[ z
]
1 m j E[ zij ] m i 1
32
33
Density Based Methods
Generate clusters of arbitrary shapes. Robust against noise. No K value required in advance.
Scalability Ability to deal with different types of attributes Ability to discover clusters with arbitrary shape Minimum requirements for domain knowledge Ability to deal with noise and outliers Insensitivity to order of input records Incorporation of user-defined constraints
Start from a randomly selected unseen point P.
If P is a core point, build a cluster by gradually adding all points that are density reachable to the current point set. Noise points are discarded (unlabelled).
Bioinformatics Clustering microarray data, genes and sequences.
Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones. WWW Clustering weblog data to discover groups of similar access patterns.
Somewhat similar to human vision.
34
DBSCAN
Density-Based Spatial Clustering of Applications with Noise Density: number of points within a specified radius Core Point: points with high density Border Point: points with low density but in the neighbourhood of a core point Noise Point: neither a core point nor a border point
21
Comments on K-Means
Pros Simple and works well for regular disjoint clusters. Converges relatively fast. Relatively efficient and scalable O(t·k·n)
Repeat until no more new assignment. Return the K centroids. Reference
J. MacQueen (1967): "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol.1, pp. 281-297.
Leabharlann Baidu
b(i) a(i) s(i) max{ b(i), a(i)}
16
Silhouette
4
3
2
1
1
Cluster
0
-1
2
-2
-3 -3
-2
-1
0
1
2
3
4
-0.2
0
0.2
0.4 Silhouette Value
0.6
0.8
1
17
K-Means
18
K-Means
19
K-Means
20
K-Means
28
K-Means Revisited
model parameters
latent parameters
29
Expectation Maximization
30
31
EM: Gaussian Mixture
m : thenumber of data points n : thenumber of mixturecomponents zij : whethe r instancei is generatedby thejth Gaussian
Interpretability and usability
10
Practical Considerations
6
6
4
5
4
5
3
4
3
4
Y
2
Y
1
3
3
2
1
1
1
2
0 0 1 2 3 4 5 6
2
0 0 1 2 3 4 5 6
X
6
X1=0.2X
5
4
Y1=0.2Y
3
Scaling matters!
4 3 1 2
Core Point
Border Point
Noise Point
35
DBSCAN
q p p
q
directly density reachable
density reachable
p o
q
density connected
36
DBSCAN
A cluster is defined as the maximal set of density connected points.
May converge to local optima.
• In practice, try different initial centroids.
May be sensitive to noisy data and outliers.
• Mean of data points …
Not suitable for clusters of
Determine the value of K. Choose K cluster centres randomly. Each data point is assigned to its closest centroid.
Use the mean of each cluster to update each centroid.
25
26
Gaussian Mixture
g ( x, , )
n
1 2
2
e
( x ) 2 /( 2 2 )
f ( x) i g ( x, i , i ), i 0 & i 1
i 1 i
27
Clustering by Mixture Models
• Non-convex shapes
22
The Influence of Initial Centroids
23
The Influence of Initial Centroids
24
Sequential Leader Clustering
A very efficient clustering algorithm.
• t: iteration; k: number of centroids; n: number of data points
Cons Need to specify the value of K in advance.
• Difficult and domain knowledge may help.