w07: Cluster Analysis
Clustering
Clustering Methods
• Partitional
• Hierarchical
• Density-based
• Mixture model
• Spectral methods
Density-based Clustering
• Basic idea
– Clusters are dense regions in the data space, separated by regions of lower object density
– A cluster is defined as a maximal set of density-connected points
– Discovers clusters of arbitrary shape
• DBSCAN (Density-based Spatial Clustering of Applications with Noise)
• ε-Neighborhood: the objects within a radius ε of an object, N_ε(p) = {q | d(p, q) ≤ ε}
• "High density": the ε-neighborhood of an object contains at least MinPts objects
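These two definitions can be coded directly. A minimal sketch, where the points and the values of ε and MinPts are hypothetical toy choices:

```python
from math import dist

def eps_neighborhood(points, p, eps):
    """Return all points within radius eps of p (including p itself)."""
    return [q for q in points if dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    """p has 'high density' if its eps-neighborhood holds at least MinPts objects."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

# Toy data: a dense blob near the origin plus one isolated point.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(is_core(pts, (0.0, 0.0), eps=0.5, min_pts=4))  # dense blob -> True
print(is_core(pts, (5.0, 5.0), eps=0.5, min_pts=4))  # isolated point -> False
```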
When DBSCAN Works Well
Original Points
Clusters
• Resistant to noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
Expectation-Maximization (EM) Algorithm
• E-step: for given parameter values we can compute the expected values of the latent variables (responsibilities of data points)
Density Definition
• ε-Neighborhood: the objects within a radius ε of an object, N_ε(p) = {q | d(p, q) ≤ ε}
• "High density": the ε-neighborhood of an object contains at least MinPts objects
– Can be used for non-spherical clusters
– Can generate clusters with different probabilities
Mixture Model
• Strengths
– Gives probabilistic cluster assignments
– Has a probabilistic interpretation
– Can handle clusters with varying sizes, variances, etc.
where π_k, μ_k, Σ_k are the parameters to be estimated
Gaussian Mixture
• To generate a data point:
  – first pick one of the K components with probability π_k (the mixing weight)
  – then draw a sample from that component's distribution
• Each data point is generated by one of K components; a latent variable is associated with each point
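This two-step generative process can be sketched in a few lines of NumPy; the mixing weights, means, and standard deviations below are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D mixture: K = 2 components with mixing weights pi_k.
pi = np.array([0.3, 0.7])      # P(component k)
mu = np.array([-2.0, 3.0])     # component means
sigma = np.array([0.5, 1.0])   # component standard deviations

def sample_gmm(n):
    """Two-step generative process: pick a component, then sample from it."""
    z = rng.choice(len(pi), size=n, p=pi)   # latent variable per point
    x = rng.normal(mu[z], sigma[z])         # draw from the chosen Gaussian
    return x, z

x, z = sample_gmm(1000)
```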
Motivation
• Complex cluster shapes
– Spectral approach
– K-means performs poorly because it can only find spherical clusters
– Density-based approaches are sensitive to parameters
• Spectral approach
  – Use graph information: data points are vertices of a graph
  – Connect points which are "close"
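A minimal sketch of the spectral idea, assuming a Gaussian-kernel similarity graph and a two-way split via the sign of the Fiedler vector of the unnormalized Laplacian L = D − W (one common textbook recipe, not necessarily the exact variant on these slides; the kernel width sigma is an assumed parameter):

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Split points into two groups using the sign of the Fiedler vector."""
    # Similarity graph: connect points that are "close" via a Gaussian kernel.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second-smallest eigenvalue (Fiedler vector).
    _, vecs = np.linalg.eigh(L)
    return (vecs[:, 1] > 0).astype(int)

# Two well-separated 1-D groups; the sign split recovers them.
X = np.array([[0.0], [0.1], [0.2], [3.0], [3.1], [3.2]])
labels = spectral_bipartition(X)
```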
Gaussian Distribution
Likelihood
Maximum Likelihood Estimate
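The equations on these slides did not survive extraction; for reference, the standard univariate forms are:

```latex
% Gaussian (normal) density
\mathcal{N}(x \mid \mu,\sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}\,
    \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

% Likelihood of i.i.d. data x_1,\dots,x_N
p(x_1,\dots,x_N \mid \mu,\sigma^2)
  = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu,\sigma^2)

% Maximum likelihood estimates
\hat\mu = \frac{1}{N}\sum_{n=1}^{N} x_n,
\qquad
\hat\sigma^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n-\hat\mu)^2
```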
Gaussian Mixture
• Linear combination of Gaussians:
  p(x) = Σ_k π_k N(x | μ_k, Σ_k), where the mixing coefficients π_k sum to 1
• Can be optimized by an EM algorithm
– E-step: assign points to clusters
– M-step: optimize clusters
– Performs hard assignment during E-step
• Assumes spherical clusters with equal probability for each cluster
• Probabilistic clustering
– Each cluster is mathematically represented by a parametric distribution
– The entire data set is modeled by a mixture of these distributions
[Figure: the ε-neighborhoods of points p and q. The density of p is "high" and the density of q is "low" (MinPts = 4).]
– Note that we still have the sum over components inside the logarithm instead of outside it, but we can now take expectations over the latent variables
Expectation-Maximization (EM) Algorithm
• M-step: maximize the expected complete log likelihood
• Parameter update:
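The update equations were lost in extraction; the standard M-step updates for a Gaussian mixture, given the E-step responsibilities γ_nk, are:

```latex
% E-step responsibilities (for reference)
\gamma_{nk} = \frac{\pi_k\,\mathcal{N}(x_n \mid \mu_k,\Sigma_k)}
                   {\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n \mid \mu_j,\Sigma_j)}

% M-step updates, with N_k = \sum_{n=1}^{N}\gamma_{nk}
\mu_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\,x_n
\qquad
\Sigma_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\,
    (x_n-\mu_k^{\text{new}})(x_n-\mu_k^{\text{new}})^{\mathsf T}
\qquad
\pi_k^{\text{new}} = \frac{N_k}{N}
```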
EM Algorithm
• Iterate E-step and M-step until the log likelihood of the data does not increase any more
  – Converges to a local optimum
  – May need to restart the algorithm with different initial parameter guesses (as in K-means)
• Relation to K-means
  – Consider a GMM with common covariance
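The E/M iteration can be sketched for a one-dimensional, two-component mixture; the synthetic data and initial parameter guesses below are assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 1-D data drawn from two well-separated Gaussians.
x = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(3, 1.0, 150)])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Initial guesses for the parameters.
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

log_liks = []
for _ in range(50):
    # E-step: responsibilities (soft assignments) of each point to each component.
    dens = pi * gauss(x[:, None], mu, var)          # shape (N, 2)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    log_liks.append(np.log(dens.sum(axis=1)).sum())  # incomplete log likelihood
    # M-step: re-estimate parameters from the weighted data.
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)
```

As the slide states, the log likelihood never decreases across iterations, and the recovered means should land near the true component means.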
DBSCAN: Determining Eps and MinPts
• Idea: for points in a cluster, their k-th nearest neighbors are at roughly the same distance
• Noise points have their k-th nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its k-th nearest neighbor
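The k-distance computation behind this plot is a few lines of NumPy; the toy data points below are hypothetical:

```python
import numpy as np

def kth_nn_distances(X, k):
    """Sorted distance from every point to its k-th nearest neighbor
    (used to eyeball an Eps value: look for the 'knee' in this curve)."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    d.sort(axis=1)            # column 0 is each point's distance to itself (0)
    return np.sort(d[:, k])   # k-th neighbor distance, sorted over all points

# Toy data: a tight cluster (unit square) plus one noise point far away.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]], dtype=float)
kd = kth_nn_distances(X, k=2)
# The noise point contributes the single large value at the end of the curve.
```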
Gaussian Mixture
• Maximize log likelihood
• Without knowing values of latent variables, we have to maximize the incomplete log likelihood
• Weakness
– Initialization matters
– Requires choosing appropriate component distributions
– Overfitting issues
Clustering: Spectral Methods
• Motivation
– Complex cluster shapes
Using Probabilistic Models for Clustering
Mixture Model-based Clustering
• Hard vs. soft clustering
– Hard clustering: every point belongs to exactly one cluster
– Soft clustering: every point belongs to several clusters with certain degrees
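The distinction in code, using a hypothetical soft-assignment matrix (each row is one point's membership degrees over three clusters):

```python
import numpy as np

# Hypothetical soft-clustering output: rows sum to 1.
soft = np.array([[0.8, 0.15, 0.05],
                 [0.1, 0.2,  0.7 ],
                 [0.4, 0.35, 0.25]])

# Hard clustering keeps only the single most likely cluster per point.
hard = soft.argmax(axis=1)
print(hard)  # point 0 -> cluster 0, point 1 -> cluster 2, point 2 -> cluster 0
```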
Density-reachability
DBSCAN Algorithm: Example
DBSCAN: Sensitive to Parameters
Core, Border & Outlier
Example
Density-reachability
• Directly density-reachable: an object q is directly density-reachable from object p if p is a core object and q is in p's ε-neighborhood
– Maximize log-likelihood
EM algorithm
– E-step: Compute posterior probability of membership – M-step: Optimize parameters – Perform soft assignment during E-step
[Figure: original points and DBSCAN results with (MinPts = 4, Eps = 9.92) vs. (MinPts = 4, Eps = 9.75).]
• Cannot handle varying densities
• Sensitive to parameters: hard to determine the correct set of parameters
– As the common covariance shrinks to zero, the two methods coincide
K-means vs. GMM
• Objective function (K-means)
  – Minimize the sum of squared Euclidean distances
• Objective function (GMM)
  – Maximize the log likelihood of the data
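Spelled out (standard forms, reconstructed since the slide equations were lost):

```latex
% K-means: minimize the sum of squared Euclidean distances
J = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\,\lVert x_n - \mu_k\rVert^2,
\quad r_{nk}\in\{0,1\} \ \text{(hard assignment)}

% GMM: maximize the log likelihood
\ln p(X) = \sum_{n=1}^{N}\ln\!\left(\sum_{k=1}^{K}\pi_k\,
           \mathcal{N}(x_n \mid \mu_k,\Sigma_k)\right)
```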
Density-reachability
[Figure: p is a core object; q lies in p's ε-neighborhood (MinPts = 4).]
• q is directly density-reachable from p
• p is not directly density-reachable from q
• Density-reachability is asymmetric
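The asymmetry can be checked directly in code; the point configuration, eps, and MinPts below are assumed toy values chosen so that p is a core object but q is not:

```python
from math import dist

def directly_density_reachable(points, p, q, eps, min_pts):
    """q is directly density-reachable from p iff p is a core object
    and q lies in p's eps-neighborhood."""
    neigh_p = [r for r in points if dist(p, r) <= eps]
    return len(neigh_p) >= min_pts and q in neigh_p

# Toy configuration: p sits in a dense blob, q on its edge.
pts = [(0.0, 0.0), (-0.5, 0.0), (0.0, -0.5), (0.5, -0.5), (0.75, 0.0)]
p, q = (0.0, 0.0), (0.75, 0.0)
print(directly_density_reachable(pts, p, q, eps=0.8, min_pts=4))  # True
print(directly_density_reachable(pts, q, p, eps=0.8, min_pts=4))  # False: asymmetric
```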