Clustering: Introduction to Clustering Algorithms in Data Mining


Detect spatial clusters or for other spatial mining tasks

Image Processing
Economic Science (especially market research)

Software package
S-Plus, SPSS, SAS
2011/10/31 6
Examples of Clustering Applications

Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Biology: categorize genes with similar functionality

Cluster analysis
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Outlier Analysis
Summary
Data Structures

Standardize data
Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
The answer to what counts as "similar enough" or "good enough" is typically highly subjective

Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
Density-Based Methods
Grid-Based Methods

Interval-scaled variables
Binary variables
Nominal variables
Ordinal variables
Ratio variables
Variables of mixed types
Interval-valued Variables
Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects
Some popular ones include the Minkowski distance.

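Dissimilarity matrix (one mode)
$$\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$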
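As a rough illustration of how the lower-triangular dissimilarity matrix can be filled in from a data matrix, here is a minimal Python sketch. It assumes the Manhattan distance as d(i, j), which is only one possible choice, and the function name `dissimilarity_matrix` and the sample data are hypothetical.

```python
def dissimilarity_matrix(data):
    """Build the lower triangle of an n-by-n dissimilarity matrix.

    Assumption: d(i, j) is the Manhattan distance; any other
    distance function could be substituted.
    """
    n = len(data)
    matrix = []
    for i in range(n):
        row = []
        # Only j <= i is stored, because d(i, j) = d(j, i) and d(i, i) = 0.
        for j in range(i + 1):
            row.append(sum(abs(a - b) for a, b in zip(data[i], data[j])))
        matrix.append(row)
    return matrix


# Three objects described by two variables each (made-up values).
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = dissimilarity_matrix(data)
for row in d:
    print(row)
```

Storing only the lower triangle mirrors the one-mode layout above: n objects need n(n-1)/2 distinct distances rather than n squared.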
Type of Data in Clustering Analysis

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Requirements of Clustering

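Data matrix (two modes)
$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$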
Data mining: the core of the knowledge discovery process
[Figure: knowledge discovery pipeline. Databases and flat files feed Data Cleaning and Integration, followed by the Data Warehouse, Selection and Transformation, Data Mining, and Pattern Evaluation.]

WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns
What Is Good Clustering?

A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation
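One simple way to make "high intra-class similarity, low inter-class similarity" concrete is to compare average pairwise distances within clusters and between clusters. The sketch below is only an illustration under stated assumptions: it uses Euclidean distance, the helper names `euclidean` and `mean_intra_inter` are hypothetical, and it expects at least one within-cluster pair and one between-cluster pair.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean_intra_inter(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumption: there is at least one pair of each kind, otherwise
    the corresponding mean would divide by zero.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Same label: the pair lies inside one cluster; otherwise it spans two.
        (intra if lp == lq else inter).append(euclidean(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Two well-separated groups of made-up points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
intra_mean, inter_mean = mean_intra_inter(points, labels)
print(intra_mean, inter_mean)
```

A small within-cluster mean relative to the between-cluster mean suggests compact, well-separated clusters, although the result still depends on the chosen distance function.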
Clustering: Rich Applications and Multidisciplinary Efforts

Pattern Recognition
GIS
Create thematic maps in GIS by clustering feature spaces

Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms

If q = 1, d is the Manhattan distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Minkowski distance:
$$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
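The Minkowski distance translates almost directly into code. The following is a minimal sketch rather than a reference implementation; the function name `minkowski_distance` and the sample objects are hypothetical. It also evaluates q = 2, which gives the familiar Euclidean distance (a standard fact, though not stated in the text above).

```python
def minkowski_distance(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be a positive integer")
    # Sum the q-th powers of the coordinate differences, then take the q-th root.
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
manhattan = minkowski_distance(i, j, 1)  # |1-4| + |2-6| + |3-3|
euclidean = minkowski_distance(i, j, 2)  # sqrt(3^2 + 4^2 + 0^2)
print(manhattan, euclidean)
```

Larger values of q give more weight to the coordinate with the largest difference.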

Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data

Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

Using mean absolute deviation is more robust than using standard deviation
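The standardization steps for a single interval-scaled variable f can be sketched as follows. This is an illustrative example only: the function name `standardize` and the sample values are hypothetical, and a constant variable (where the mean absolute deviation is zero) is rejected because its z-scores are undefined.

```python
def standardize(values):
    """Return z-scores of one variable, using the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                        # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    if s_f == 0:
        raise ValueError("variable is constant; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


# Made-up measurements of one variable for four objects.
heights_cm = [160.0, 170.0, 180.0, 190.0]
z = standardize(heights_cm)
print(z)
```

Because the deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, so the z-score of an outlier stays comparatively large and easier to spot.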
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
Weights may be assigned to different variables based on applications and data semantics
It is hard to define "similar enough" or "good enough"
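To illustrate how per-variable weights can change a dissimilarity value, here is a small sketch of a weighted Manhattan distance. The weighting scheme, the function name `weighted_manhattan`, and the customer data are assumptions made for this example; the text above does not prescribe a particular formula.

```python
def weighted_manhattan(x, y, weights):
    """Manhattan distance in which each variable has its own weight."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))


# Hypothetical customers described by (age in years, purchases per month).
customer_a = (30.0, 2.0)
customer_b = (40.0, 5.0)
unweighted = weighted_manhattan(customer_a, customer_b, (1.0, 1.0))
# Down-weight age and up-weight purchase frequency.
weighted = weighted_manhattan(customer_a, customer_b, (0.5, 2.0))
print(unweighted, weighted)
```

With equal weights the age difference dominates, while the second weighting lets purchase frequency contribute more, which is the kind of application-driven choice the text refers to.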
Data Mining
Instructor: Associate Prof. Ying Liu
Graduate University of Chinese Academy of Sciences
Fictitious Economy and Data Science Research Center
Review
Measure the Quality of Clustering

Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically metric: d(i, j)
What is Cluster Analysis?

Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters