Experiment Report: Cluster Analysis
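Principles: K-means clustering, k-medoids (PAM) clustering, hierarchical clustering, and EM-algorithm clustering.
Task: run a cluster analysis on the iris data set.
Requirements: explore the basic features of the iris data, apply several clustering methods, and state the main conclusions with a brief interpretation.
Analysis report: data(iris)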
> rm(list=ls())
> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 431730          929718        607591
Vcells 787605         8388608       1592403
> data(iris)
> data <- iris
> head(data)
  Species
1  setosa
2  setosa
3  setosa
4  setosa
5  setosa
6  setosa
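The exploration step named in the requirements can be sketched with base R alone. The snippet below is a minimal illustration rather than the report's own code; the variable names `species_counts` and `species_means` are hypothetical.

```r
# Minimal sketch: basic exploration of the built-in iris data set (base R only).
data(iris)

str(iris)        # structure: 4 numeric measurements plus the Species factor
summary(iris)    # per-column summaries, including counts per species

# Number of flowers per species (hypothetical variable name)
species_counts <- table(iris$Species)
print(species_counts)

# Mean of every measurement within each species (hypothetical variable name)
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

The per-species means give a first hint of which measurements separate the species before any clustering is run.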
# K-means clustering
> newiris <- iris
> newiris$Species <- NULL
> (kc <- kmeans(newiris, 3))
K-means clustering with 3 clusters of sizes 62, 50, 38
Cluster means:
1
Clustering vector:
  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [41] 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1
 [81] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3 3 1 3 3 3 3 3 3 1 1 3 3 3 3 1
[121] 3 1 3 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 1 3 3 1
Within cluster sum of squares by cluster:
[1]
 (between_SS / total_SS = %)
Available components:
[1] "cluster" "centers" "totss" "withinss" ...........
[6] "betweenss" "size" "iter" "ifault"
> table(iris$Species, kc$cluster)
              1  2  3
  setosa      0 50  0
  versicolor 48  0  2
  virginica  14  0 36
> plot(newiris[c("", "")], col = kc$cluster)
> points(kc$centers[,c("", "")], col = 1:3, pch = 8, cex=2)
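Because `kmeans()` starts from random centers, repeated runs can number the clusters differently or settle in a poorer local optimum. The following is a minimal sketch of a repeatable run, assuming a fixed seed and several random starts; the seed value, `nstart = 25`, and the two columns chosen for plotting are assumptions of this sketch, not settings taken from the report.

```r
# Minimal sketch: reproducible k-means on the four iris measurements (base R only).
data(iris)
newiris <- iris[, 1:4]            # drop the Species label before clustering

set.seed(42)                      # assumed seed; any fixed value makes the run repeatable
kc <- kmeans(newiris, centers = 3, nstart = 25)  # 25 random starts, best one is kept

# Cross-tabulate true species against cluster labels.
# Cluster numbering is arbitrary, so only the counts are meaningful.
print(table(iris$Species, kc$cluster))

# Share of the total sum of squares explained by the partition
print(kc$betweenss / kc$totss)

# Two of the four dimensions, with cluster centers overlaid.
# The choice of these two columns is an assumption of this sketch.
cols <- c("Sepal.Length", "Sepal.Width")
plot(newiris[cols], col = kc$cluster)
points(kc$centers[, cols], col = 1:3, pch = 8, cex = 2)
```

Reading the table row by row shows which species are split across clusters; in the run reported above, setosa is isolated while versicolor and virginica overlap.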
# K-medoids clustering
> ("cluster")
> library(cluster)
> <-pam(iris,3)
> table(iris$Species,$clustering)
              1  2  3
  setosa     50  0  0
  versicolor  0  3 47
  virginica   0 49  1
> layout(matrix(c(1,2),1,2))
> plot
[Figure: scatter plot with Sepal.Length on the x-axis]
[Figure: left panel, cluster plot on Component 1 and Component 2; right panel, silhouette plot of pam(x = iris, k = 3), n = 150, 3 clusters, average silhouette width 0.57]
> layout(matrix(1))
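The call `pam(iris, 3)` above passes the whole data frame, so the Species factor is most likely converted to integer codes and used as a fifth variable, which would explain the near-perfect agreement with the species labels. A minimal sketch that clusters on the four measurements only is given below; the object name `iris_pam` is hypothetical, and the result is expected to differ from the table above.

```r
# Minimal sketch: k-medoids (PAM) on the numeric columns only.
library(cluster)                  # recommended package shipped with standard R

data(iris)
iris_pam <- pam(iris[, 1:4], k = 3)   # hypothetical object name

# Species versus medoid-based cluster labels
print(table(iris$Species, iris_pam$clustering))

# Row indices of the three medoids (medoids are actual observations)
print(iris_pam$id.med)

# Average silhouette width: values closer to 1 mean better separated clusters
print(iris_pam$silinfo$avg.width)

# Cluster plot and silhouette plot side by side
layout(matrix(c(1, 2), 1, 2))
plot(iris_pam)
layout(matrix(1))
```

Unlike k-means centers, the medoids are real flowers from the data set, which makes them easier to inspect directly.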
# Hierarchical clustering (hclust)
> <-hclust(dist(iris[,1:4]))
> plot( , hang = -1)
> plclust( , labels = FALSE, hang = -1)
> re <- , k = 3)
> <-cutree, 3)
[Figure: dendrogram of dist(iris[, 1:4]), hclust (*, "complete")]
# Use the pruning function cutree() with parameter h to output the dendrogram classes at height = 18
> sapply(unique,
+ function(g)iris$Species[==g])
[[1]]
[1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[12] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[23] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[34] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[45] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
[[2]]
[1] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[8] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[15] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[22] versicolor versicolor virginica virginica virginica virginica virginica
[29] virginica virginica virginica virginica virginica virginica virginica
[36] virginica virginica virginica virginica virginica virginica virginica
[43] virginica virginica virginica virginica virginica virginica virginica
[50] virginica virginica virginica virginica virginica virginica virginica
[57] virginica virginica virginica virginica virginica virginica virginica
[64] virginica virginica virginica virginica virginica virginica virginica
[71] virginica virginica
Levels: setosa versicolor virginica
[[3]]
[1] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[8] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[15] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[22] versicolor versicolor versicolor versicolor versicolor versicolor virginica
Levels: setosa versicolor virginica
> plot
> ,k=4,border="light grey") # frame the 4-cluster result with light grey rectangles
> ,k=3,border="dark grey") # frame the 3-cluster result with dark grey rectangles
> ,k=7,which=c(2,6),border="dark grey")
[Figure: Cluster Dendrogram]
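The hierarchical steps above (build the tree, draw it, frame a k-cluster solution, cut the tree) can be sketched with base R as follows. The object names `iris_hc` and `iris_id` are hypothetical, and `plot()` is used in place of `plclust()`, which to my knowledge is no longer available in recent R versions.

```r
# Minimal sketch: complete-linkage hierarchical clustering (base R only).
data(iris)
iris_hc <- hclust(dist(iris[, 1:4]), method = "complete")  # hypothetical name

# Dendrogram without leaf labels, leaves aligned at the bottom
plot(iris_hc, labels = FALSE, hang = -1)
rect.hclust(iris_hc, k = 3, border = "dark grey")   # frame the 3-cluster solution

# Cut the tree into 3 groups and compare them with the species labels
iris_id <- cutree(iris_hc, k = 3)                    # hypothetical name
print(table(iris$Species, iris_id))

# cutree() can also cut at a height instead of a cluster count;
# the height 4 is an arbitrary value chosen for illustration
print(table(cutree(iris_hc, h = 4)))
```

The cross-table carries the same information as the `sapply()` listing above in a more compact form.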
# DBSCAN: density-based clustering
> ("fpc")
> library(fpc)
> ds1=dbscan(iris[,1:4],eps=1,MinPts=5)
> ds1
dbscan Pts=150 MinPts=5 eps=1
        1   2
border  0   1
seed   50  99
total  50 100
> ds2=dbscan(iris[,1:4],eps=4,MinPts=5)
> ds3=dbscan(iris[,1:4],eps=4,MinPts=2)
> ds4=dbscan(iris[,1:4],eps=8,MinPts=2)
> par(mfcol=c(2,2))
> plot(ds1,iris[,1:4],main="1: MinPts=5 eps=1")
> plot(ds3,iris[,1:4],main="3: MinPts=2 eps=4")
> plot(ds2,iris[,1:4],main="2: MinPts=5 eps=4")
> plot(ds4,iris[,1:4],main="4: MinPts=2 eps=8")
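[Figure: 2 x 2 grid of DBSCAN scatterplot matrices for ds1 to ds4; panel 1 uses radius parameter 1 and density threshold 5; the surviving panel title is "4: MinPts=2 eps=8"]
> d=dist(iris[,1:4]) # compute the distance matrix d of the data set
> max(d);min(d) # largest and smallest pairwise distance between samples
[1]
[1] 0
> ("ggplot2")
> library(ggplot2)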
> interval=cut_interval(d,30)
> table(interval)
interval
 88 585 876 891 831 688
543 369 379 339 335 406
458 459 465 480 468 505
349 385 321 291 187 138
 97  92  78  50  18   4
> (table(interval))
,]
4
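The distance distribution above is what guides the choice of `eps`. A complementary heuristic, which the report does not use, is to look at each point's distance to its k-th nearest neighbour and pick `eps` near the bend of the sorted curve. The sketch below uses base R only; `k = 4` and the names `dm` and `knn_dist` are assumptions of this illustration.

```r
# Minimal sketch: k-nearest-neighbour distances as a guide for eps (base R only).
data(iris)
d <- dist(iris[, 1:4])
dm <- as.matrix(d)                # full 150 x 150 symmetric distance matrix

k <- 4                            # assumed value; often taken as MinPts - 1
# For each row, sort the distances; position 1 is the point itself (distance 0),
# so position k + 1 holds the distance to the k-th nearest neighbour.
knn_dist <- apply(dm, 1, function(row) sort(row)[k + 1])

print(range(d))                   # smallest and largest pairwise distance
print(summary(knn_dist))

# Sorted k-NN distances: a sharp bend suggests a candidate eps
plot(sort(knn_dist), type = "l",
     xlab = "points sorted by distance", ylab = "4-NN distance")
```

If nearly all k-NN distances lie well below a candidate `eps`, that `eps` will tend to merge the data into very few clusters.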
> for(i in 3:5)
+ { for(j in 1:10)
+ { ds=dbscan(iris[,1:4],eps=i,MinPts=j)
+ print(ds)
+ }
+ }
dbscan Pts=150 MinPts=1 eps=3
        1
seed  150
total 150
dbscan Pts=150 MinPts=2 eps=3
        1
seed  150
total 150
dbscan Pts=150 MinPts=3 eps=3
        1
seed  150
total 150
dbscan Pts=150 MinPts=4 eps=3
        1
seed  150
total 150
dbscan Pts=150 MinPts=5 eps=3
        1
seed  150
total 150
dbscan Pts=150 MinPts=6 eps=3
        1
seed  150
total 150
dbscan Pts=150 MinPts=7 eps=3
        1
seed  150
total 150
dbscan Pts=150 MinPts=8 eps=3
        1
seed  150
total 150
dbscan Pts=150 MinPts=9 eps=3
        1
seed  150
total 150
dbscan Pts=150 MinPts=10 eps=3
        1
seed  150
total 150
dbscan Pts=150 MinPts=1 eps=4
        1
seed  150
total 150
dbscan Pts=150 MinPts=2 eps=4
        1
seed  150
total 150
dbscan Pts=150 MinPts=3 eps=4
        1
seed  150
total 150
dbscan Pts=150 MinPts=4 eps=4
        1
seed  150
total 150
dbscan Pts=150 MinPts=5 eps=4
        1
seed  150
total 150
dbscan Pts=150 MinPts=6 eps=4
        1
seed  150
total 150
dbscan Pts=150 MinPts=7 eps=4
        1
seed  150
total 150
dbscan Pts=150 MinPts=8 eps=4
        1
seed  150
total 150
dbscan Pts=150 MinPts=9 eps=4
        1
seed  150
total 150
dbscan Pts=150 MinPts=10 eps=4
        1
seed  150
total 150
dbscan Pts=150 MinPts=1 eps=5
        1
seed  150
total 150
dbscan Pts=150 MinPts=2 eps=5
        1
seed  150
total 150
dbscan Pts=150 MinPts=3 eps=5
        1
seed  150
total 150
dbscan Pts=150 MinPts=4 eps=5
        1
seed  150
total 150
dbscan Pts=150 MinPts=5 eps=5
        1
seed  150
total 150
dbscan Pts=150 MinPts=6 eps=5
        1
seed  150
total 150
dbscan Pts=150 MinPts=7 eps=5
        1
seed  150
total 150
dbscan Pts=150 MinPts=8 eps=5
        1
seed  150
total 150
dbscan Pts=150 MinPts=9 eps=5
        1
seed  150
total 150
dbscan Pts=150 MinPts=10 eps=5
        1
seed  150
total 150
# Clustering results of the 30 dbscan runs
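Every one of the 30 runs returns a single cluster in which all 150 points are seed (core) points. One way to see why, sketched here with base R only, is that when every point is a core point the DBSCAN clusters coincide with the connected components of the graph that links points at distance at most `eps`, and cutting a single-linkage dendrogram at height `eps` returns exactly those components. This is a swapped-in check, not the `fpc` implementation; the helper name `n_components` is hypothetical.

```r
# Minimal sketch: connected components of the eps-neighbourhood graph (base R only).
data(iris)
d <- dist(iris[, 1:4])
sl <- hclust(d, method = "single")   # single linkage merges at nearest-neighbour distances

# Number of groups left when all merges at height <= eps are applied
# (hypothetical helper, introduced for illustration)
n_components <- function(eps) length(unique(cutree(sl, h = eps)))

for (eps in 3:5) {
  cat("eps =", eps, "-> components:", n_components(eps), "\n")
}

# Heights of the last few merges: the largest gap between groups in the data
print(tail(sl$height, 3))
```

If the last merge height is below the smallest `eps` tried, no value in the grid can split the data, which is consistent with the output above; a smaller `eps`, such as the value 1 used for `ds1`, is needed to obtain more than one cluster.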
> ds5=dbscan(iris[,1:4],eps=3,MinPts=2)
> ds6=dbscan(iris[,1:4],eps=4,MinPts=5)
> ds7=dbscan(iris[,1:4],eps=5,MinPts=9)
> par(mfcol=c(1,3))
> plot(ds5,iris[,1:4],main="1: MinPts=2 eps=3")
> plot(ds6,iris[,1:4],main="3: MinPts=5 eps=4")
> plot(ds7,iris[,1:4],main="2: MinPts=9 eps=5")
[Figure: DBSCAN scatterplot matrices for ds5, ds6 and ds7; the surviving panel title is "2: MinPts=9 eps=5"]
# EM (expectation-maximization) clustering
> ("mclust")
> library(mclust)
> fit_EM=Mclust(iris[,1:4])
fitting ...
|===========================================================================|100%
> summary(fit_EM)
Gaussian finite mixture model fitted by EM algorithm
Mclust VEV (ellipsoidal, equal shape) model with 2 components:
n df BIC ICL
150 26
Clustering table:
  1   2
 50 100
> summary(fit_EM,parameters=TRUE)
Gaussian finite mixture model fitted by EM algorithm
Mclust VEV (ellipsoidal, equal shape) model with 2 components:
n df BIC ICL
150 26
Clustering table:
  1   2
 50 100
Mixing probabilities:
1 2
Means:
[,1] [,2]
Variances:
[,,1]
0. 0.
0. 0.
[,,2]
0. 0.
0.
0. 0.
0.
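The `df` value of 26 reported in the summary can be checked by counting the free parameters of a two-component VEV mixture in four dimensions. The sketch below is a worked count under the usual volume/shape/orientation reading of the model name (V = varying volume, E = equal shape, V = varying orientation); all variable names are hypothetical. It also computes the penalty term of the BIC in the form mclust reports it, BIC = 2 logLik - df log(n), where larger values are better.

```r
# Minimal sketch: parameter count for a VEV Gaussian mixture (base R only).
G <- 2      # number of mixture components, from the summary
p <- 4      # number of variables (iris[, 1:4])
n <- 150    # number of observations

n_mixing <- G - 1                  # mixing proportions sum to 1
n_means  <- G * p                  # one mean vector per component
n_volume <- G                      # V: a separate volume per component
n_shape  <- p - 1                  # E: one shared shape, normalised, so p - 1 free values
n_orient <- G * p * (p - 1) / 2    # V: a separate orientation per component

n_par <- n_mixing + n_means + n_volume + n_shape + n_orient
print(n_par)

# Penalty term subtracted from twice the log-likelihood in the BIC
bic_penalty <- n_par * log(n)
print(bic_penalty)
```

The count adds up to 1 + 8 + 2 + 3 + 12 = 26, in agreement with the summary output.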
> plot(fit_EM) # plot the EM clustering results
Model-based clustering plots:
1: BIC
2: classification
3: uncertainty
4: density
Selection: (the options are shown below)
# Option 1
[Figure: BIC versus number of components]
# Option 2
[Figure: classification scatterplot matrix; recoverable panel labels Sepal.Length, Petal.Length, Petal.Width]
# Option 3
# Option 4
Selection: 0
> iris_BIC=mclustBIC(iris[,1:4])
fitting ...
|===========================================================================|100%
> iris_BICsum=summary(iris_BIC,data=iris[,1:4])
> iris_BICsum # BIC values of the iris data set for each model and number of clusters
Best BIC values:
         VEV,2 VEV,3 VVV,2
BIC
BIC diff
Classification table for model (VEV,2):
  1   2
 50 100
> iris_BIC
Bayesian Information Criterion (BIC):
  EII VII EEI VEI EVI VVI EEE
1 2 3 4 5 6 7 8 9
NA NA NA NA NA
  EVE VEE VVE EEV VEV EVV VVV
1 2 3 4 5 6 7 8 9
Top 3 models based on the BIC criterion:
VEV,2 VEV,3 VVV,2
> par(mfcol=c(1,1))
> plot(iris_BIC,G=1:7,col="yellow")
[Figure: BIC versus number of components]
> mclust2Dplot(iris[,1:2],
+ classification=iris_BICsum$classification,
+ parameters=iris_BICsum$parameters,col="yellow")
[Figure: classification plot; x-axis Sepal.Length]
> iris_Dens=densityMclust(iris[,1:2]) # density estimation for each sample
fitting ...
> iris_Dens
'densityMclust' model object: (VEV,2)
Available components:
 [1] "call" "data" "modelName" "n"
 [5] "d" "G" "BIC" "bic"
 [9] "loglik" "df" "hypvol" "parameters"
[13] "z" "classification" "uncertainty" "density"
> plot(iris_Dens,iris[,1:2],col="yellow",nlevels=55)
Model-based density estimation plots:
# enter 1 or 2
1: BIC
2: density
Selection: (the options are shown below)
# Option 1
[Figure: BIC versus number of components]
# Option 2
[Figure: density contour plot; x-axis Sepal.Length]
Selection: 0
> plot(iris_Dens,type = "persp",col = grey)
Model-based density estimation plots:
1: BIC
2: density
Selection: (the options are shown below)
# Option 1
[Figure: BIC versus number of components, with model-name legend]
# Option 2
Selection: