Data Mining Lab Report


    }
}
3. DBSCAN
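Algorithm idea:
First, identify the Core Points among all points (by counting the points inside each point's ε-neighborhood). Then, among the remaining points, identify the Border Points (points that lie inside the ε-neighborhood of a Core Point).
Next, if two Core Points are connected to each other, they belong to the same Cluster; all Core Points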
            } else { // If the cluster is not empty
                double newCentroidX = 0;
                double newCentroidY = 0;
                for ( int j = 0; j < numberOfPointsInCluster; ++ j ) {
                    Point p = c.points.get(j);
                    newCentroidX += p.x;
                    newCentroidY += p.y;
                }
                newCentroidX /= numberOfPointsInCluster;
                newCentroidY /= numberOfPointsInCluster;
                Cluster newCluster = new Cluster(new Point(newCentroidX, newCentroidY));
                newClusters[i] = newCluster;
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm.
It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and one of the most cited in the scientific literature.
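The "many nearby neighbors" test can be illustrated with a minimal sketch. The class and method names below (CorePointSketch, findCorePoints) are hypothetical and are not taken from the report's repository. Points are stored as {x, y} arrays, and the point itself is counted as one of its own neighbors, which is an assumption of this sketch.

```java
/** Hypothetical helper for illustration; not the report's Dbscan class. */
final class CorePointSketch {
    /** Euclidean distance between two 2-D points stored as {x, y}. */
    static double distance(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    /**
     * Marks point i as a core point when at least minPoints points
     * (the point itself included, an assumption of this sketch)
     * lie within distance eps of it.
     */
    static boolean[] findCorePoints(double[][] points, int minPoints, double eps) {
        boolean[] isCore = new boolean[points.length];
        for (int i = 0; i < points.length; i++) {
            int neighbors = 0;
            for (double[] other : points) {
                if (distance(points[i], other) <= eps) {
                    neighbors++;
                }
            }
            isCore[i] = neighbors >= minPoints;
        }
        return isCore;
    }
}
```

This brute-force scan compares every pair of points, so it takes O(n²) time. That is adequate for small lab data sets, while larger inputs would normally use a spatial index.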
    private Cluster[] getClusters(int k, Point[] points, Cluster[] cluster) {
        for ( int i = 0; i < points.length; ++ i ) {
            Point currentPoint = points[i];
            Cluster c = getClosestClusters(currentPoint, cluster);
            c.points.add(currentPoint);
        }
n − 2 clusters. Repeat the preceding steps until all clusters have been merged.
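As an illustration of the Group Average measure, here is a minimal sketch that assumes it is the mean Euclidean distance over all cross-cluster point pairs (the usual average-linkage definition). The class and method names are hypothetical, and the report's own getProximity may be implemented differently.

```java
/** Hypothetical helper for illustration; not the report's getProximity. */
final class GroupAverageSketch {
    /**
     * Average pairwise Euclidean distance between the points of cluster a
     * and the points of cluster b. Each point is stored as {x, y}.
     */
    static double groupAverage(double[][] a, double[][] b) {
        if (a.length == 0 || b.length == 0) {
            throw new IllegalArgumentException("clusters must not be empty");
        }
        double sum = 0.0;
        for (double[] p : a) {
            for (double[] q : b) {
                sum += Math.hypot(p[0] - q[0], p[1] - q[1]);
            }
        }
        // Divide by the number of cross-cluster pairs.
        return sum / ((double) a.length * b.length);
    }
}
```

For example, with a = {(0, 0), (2, 0)} and b = {(4, 0)}, the two cross-cluster distances are 4 and 2, so the Group Average is 3.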
Program flowchart:
Designed by 谢浩哲
Harbin Institute of Technology
Core code:
public class Agnes {
    public Cluster getCluster(List<Cluster> clusters) {
I. Experiment Content
NOTE: The implementation idea of each algorithm is explained in the next section.
1. K-Means
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
2. AGNES (hierarchical clustering)
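Harbin Institute of Technology
Data Mining Theory and Algorithms Lab Report
(Fall Semester 2015)
Course code: S1300019C
Instructor: 邹兆年
Student name: 谢浩哲
Student ID: 15S103172
School: School of Computer Science and Technology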
NOTE: All code referenced in this report is open-sourced on GitHub: https://github.com/hzxie/Algorithm
        Cluster[] newClusters = new Cluster[k];
        for ( int i = 0; i < k; ++ i ) {
            Cluster c = cluster[i];
            int numberOfPointsInCluster = c.points.size();
            if ( numberOfPointsInCluster == 0 ) { // If the cluster is empty
                int randomIndex = (int)(Math.random() * points.length);
                newClusters[i] = new Cluster(points[randomIndex]);
        while ( clusters.size() > 1 ) {
            double minProximity = Double.MAX_VALUE;
            int minProximityIndex1 = 0,
                minProximityIndex2 = 0;

            for ( int i = 0; i < clusters.size(); ++ i ) {
        Cluster[] clusters = getInitialClusters(k, points);
        Cluster[] newClusters = null;
        do {
            newClusters = getClusters(k, points, clusters);

            if (isClustersTheSame(clusters, newClusters)) {
                break;
            }
            clusters = newClusters;
        } while (true);
        return clusters;
    }
public class Dbscan {
    public List<Cluster> getClusters(List<Point> points, int minPoints, double eps) {
        List<Point> corePoints = getCorePoints(points, minPoints, eps);
        Map<Point, Cluster> clusters = getClustersOfCorePoints(corePoints, eps);
        List<Point> borderPoints = getBorderPoints(points, corePoints, minPoints, eps);
        getClustersOfBorderPoints( // ...
    }
}
        return newClusters;
    }
}
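2. AGNES (hierarchical clustering)
Algorithm idea:
The algorithm uses Group Average as the merge criterion. In the first iteration, the pair with the smallest Group Average among the n points is merged; the merged cluster is added to the list, the two original clusters are removed, and the Group Average between the points of the merged cluster and the other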
AGNES is an agglomerative hierarchical clustering algorithm. It works by repeatedly merging the closest pair of groups, based on the pairwise distances between the data points. After each merge the distances are recalculated, but which distance should be used once groups have formed? Several methods are available, including:
- Single linkage (nearest distance)
- Complete linkage (farthest distance)
- Average linkage (average distance)
- Centroid distance
- Ward's method (the sum of squared Euclidean distances is minimized)
3. DBSCAN
Program flowchart:
Core code:
public class KMeans {
    public Cluster[] getClusters(int k, Point[] points) {
        if ( k <= 0 || k >= points.length ) {
            return null;
        }
                for ( int j = i + 1; j < clusters.size(); ++ j ) {
                    double proximity = getProximity(
                            clusters.get(i), clusters.get(j));

                    if ( proximity < minProximity ) {
are merged into a number of Clusters in this way. Then check every Border Point to find which Core Point's ε-neighborhood contains it, and merge it into the cluster of that Core Point.
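The Border Point step can be sketched as follows. This is a minimal illustration, not the report's getClustersOfBorderPoints. It assumes that core points have already been identified (isCore) and labeled with a cluster id (coreLabels), that points are {x, y} arrays, and that a border point lying within ε of several Core Points simply joins the cluster of the first one found. All names here are hypothetical.

```java
/** Hypothetical helper for illustration; not the report's implementation. */
final class BorderPointSketch {
    /** Label given to points that are neither core nor border points. */
    static final int NOISE = -1;

    /**
     * Returns one cluster label per point. Core points keep their label;
     * every other point takes the label of the first core point found
     * within eps, or NOISE when no core point is that close.
     */
    static int[] assignBorderPoints(double[][] points, boolean[] isCore,
                                    int[] coreLabels, double eps) {
        int[] labels = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            if (isCore[i]) {
                labels[i] = coreLabels[i];
                continue;
            }
            labels[i] = NOISE;
            for (int j = 0; j < points.length; j++) {
                double d = Math.hypot(points[i][0] - points[j][0],
                                      points[i][1] - points[j][1]);
                if (isCore[j] && d <= eps) {
                    labels[i] = coreLabels[j];
                    break;
                }
            }
        }
        return labels;
    }
}
```

Taking the first matching Core Point keeps the sketch short, but it makes the result depend on point order when a border point sits between two clusters. Choosing the nearest Core Point instead would be an equally valid design.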
Program flowchart:
Core code: the following is the core implementation of the algorithm (it covers only the identification of Core Points and the grouping of Core Points into clusters).
            clusters.add(c);
            clusters.remove(minProximityIndex2);
            clusters.remove(minProximityIndex1);
        }
        return clusters.size() == 0 ? null : clusters.get(0);
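II. Experiment Design
1. K-Means
Algorithm idea: Choose k arbitrary points from the point set as the initial centers. Compare each point with the k centers and assign it to the cluster of its nearest center. Once every point has been assigned, recompute the center of each cluster. Repeat this process until the centers no longer change.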
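One iteration of the procedure described above (assignment followed by center update) can be sketched as follows. The names are hypothetical and points are {x, y} arrays. Unlike the report's code, which re-seeds an empty cluster with a random point, this sketch keeps the old center for an empty cluster so that the result is deterministic.

```java
/** Hypothetical helper for illustration; not the report's KMeans class. */
final class KMeansStepSketch {
    /** Assigns each point to the index of its nearest centroid. */
    static int[] assign(double[][] points, double[][] centroids) {
        int[] assignment = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            double best = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = Math.hypot(points[i][0] - centroids[c][0],
                                      points[i][1] - centroids[c][1]);
                if (d < best) {
                    best = d;
                    assignment[i] = c;
                }
            }
        }
        return assignment;
    }

    /** Recomputes each centroid as the mean of the points assigned to it. */
    static double[][] update(double[][] points, int[] assignment,
                             double[][] centroids) {
        int k = centroids.length;
        double[][] sums = new double[k][2];
        int[] counts = new int[k];
        for (int i = 0; i < points.length; i++) {
            int c = assignment[i];
            sums[c][0] += points[i][0];
            sums[c][1] += points[i][1];
            counts[c]++;
        }
        double[][] next = new double[k][];
        for (int c = 0; c < k; c++) {
            // Simplification of this sketch: an empty cluster keeps its old center.
            next[c] = counts[c] == 0
                    ? centroids[c].clone()
                    : new double[] {sums[c][0] / counts[c], sums[c][1] / counts[c]};
        }
        return next;
    }
}
```

Calling assign and then update in a loop until the returned centers stop changing gives the full algorithm.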
                        minProximity = proximity;
                        minProximityIndex1 = i;
                        minProximityIndex2 = j;
                    }
                }
            }
            Cluster c = new Cluster(
                    clusters.get(minProximityIndex1),
                    clusters.get(minProximityIndex2));