Hierarchical Document Clustering using Frequent Itemsets
The following competitors have been considered: UPGMA, bisecting k-means, and the frequent itemset-based algorithm HFTC.
Conclusion
The FIHC approach suggested in the article outperforms its competitors in terms of accuracy, efficiency, and scalability.
A deep hierarchy tree produced by other methods may not be suitable for browsing.
A flat hierarchy reduces the number of navigation steps, which in turn decreases the chance for the user to make mistakes.
FIHC has two main steps:
– Constructing Initial Clusters (for each global frequent itemset, construct an initial cluster containing all the documents that contain that itemset)
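A minimal sketch of this first step, assuming documents are represented as sets of items and the global frequent itemsets have already been mined (the function names and data layout below are illustrative, not the paper's code):

```python
def initial_clusters(docs, global_frequent_itemsets):
    """docs: {doc_id: set of items}; itemsets: iterable of frozensets.
    Each global frequent itemset yields an initial cluster holding every
    document that contains all items of the itemset, so clusters overlap."""
    clusters = {}
    for itemset in global_frequent_itemsets:
        clusters[itemset] = {d for d, items in docs.items() if itemset <= items}
    return clusters

docs = {
    "d1": {"flow", "layer"},
    "d2": {"flow"},
    "d3": {"patient", "treatment"},
}
itemsets = [frozenset({"flow"}), frozenset({"flow", "layer"})]
print(initial_clusters(docs, itemsets))
```

Note that at this stage a document such as d1 belongs to more than one cluster; disjointness is enforced later.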
The topic of a parent cluster is more general than the topic of a child cluster and they are “similar” to a certain degree.
Tree Structure vs Browsing
– number of clusters is unknown
– size of the clusters varies greatly
Suggested approach – Frequent Itemset-based Hierarchical Clustering (FIHC):
– reduced dimensionality
– high clustering accuracy
– number of clusters as an optional input parameter
– easy to browse, with meaningful cluster descriptions
– high dimensionality
– high volume of data
– ease for browsing
– meaningful cluster labels
Problem statement
Problems with some standard clustering techniques:
Introduction
Application of document clustering:
– web mining
– search engines
– information retrieval
– topological analysis
Special requirements for document clustering:
If a hierarchy is too flat, a parent topic may contain too many subtopics, which increases the time and difficulty for the user to locate her target.
Hierarchical Document Clustering using Frequent Itemsets
Benjamin C. M. Fung, Ke Wang, Martin Ester
SDM 2003
Presented by Serhiy Polyakov
DSCI 5240 Fall 2019
Building the Cluster Tree
The set of clusters produced by the previous stage can be viewed as a set of topics and subtopics in the document set. A cluster (topic) tree is constructed based on the similarity among clusters
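The parent-selection step can be sketched as follows. In the paper, candidate parents are scored by inter-cluster similarity; this simplified version (an assumption for brevity) simply attaches each cluster whose label has k items to a cluster whose label is a (k-1)-item subset of it:

```python
def build_tree(labels):
    """labels: list of frozenset cluster labels. Returns {child_label: parent_label}.
    Clusters with 1-item labels hang under the root (None). Illustrative only:
    the paper picks among candidate parents by similarity; here we take the
    first (k-1)-subset label found."""
    parents = {}
    for label in sorted(labels, key=len):
        if len(label) == 1:
            parents[label] = None  # top-level topic under the root
            continue
        candidates = [p for p in labels
                      if len(p) == len(label) - 1 and p <= label]
        parents[label] = candidates[0] if candidates else None
    return parents

labels = [frozenset({"flow"}), frozenset({"flow", "layer"})]
print(build_tree(labels))
```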
– Making Clusters Disjoint (after this step, each document belongs to exactly one cluster)
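A hedged sketch of the disjoint step. The paper assigns each document via a Score function over cluster-frequent items; this illustration uses a simpler proxy (prefer the cluster with the longest label contained in the document), so it shows the mechanics, not the paper's exact scoring:

```python
def make_disjoint(docs, clusters):
    """docs: {doc_id: set of items}; clusters: {frozenset label: set of doc_ids}.
    Returns clusters in which each document appears exactly once.
    Proxy score (assumption, not the paper's Score function): keep the
    document in the candidate cluster with the longest label."""
    disjoint = {label: set() for label in clusters}
    for doc_id in docs:
        containing = [label for label, members in clusters.items()
                      if doc_id in members]
        if containing:
            best = max(containing, key=len)
            disjoint[best].add(doc_id)
    return disjoint
```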
Example of Disjoint Clusters

Cluster label           Documents in cluster
C(flow)                 cran.1, cran.2, cran.3, cran.4, cran.5
C(flow)                 cisi.1
C(flow, layer)          med.5
C(patient, treatment)   med.1, med.2, med.3, med.4, med.6

Algorithm

FIHC preprocessing steps:
– stop word removal
– stemming on the document set
– each document is represented by a vector of frequencies of the remaining items within the document
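The preprocessing steps above can be sketched as follows; the stop-word list and the suffix-stripping stemmer are toy stand-ins (a real pipeline would use a full stop-word list and a proper stemmer such as Porter's):

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "is"}  # tiny illustrative list

def naive_stem(word):
    # placeholder for a real stemmer (e.g., the Porter stemmer)
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Returns the document's frequency vector of remaining items."""
    tokens = [t.lower() for t in text.split()]
    return Counter(naive_stem(t) for t in tokens if t not in STOP_WORDS)

print(preprocess("the flow of boundary layers"))
```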
The cluster label is a set of mandatory items for the cluster, in that every document in the cluster must contain all the items in the cluster label.
A balance between depth and width of the tree is essential for browsing
Evaluation
Evaluation has been performed in terms of the F-measure.
The following aspects have been evaluated: sensitivity to parameters, efficiency, and scalability.
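The overall F-measure commonly used for evaluating a clustering against known classes can be computed as below (per-class best F score, weighted by class size); this is the standard definition, given here as a reference sketch:

```python
def f_measure(classes, clusters):
    """classes, clusters: dicts mapping label -> set of doc ids.
    For each natural class, take the best F score over all clusters,
    then average weighted by class size."""
    n = sum(len(c) for c in classes.values())
    total = 0.0
    for cls in classes.values():
        best = 0.0
        for clu in clusters.values():
            overlap = len(cls & clu)
            if overlap == 0:
                continue
            precision = overlap / len(clu)
            recall = overlap / len(cls)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += len(cls) * best
    return total / n
```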