Mining Quantitative Association Rules by Interval Clustering


[Review] Glossary of Data Mining Terms


Chapter 1. 1. Data Mining: the non-trivial process of extracting valid, novel, potentially useful, and ultimately understandable patterns from large amounts of data stored in databases, data warehouses, or other information repositories.

2. Artificial Intelligence (AI): a new technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence.

AI is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines capable of responding in ways similar to human intelligence.

3. Machine Learning: the study of how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continually improve their own performance.

4. Knowledge Engineering: the application of the principles and methods of artificial intelligence to provide means of solving application problems that require expert knowledge.

5. Information Retrieval: the process and techniques of organizing information in a certain way and finding relevant information according to the needs of information users.

6. Data Visualization: the study of the visual representation of data, where a visual representation of data is defined as information abstracted in some schematic form, including the attributes and variables of the corresponding information units.

7. Online Transaction Processing (OLTP) systems collect and process transaction-related data in real time and share changes in the state of databases and other files.

In online transaction processing, transactions are executed immediately; this is in contrast to batch processing, where a batch of transactions is stored for a period of time and then executed.

8. Online Analytical Processing (OLAP): a class of software technology that enables analysts, managers, or executives to gain deeper insight into data through fast, consistent, interactive access to information from multiple perspectives.

9. Decision Support Systems (DSS): computer application systems that assist decision makers in making semi-structured or unstructured decisions through data, models, and knowledge, in an interactive human-computer manner.

They provide decision makers with an environment for analyzing problems, building models, and simulating decision processes and alternatives, and they invoke various information resources and analysis tools to help decision makers improve the level and quality of their decisions.

chapter07FPAdvanced


Pattern Mining in Multi-Level, Multi-Dimensional Space
Mining Multi-Level Association
Mining Multi-Dimensional Association
Mining Quantitative Association Rules
Mining Rare Patterns and Negative Patterns
Constraint-Based Frequent Pattern Mining
Mining High-Dimensional Data and Colossal Patterns
Mining Compressed or Approximate Patterns
Pattern Exploration and Application
Summary
Mining Multiple-Level Association Rules

Quantitative attributes: numeric values are replaced by ranges, discretized prior to mining using a concept hierarchy.

In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans.
[Lattice of predicate sets over (age, income, buys): (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys)]
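To make the lattice of predicate sets concrete, here is a minimal Python sketch (my own illustration, not from the slides; the table contents are hypothetical) that counts the support of every non-empty predicate set over the attributes age, income, and buys in one pass over a small relational table:

from itertools import combinations
from collections import Counter

# Hypothetical relational table: one tuple per customer.
rows = [
    {"age": "20..29", "income": "42..48K", "buys": "PC"},
    {"age": "30..39", "income": "42..48K", "buys": "PC"},
    {"age": "30..39", "income": "24..30K", "buys": "printer"},
]

support = Counter()
for row in rows:
    items = [(a, row[a]) for a in ("age", "income", "buys")]
    # every non-empty node of the predicate-set lattice
    for k in range(1, len(items) + 1):
        for predset in combinations(items, k):
            support[predset] += 1

for predset, count in support.items():
    print(predset, count / len(rows))

In a relational database the same counts would be gathered level by level, which is where the k or k+1 table scans mentioned above come from; in memory a single pass suffices for this toy example.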
Quantitative Association Rules Based on Statistical Inference Theory [Aumann and Lindell@DMKD’03]

Data Mining Lab 1: Association Rule Mining


Association Rule Mining. (1) Assignment requirements. Data Description: The marketing department of a financial firm keeps records on customers, including demographic information and the number and type of accounts. When launching a new product, such as a "Personal Equity Plan" (PEP), a direct mail piece advertising the product is sent to existing customers, and a record is kept as to whether that customer responded and bought the product. Based on this store of prior experience, the managers decide to use data mining techniques to build customer profile models. In this particular problem we are interested only in deriving (quantitative) association rules from the data (in a future assignment we will consider the use of classification). Your goal: perform Association Rule discovery on the data set. The experimental data is in the file bank-data.txt. (2) Implementation approach: a financial firm has launched a new product; this assignment provides records for 600 customers, and data mining is performed on the customers' various attributes.

Chapter 5: Association Rule Mining


Any subset of a frequent itemset must be frequent
Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support. A subset of a frequent itemset must also be a frequent itemset.
Find all the rules X ∧ Y ⇒ Z with minimum confidence and support:
support, s: probability that a transaction contains {X, Y, Z}
[Figure: customer buys beer vs. customer buys diaper]
Use the frequent itemsets to generate association rules.
The Apriori Algorithm
Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
… [0.2%, 60%]
age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
Single-dimensional vs. multi-dimensional associations (see examples above)
Single-level vs. multiple-level analysis: what brands of beers are associated with what brands of diapers?
Various extensions: correlation, causality analysis
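As a concrete illustration of the join and prune steps, the following is a minimal sketch in Python (my own code, not from the slides); itemsets are represented as sorted tuples:

def apriori_gen(Lk_minus_1):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets."""
    Lprev = set(Lk_minus_1)
    candidates = set()
    # Join step: merge two (k-1)-itemsets that agree on all but their last item.
    for a in Lprev:
        for b in Lprev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    # Prune step: drop candidates having an infrequent (k-1)-subset.
    pruned = set()
    for c in candidates:
        subsets = [c[:i] + c[i + 1:] for i in range(len(c))]
        if all(s in Lprev for s in subsets):
            pruned.add(c)
    return pruned

# Example: frequent 2-itemsets -> candidate 3-itemsets
L2 = {("beer", "diaper"), ("beer", "nuts"), ("diaper", "nuts"), ("diaper", "milk")}
print(apriori_gen(L2))   # {('beer', 'diaper', 'nuts')}

The candidate ("diaper", "milk", "nuts") is produced by the join but removed by the prune step because ("milk", "nuts") is not frequent.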

2-frequent-pattern--d

Add a "level passage threshold": if a parent node is not frequent but satisfies the passing-down condition, its child nodes still need to be examined.
The "level passage threshold" is usually set to a value between the minimum support of the current level and the minimum support of the next level.
Removing redundant multi-level rules
For example:
Desktop Computer → b/w printer (support = 0.08, confidence = 0.7)
IBM Desktop Computer → b/w printer (support = 0.02, confidence = 0.72) (redundant)
A rule is said to be redundant if its support and confidence are close to their expected values. The expected value is determined by the rule's ancestor rule and the proportion that the child item makes up of the parent item. For example, if in the rules above IBM Desktop Computer accounts for 0.25 of Desktop Computer, then the "expected" support is 0.08 × 0.25 = 0.02.
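The redundancy test just described can be sketched as follows (my own illustration; the 25% tolerance used to decide when a support is "close to" its expected value is an assumption):

def is_redundant(rule_support, ancestor_support, share_of_parent, tol=0.25):
    """A multi-level rule is redundant if its support is close to the expected
    support derived from its ancestor rule: expected = ancestor_support * share_of_parent."""
    expected = ancestor_support * share_of_parent
    return abs(rule_support - expected) <= tol * expected

# "IBM Desktop Computer -> b/w printer" vs. its ancestor
# "Desktop Computer -> b/w printer" (support 0.08), where IBM desktops
# make up 0.25 of all desktops: expected support = 0.08 * 0.25 = 0.02.
print(is_redundant(rule_support=0.02, ancestor_support=0.08, share_of_parent=0.25))  # True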
Mining Multi-Dimensional Association
Single-dimensional rules:
… into "bins". These "bins" can be further combined during the mining process (dynamic discretization).
3. Distance-based association rules: numeric attributes are discretized so as to capture the semantics of the data. The dynamic discretization process takes the distance between data points into account, hence the name distance-based AR.
Mining Quantitative Associations
Numeric attributes are discretized before mining into categorical attributes represented by interval ranges; if needed, the categorical attributes can further be replaced by more general higher-level concepts.
Multi-level mining algorithms
❖ Top-down, level-by-level mining
Using a uniform minsup. Advantage: simple. Disadvantages: if minsup is too large, many rules at the lower levels are lost; if minsup is too small, many useless rules are generated at the higher levels.

Product Lifecycle Value Measurement and Decision-Making


Product Lifecycle Value Measurement and Decision-Making. Abstract: Measuring the value of a product over its entire life cycle provides important decision support for problems such as product portfolio management and pricing.

Product lifecycle value includes long-term value, cross-selling value, image value, goodwill, and so on.

By fitting a piecewise curve to a product's accounting value, its long-term value can be predicted; by mining association rules from the product's historical sales data, its cross-selling value coefficient can be computed.

Products can then be classified along the two dimensions of long-term value and cross-selling value, which guides product-related decisions.

Keywords: product life cycle; long-term value; cross-selling; association rules; data mining. 1. Introduction. Competition between enterprises is, in the end, competition between products and services.

Often an enterprise does not lack products; rather, it lacks flagship products and a competitive product portfolio, and it lacks a clear and well-developed product strategy and tactics.

When an enterprise operates several product lines at the same time, resources cannot be allocated evenly; they must be tilted toward certain lines. High-value products, high value-added products, high-potential products, and "black hole" products need to be analyzed carefully and treated differently.

This paper analyzes the composition of product value from the perspective of the product life cycle and proposes a measurement model for product lifecycle value (PLV, Product Life Cycle Value); with this model, an enterprise's products can be classified so as to guide product-related decisions.

2. Product Lifecycle Value Evaluation Model. Here the product life cycle follows the marketing concept: the entire cycle of a product from launch → growth → maturity → decline → withdrawal from the market.

The product life cycle curve is shown in Figure 1.

During its sales process, a product also generates various derived values, such as cross-selling value, goodwill, and image value.

The product lifecycle value can be expressed by Equation (2): PLV = AV + CSV + CV + IV (2), where AV is the accounting value, CSV the cross-selling value, CV the goodwill, and IV the image value.

CV and IV are intangible values of the product and are difficult to quantify.

The focus of this paper is to provide a measurement model for computing a product's accounting value and cross-selling value; both can be quantified from the enterprise's historical sales data, and on this basis the product's lifecycle value is evaluated, giving the enterprise a basis for product decisions.

3. Measuring a Product's Long-Term Value. A product's value consists of realized value and potential value. Realized value is also called current value; potential value is also called long-term value, i.e., the remainder of the product's lifetime value after subtracting the net cash value realized before the current moment. A piecewise function can be used to fit the product's long-term value curve.

Association Rule Mining: Background Introduction

– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following anti-monotone property of the support measure: for any itemsets X ⊆ Y, s(X) ≥ s(Y).
– k-itemset: an itemset that contains k items

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
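A minimal sketch (my own code) that computes supports directly on the five example transactions listed above:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"Bread", "Milk"}))           # 0.6  (3 of 5 transactions)
print(support({"Milk", "Diaper", "Beer"}))  # 0.4  (2 of 5 transactions)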
Frequent Itemset Generation
[Itemset lattice over items {A, B, C, D, E}: the null set at the top; 1-itemsets A, B, C, D, E; 2-itemsets AB, AC, AD, AE, BC, BD, BE, CD, CE, DE; and so on down to ABCDE.]
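The lattice above is shown only down to the 2-itemsets; a quick sketch (my own) enumerates the full candidate space over {A, B, C, D, E} and hints at why brute-force enumeration is computationally prohibitive, since d items give 2^d - 1 candidate itemsets:

from itertools import combinations

items = ["A", "B", "C", "D", "E"]
lattice = [c for k in range(1, len(items) + 1) for c in combinations(items, k)]
print(len(lattice))   # 31 = 2**5 - 1 non-empty itemsets
print(2 ** 100 - 1)   # size of the candidate space for d = 100 items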

AssociationRules.ppt


Formal Model
I = {i1, i2, …, im}: set of literals (items)
D: database of transactions
T ∈ D: a transaction, T ⊆ I
TID: unique identifier, associated with each T
Example Association Rule
90% of transactions that purchase bread and butter also purchase milk
Antecedent: bread and butter Consequent: milk Confidence factor: 90%
Discovering Large Itemsets
Apriori and AprioriTid algorithms: Basic intuition: Any subset of a large itemset must be large
Itemset having k items can be generated by joining large itemsets having k-1 items, and deleting those that contain any subset that is not large. Def. k-itemset: large itemset with k items.
Rule X ⇒ Y holds in D with confidence c if c% of the transactions in D that contain X also contain Y.
Rule X ⇒ Y has a support s in D if s% of transactions in D contain X ∪ Y.
Example

Association Rule Mining (1): Basic Concepts


This article is mainly drawn from the book Data Warehousing and Data Mining; its content is very similar to the Wikipedia article, so it appears to have been translated from a foreign-language book.

Association rule mining is one of the most active research areas in data mining. It can be used to discover connections between things, and was originally developed to find relationships between different products in supermarket transaction databases.

Here is a well-known anecdote about Walmart.

Walmart once analyzed in detail more than a year of raw transaction data in its data warehouse and found that the product most often bought together with diapers was, surprisingly, beer.

With the help of the data warehouse and association rules, the hidden fact behind this was discovered: American women often ask their husbands to buy diapers for the children after work, and 30%-40% of the husbands, after buying the diapers, also pick up the beer they like to drink.

Based on this finding, Walmart rearranged its shelves to place diapers and beer together, and sales increased substantially.

Here we borrow an introductory example to present association rule mining [1].

Table 1. Transaction database of a supermarket
TID   Items purchased
T1    bread, cream, milk, tea
T2    bread, cream, milk
T3    cake, milk
T4    milk, tea
T5    bread, cake, milk
T6    bread, tea
T7    beer, milk, tea
T8    bread, tea
T9    bread, cream, milk, tea
T10   bread, milk, tea

Definition 5: An association rule is an implication R: X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.

It states that when itemset X appears in a transaction, Y also appears with some probability.

The association rules a user cares about can be measured by two criteria: support and confidence.

Definition 6: The support of association rule R is the ratio of the number of transactions containing both X and Y to |D|.

That is: support(X ⇒ Y) = count(X ∪ Y) / |D|. Support reflects the probability that X and Y occur together.

The support of an association rule equals the support of the corresponding frequent itemset.

Definition 7: For association rule R, confidence is the ratio of the number of transactions containing both X and Y to the number of transactions containing X.

That is: confidence(X ⇒ Y) = support(X ⇒ Y) / support(X). Confidence reflects the probability that a transaction contains Y given that it contains X.
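As a worked check of these two formulas on the transactions of Table 1 (my own example), consider the rule bread ⇒ milk:

D = {
    "T1": {"bread", "cream", "milk", "tea"}, "T6": {"bread", "tea"},
    "T2": {"bread", "cream", "milk"},        "T7": {"beer", "milk", "tea"},
    "T3": {"cake", "milk"},                  "T8": {"bread", "tea"},
    "T4": {"milk", "tea"},                   "T9": {"bread", "cream", "milk", "tea"},
    "T5": {"bread", "cake", "milk"},         "T10": {"bread", "milk", "tea"},
}

def count(itemset):
    return sum(itemset <= t for t in D.values())

X, Y = {"bread"}, {"milk"}
support = count(X | Y) / len(D)        # 5/10 = 0.5
confidence = count(X | Y) / count(X)   # 5/7  ≈ 0.714
print(support, confidence)

Five of the ten transactions contain both bread and milk, and seven contain bread, giving support 0.5 and confidence about 0.71.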

On the Complexity of Mining Quantitative Association Rules

If a customer buys two, three, or four toothbrushes, then he/she also buys between three and six tubes of toothpaste.
which may be denoted:
Toothbrush: [2..4] ⇒ Toothpaste: [3..6]
Editor: Raymond Ng, Jiawei Han, and Laks Lakshmanan
interesting and important research problem. Recently, different aspects of the problem have been studied, and several algorithms have been presented in the literature, among others in (Srikant and Agrawal, 1996; Fukuda et al., 1996a; Fukuda et al., 1996b; Yoda et al., 1997; Miller and Yang, 1997). An aspect of the problem that has so far been ignored is its computational complexity. In this paper, we study the computational complexity of mining quantitative association rules.
Pleinlaan 2, gebouw G-10, B-1050 Brussel Phone: +32-2-629.3308 • Fax: +32-2-629.3525

What Does an Association Rule Mean


An association rule is a data mining technique used to discover frequent itemsets in a data set and the associations between them.

Association rules can be used to analyze relationships in the data, to understand hidden patterns and trends in the data set, and to extract useful information from it.

Association rules are usually expressed in an "IF-THEN" form; in such a rule, the IF part is called the antecedent and the THEN part is called the consequent.

For example, "IF buys milk THEN buys bread" is a simple association rule.

Minimum support is the minimum probability threshold for an itemset to be considered frequent in the whole data set; it constrains how often frequent itemsets must occur.

Minimum confidence is a reliability measure for association rules; it is a lower bound on the probability that the consequent occurs given that the antecedent occurs.

1. Generate candidate itemsets: candidate itemsets are all itemsets that may turn out to be frequent.

By scanning the data set D, the occurrence frequency of each individual item in D can be determined and the frequent 1-itemsets generated.

Then, candidate k-itemsets are generated by combining frequent (k-1)-itemsets, until no more candidates can be generated.

2. Compute the support of the candidate itemsets: support measures the probability that a candidate itemset appears in the whole data set.

By scanning the data set D, the support of the candidate itemsets is computed, and the frequent itemsets that satisfy the minimum support threshold are retained.

3. Generate association rules and compute confidence: once the frequent itemsets have been generated, association rules can be derived from them and their confidence computed (a code sketch follows below).

The confidence of an association rule is the probability that the consequent occurs given the antecedent.

Only association rules that satisfy the minimum confidence threshold are considered meaningful and reliable.

In summary, association rules reveal relationships in a data set by discovering frequent itemsets and computing support and confidence.

They are a powerful data mining tool that can be used to analyze relationships in data, discover hidden patterns and trends, and extract useful information.
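A minimal sketch (my own code) of step 3, generating rules from a single frequent itemset and filtering them by minimum confidence; the support values are hypothetical:

from itertools import combinations

def gen_rules(freq_itemset, support, minconf):
    """Emit rules X -> Y with X ∪ Y equal to freq_itemset and X ∩ Y empty,
    keeping only rules whose confidence reaches minconf."""
    items = frozenset(freq_itemset)
    rules = []
    for k in range(1, len(items)):
        for antecedent in map(frozenset, combinations(items, k)):
            consequent = items - antecedent
            conf = support[items] / support[antecedent]
            if conf >= minconf:
                rules.append((set(antecedent), set(consequent), conf))
    return rules

# Hypothetical supports for {milk, bread} and its subsets
support = {frozenset({"milk", "bread"}): 0.5,
           frozenset({"milk"}): 0.8,
           frozenset({"bread"}): 0.7}
print(gen_rules({"milk", "bread"}, support, minconf=0.7))
# only bread -> milk survives (confidence 0.5/0.7 ≈ 0.71)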

Integrating Classification and Association Rule Mining

Integrating Classification and Association Rule Mining

Integrating Classification and Association Rule Mining Bing Liu Wynne Hsu Yiming MaDepartment of Information Systems and Computer ScienceNational University of SingaporeLower Kent Ridge Road, Singapore 119260{liub, whsu, mayiming}@.sgAbstractClassification rule mining aims to discover a small set of rules in the database that forms an accurate classifier.Association rule mining finds all the rules existing in the database that satisfy some minimum support and minimum confidence constraints. For association rule mining, the target of discovery is not pre-determined, while for classification rule mining there is one and only one pre-determined target. In this paper, we propose to integrate these two mining techniques. The integration is done by focusing on mining a special subset of association rules, called class association rules (CARs). An efficient algorithm is also given for building a classifier based on the set of discovered CARs. Experimental results show that the classifier built this way is, in general, more accurate than that produced by the state-of-the-art classification system C4.5. In addition, this integration helps to solve a number of problems that exist in the current classification systems.IntroductionClassification rule mining and association rule mining are two important data mining techniques. Classification rule mining aims to discover a small set of rules in the database to form an accurate classifier (e.g., Quinlan 1992; Breiman et al 1984). Association rule mining finds all rules in the database that satisfy some minimum support and minimum confidence constraints (e.g., Agrawal and Srikant 1994). For association rule mining, the target of mining is not pre-determined, while for classification rule mining there is one and only one pre-determined target, i.e., the class. Both classification rule mining and association rule mining are indispensable to practical applications. Thus, great savings and conveniences to the user could result if the two mining techniques can somehow be integrated. In this paper, we propose such an integrated framework, called associative classification. We show that the integration can be done efficiently and without loss of performance, i.e., the accuracy of the resultant classifier.The integration is done by focusing on a special subset of association rules whose right-hand-side are restricted to the classification class attribute. We refer to this subset of Copyright © 1998, American Association for Artificial Intelligence(). All rights reserved.rules as the class association rules (CARs). An existingassociation rule mining algorithm (Agrawal a nd Srikant1994) is adapted to mine all the CARs that satisfy theminimum support and minimum confidence constraints.This adaptation is necessary for two main reasons:1.Unlike a transactional database normally used inassociation rule mining (Agrawal a nd Srikant 1994) thatdoes not have many associations, classification datatends to contain a huge number of associations.Adaptation of the existing association rule miningalgorithm to mine only the CARs is needed so as toreduce the number of rules generated, thus avoidingcombinatorial explosion (see the evaluation section).2.Classification datasets often contain many continuous(or numeric) attributes. Mining of association rules withcontinuous attributes is still a major research issue(Srikant and Agrawal 1996; Yoda et al 1997; Wang,Tay and Liu 1998). 
Our adaptation involves discretizingcontinuous attributes based on the classification pre-determined class target. There are many gooddiscretization algorithms for this purpose (Fayyad andIrani 1993; Dougherty, Kohavi and Sahami 1995). Data mining in the proposed associative classificationframework thus consists of three steps:·discretizing continuous attributes, if any ·generating all the class association rules (CARs), and ·building a classifier based on the generated CARs. This work makes the following contributions:1.It proposes a new way to build accurate classifiers.Experimental results show that classifiers built thisway are, in general, more accurate than thoseproduced by the state-of-the-art classification systemC4.5 (Quinlan 1992).2.It makes association rule mining techniquesapplicable to classification tasks.3.It helps to solve a number of important problemswith the existing classification systems.Let us discuss point 3 in greater detail below:·The framework helps to solve the understandability problem (Clark and Matwin 1993; Pazzani, Mani and Shankle 1997) in classification rule mining. Many rules produced by standard classification systems are difficult to understand because these systems use domain independent biases and heuristics to generate a small set of rules to form a classifier. These biases, Appeared in KDD-98, New York, Aug 27-31, 1998however, may not be in agreement with the existing knowledge of the human user, thus resulting in many generated rules that make no sense to the user, while many understandable rules that exist in the data are left undiscovered. With the new framework, the problem of finding understandable rules is reduced to a post-processing task (since we generate all the rules).Techniques such as those in (Liu and Hsu 1996; Liu, Hsu and Chen 1997) can be employed to help the user identify understandable rules.· A related problem is the discovery of interesting or useful rules. The quest for a small set of rules of the existing classification systems results in many interesting and useful rules not being discovered. For example, in a drug screening application, the biologists are very interested in rules that relate the color of a sample to its final outcome. Unfortunately, the classification system (we used C4.5) just could not find such rules even though such rules do exist as discovered by our system.·In the new framework, the database can reside on disk rather than in the main memory. Standard classification systems need to load the entire database into the main memory (e.g., Quinlan 1992), although some work has been done on the scaling up of classification systems (Mahta, Agrawal and Rissanen 1996).Problem StatementOur proposed framework assumes that the dataset is a normal relational table, which consists of N cases described by l distinct attributes. These N cases have been classified into q known classes. An attribute can be a categorical (or discrete) or a continuous (or numeric) attribute.In this work, we treat all the attributes uniformly. For a categorical attribute, all the possible values are mapped to a set of consecutive positive integers. For a continuous attribute, its value range is discretized into intervals, and the intervals are also mapped to consecutive positive integers. With these mappings, we can treat a data case as a set of (attribute, integer-value) pairs and a class label. We call each (attribute, integer-value) pair an item. 
Discretization of continuous attributes will not be discussed in this paper as there are many existing algorithms in the machine learning literature that can be used (see (Dougherty, Kohavi and Sahami 1995)).Let D be the dataset. Let I be the set of all items in D, and Y be the set of class labels. We say that a data case d ÎD contains XÍI, a subset of items, if XÍd. A class association rule (CAR) is an implication of the form X®y, where XÍI, and yÎY. A rule X®y holds in D with confidence c if c% of cases in D that contain X are labeled with class y. The rule X®y has support s in D if s% of the cases in D contain X and are labeled with class y.Our objectives are (1) to generate the complete set of CARs that satisfy the user-specified minimum support (called minsup) and minimum confidence (called minconf) constraints, and (2) to build a classifier from the CARs.Generating the Complete Set of CARsThe proposed algorithm is called algorithm CBA (Classification Based on Associations). It consists of two parts, a rule generator (called CBA-RG), which is based on algorithm Apriori for finding association rules in (Agrawal and Srikant 1994), and a classifier builder (called CBA-CB). This section discusses CBA-RG. The next section discusses CBA-CB.Basic concepts used in the CBA-RG algorithm The key operation of CBA-RG is to find all ruleitems that have support above minsup. A ruleitem is of the form: <condset, y>where condset is a set of items, yÎY is a class label. The support count of the condset (called condsupCount) is the number of cases in D that contain the condset. The support count of the ruleitem (called rulesupCount) is the number of cases in D that contain the condset and are labeled with class y. Each ruleitem basically represents a rule:condset ®y,whose support is (rulesupCount / |D|) *100%, where |D| is the size of the dataset, and whose confidence is (rulesupCount / condsupCount)*100%.Ruleitems that satisfy minsup are called frequent ruleitems, while the rest are called infrequent ruleitems. For example, the following is a ruleitem:<{(A, 1), (B, 1)}, (class, 1)>,where A and B are attributes. If the support count of the condset {(A, 1), (B, 1)} is 3, the support count of the ruleitem is 2, and the total number of cases in D is 10, then the support of the ruleitem is 20%, and the confidence is 66.7%. If minsup is 10%, then the ruleitem satisfies the minsup criterion. We say it is frequent.For all the ruleitems that have the same condset, the ruleitem with the highest confidence is chosen as the possible rule (PR) representing this set of ruleitems. If there are more than one ruleitem with the same highest confidence, we randomly select one ruleitem. For example, we have two ruleitems that have the same condset:1.<{(A, 1), (B, 1)}, (class: 1)>.2. <{(A, 1), (B, 1)}, (class: 2)>.Assume the support count of the condset is 3. The support count of the first ruleitem is 2, and the second ruleitem is 1. Then, the confidence of ruleitem 1 is 66.7%, while the confidence of ruleitem 2 is 33.3% With these two ruleitems, we only produce one PR (assume |D| = 10): (A, 1), (B, 1) ® (class, 1) [supt = 20%, confd= 66.7%] If the confidence is greater than minconf, we say the rule is accurate. The set of class association rules (CARs) thus consists of all the PRs that are both frequent and accurate. The CBA-RG algorithmThe CBA-RG algorithm generates all the frequent ruleitems by making multiple passes over the data. 
In the first pass, it counts the support of individual ruleitem and determines whether it is frequent. In each subsequent pass, it starts with the seed set of ruleitems found to be frequentin the previous pass. It uses this seed set to generate new possibly frequent ruleitems , called candidate ruleitems .The actual supports for these candidate ruleitems are calculated during the pass over the data. At the end of the pass, it determines which of the candidate ruleitems are actually frequent. From this set of frequent ruleitems , it produces the rules (CARs).Let k-ruleitem denote a ruleitem whose condset has k items. Let F k denote the set of frequent k-ruleitems . Each element of this set is of the following form:<(condset , condsupCount ), (y , rulesupCount )>.Let C k be the set of candidate k-ruleitems . The CBA-RG algorithm is given in Figure 1.1F 1 = {large 1-ruleitems};2CAR 1 = genRules(F 1);3prCAR 1 = pruneRules(CAR 1); 4for (k = 2; F k-1 ¹ Æ; k ++) do 5C k = candidateGen(F k -1);6for each data case d Î D do 7C d = ruleSubset(C k , d );8for each candidate c Î C d do 9c .condsupCount++;10 if d .class = c .class then c .rulesupCount++11end 12end 13F k = {c Î C k | c .rulesupCount ³ minsup };14CAR k = genRules(F k );15prCAR k = pruneRules(CAR k );16 end17 CARs = 7k CAR k ;18prCARs = 7k prCAR k ;Figure 1: The CBA-RG algorithmLine 1-3 represents the first pass of the algorithm. It counts the item and class occurrences to determine the frequent 1-ruleitems (line 1). From this set of 1-ruleitems , a set of CARs (called CAR 1) is generated by genRules (line 2) (see previous subsection). CAR 1 is subjected to a pruning operation (line 3) (which can be optional). Pruning is also done in each subsequent pass to CAR k (line 15). The function pruneRules uses the pessimistic error rate based pruning method in C4.5 (Quinlan 1992). It prunes a rule as follows: If rule r ’s pessimistic error rate is higher than the pessimistic error rate of rule r - (obtained by deleting one condition from the conditions of r ), then rule r is pruned.This pruning can cut down the number of rules generated substantially (see the evaluation section).For each subsequent pass, say pass k , the algorithm performs 4 major operations. First, the frequent ruleitems F k-1 found in the (k -1)th pass are used to generate the candidate ruleitems C k using the condidateGen function (line 5). It then scans the database and updates various support counts of the candidates in C k (line 6-12). After those new frequent ruleitems have been identified to form F k (line 13), the algorithm then produces the rules CAR k using the genRules function (line 14). Finally, rule pruning is performed (line 15) on these rules.The candidateGen function is similar to the function Apriori-gen in algorithm Apriori. The ruleSubset functiontakes a set of candidate ruleitems C k and a data case d to find all the ruleitems in C k whose condsets are supported by d . This and the operations in line 8-10 are also similar to those in algorithm Apriori. The difference is that we need to increment the support counts of the condset and the ruleitem separately whereas in algorithm Apriori only one count is updated. This allows us to compute the confidence of the ruleitem . They are also useful in rule pruning.The final set of class association rules is in CARs (line 17). Those remaining rules after pruning are in prCARs (line 18).Building a ClassifierThis section presents the CBA-CB algorithm for building a classifier using CARs (or prCARs). 
To produce the best classifier out of the whole set of rules would involve evaluating all the possible subsets of it on the training data and selecting the subset with the right rule sequence thatgives the least number of errors. There are 2msuch subsets,where m is the number of rules, which can be more than 10,000, not to mention different rule sequences. This is clearly infeasible. Our proposed algorithm is a heuristic one. However, the classifier it builds performs very well as compared to that built by C4.5. Before presenting the algorithm, let us define a total order on the generated rules.This is used in selecting the rules for our classifier.Definition : Given two rules, r i and r j , r i B r j (also called r iprecedes r j or r i has a higher precedence than r j ) if 1. the confidence of r i is greater than that of r j , or2. their confidences are the same, but the support of r iis greater than that of r j , or3. both the confidences and supports of r i and r j are the same, but r i is generated earlier than r j ;Let R be the set of generated rules (i.e., CARs or pCARs),and D the training data. The basic idea of the algorithm is to choose a set of high precedence rules in R to cover D .Our classifier is of the following format:<r 1, r 2, …, r n , default_class >,where r i Î R , r a B r b if b > a . default_class is the default class. In classifying an unseen case, the first rule that satisfies the case will classify it. If there is no rule that applies to the case, it takes on the default class as in C4.5.A naive version of our algorithm (called M1) for building such a classifier is shown in Figure 2. It has three steps:Step 1 (line 1): Sort the set of generated rules R according to the relation “B ”. This is to ensure that we will choose the highest precedence rules for our classifier.Step 2 (line 2-13): Select rules for the classifier from R following the sorted sequence. For each rule r , we go through D to find those cases covered by r (they satisfy the conditions of r ) (line 5). We mark r if it correctly classifies a case d (line 6). d .id is the unique identification number of d . If r can correctly classify at least one case (i.e., if r is marked), it will be a potential rule in our classifier (line 7-8). Those cases it covers are then removed from D (line 9). A default class is also selected (the majority class in the remaining data),which means that if we stop selecting more rules for our classifier C this class will be the default class of C (line10). We then compute and record the total number oferrors that are made by the current C and the default class (line 11). This is the sum of the number of errors that have been made by all the selected rules in C and the number of errors to be made by the default class in the training data. When there is no rule or no training case left, the rule selection process is completed.Step 3 (line 14-15): Discard those rules in C that do not improve the accuracy of the classifier. The first rule at which there is the least number of errors recorded on D is the cutoff rule. All the rules after this rule can be discarded because they only produce more errors. 
The undiscarded rules and the default class of the last rule inC form our classifier.1R = sort(R);2for each rule rÎR in sequence do3temp = Æ;4for each case dÎD do5if d satisfies the conditions of r then6store d.id in temp and mark r if it correctly classifies d;7if r is marked then8insert r at the end of C;9 delete all the cases with the ids in temp from D; 10selecting a default class for the current C;11compute the total number of errors of C;12end13end14Find the first rule p in C with the lowest total number of errors and drop all the rules after p in C;15Add the default class associated with p to end of C, and return C (our classifier).Figure 2. A naïve algorithm for CBA-CB: M1This algorithm satisfies two main conditions:Condition 1. Each training case is covered by the rule with the highest precedence among the rules that can cover the case. This is so because of the sorting done in line 1. Condition 2. Every rule in C correctly classifies at least one remaining training case when it is chosen. This is so due to line 5-7.This algorithm is simple, but is inefficient especially when the database is not resident in the main memory because it needs to make many passes over the database. Below, we present an improved version of the algorithm (called M2), whereby only slightly more than one pass is made over D. The key point is that instead of making one pass over the remaining data for each rule (in M1), we now find the best rule in R to cover each case. M2 consists of three stages (see (Liu, Hsu and Ma 1998) for more details): Stage 1. For each case d, we find both the highest precedence rule (called cRule) that correctly classifies d, and also the highest precedence rule (called wRule) that wrongly classifies d. If cRule B wRule, the case should be covered by cRule. This satisfies Condition 1and 2 above. We also mark the cRule to indicate that it classifies a case correctly. If wRule B cRule, it is more complex because we cannot decide which rule among the two or some other rule will eventually cover d. In order to decide this, for each d with wRule B cRule, we keep a data structure of the form: <dID, y, cRule, wRule>, where dID is the unique identification number of the case d, y is the class of d. Let A denote the collection of <dID, y, cRule, wRule>’s, U the set of all cRule s, and Q the set of cRule s that have a higher precedence than their corresponding wRule s. This stage of the algorithm is shown in Figure 3.The maxCoverRule function finds the highest precedence rule that covers the case d. Cc(or Cw) is the set of rules having the same (or different) class as d. d.id and d.class represent the identification number and the class of d respectively. For each cRule, we also remember how many cases it covers in each class using the field classCasesCovered of the rule.1Q = Æ; U = Æ; A = Æ;2for each case dÎD do3cRule = maxCoverRule(Cc, d);4wRule = maxCoverRule(Cw, d);5U = UÈ {cRule};6cRule.classCasesCovered[d.class]++;7if cRule B wRule then8Q = Q È {cRule};9mark cRule;10else A = AÈ <d.id, d.class, cRule, wRule>11endFigure 3: CBA-CB: M2 (Stage 1)Stage 2. For each case d that we could not decide which rule should cover it in Stage 1, we go through d again to find all rules that classify it wrongly and have a higher precedence than the corresponding cRule of d (line 5 in Figure 4). That is the reason we say that this method makes only slightly more than one pass over D. 
The details are as follows (Figure 4):1for each entry <dID, y, cRule, wRule> ÎA do2if wRule is marked then3cRule.classCasesCovered[y]--;4wRule.classCasesCovered[y]++;5else wSet = allCoverRules(U, dID.case, cRule);6for each rule wÎwSet do7w.replace = w.replace È {<cRule, dID, y>}; 8w.classCasesCovered[y]++;9end10Q = QÈwSet11end12endFigure 4: CBA-CB: M2 (Stage 2)If wRule is marked (which means it is the cRule of at least one case) (line 2), then it is clear that wRule will cover the case represented by dID. This satisfies the two conditions. The numbers of data cases, that cRule and wRule cover, need to be updated (line 3-4). Line 5 finds all the rules that wrongly classify the dID case and have higher precedences than that of its cRule (note that we only need to use the rules in U). This is done by the function allCoverRules. The rules returned are those rules that may replace cRule to cover the case because they have higher precedences. Weput this information in the replace field of each rule (line 7). Line 8 increments the count of w.classCasesCovered[y] to indicate that rule w may cover the case. Q contains all the rules to be used to build our classifier.Stage 3. Choose the final set of rules to form our classifier (Figure 5). It has two steps:Step 1 (line 1-17): Choose the set of potential rules to form the classifier. We first sort Q according to the relation “B”. This ensures that Condition1 above is satisfied in the final rule selection. Line 1 and 2 are initializations.The compClassDistr function counts the number of training cases in each class (line 1) in the initial training data. ruleErrors records the number of errors made so far by the selected rules on the training data.In line 5, if rule r no longer correctly classifies any training case, we discard it. Otherwise, r will be a rule in our classifier. This satisfies Condition2. In line 6, r will try to replace all the rules in r.replace because r precedes them. However, if the dID case has already been covered by a previous rule, then the current r will not replace rul to cover the case. Otherwise, r will replace rul to cover the case, and the classCasesCovered fields of r and rul are updated accordingly (line 7-9).For each selected rule, we update ruleErrors and classDistr (line 10-11). We also choose a default class(i.e., defaultClass), which is the majority class in theremaining training data, computed using classDistr (line12). 
After the default class is chosen, we also know thenumber of errors (called defaultError) that the default class will make in the remaining training data (line 13).The total number of errors (denoted by totalErrors) that the selected rules in C and the default class will make is ruleErrors + defaultErrors (line 14).Step 2 (line 18-20): Discard those rules that introduce more errors, and return the final classifier C (this is the same as in M1).1classDistr = compClassDistri(D);2ruleErrors = 0;3Q = sort(Q);4for each rule r in Q in sequence do5if r.classCasesCovered[r.class] ¹ 0 then6for each entry <rul, dID, y> in r.replace do7if the dID case has been covered by aprevious r then8r.classCasesCovered[y]--;9else rul.classCasesCovered[y]--;10ruleErrors = ruleErrors + errorsOfRule(r);11classDistr = update(r, classDistr);12defaultClass = selectDefault(classDistr);13defaultErrors = defErr(defaultClass, classDistr); 14totalErrors = ruleErrors + defaultErrors;15Insert <r, default-class, totalErrors> at end of C 16end17end18Find the first rule p in C with the lowest totalErrors, and then discard all the rules after p from C;19Add the default class associated with p to end of C;20Return C without totalErrors and default-class;Figure 5: CBA-CB: M2 (Stage 3)Empirical EvaluationWe now compare the classifiers produced by algorithm CBA with those produced by C4.5 (tree and rule) (Release 8). We used 26 datasets from UCI ML Repository (Merz and Murphy 1996) for the purpose. The execution time performances of CBA-RG and CBA-CB are also shown.In our experiments, minconf is set to 50%. For minsup, it is more complex. minsup has a strong effect on the quality of the classifier produced. If minsup is set too high, those possible rules that cannot satisfy minsup but with high confidences will not be included, and also the CARs may fail to cover all the training cases. Thus, the accuracy of the classifier suffers. From our experiments, we observe that once minsup is lowered to 1-2%, the classifier built is more accurate than that built by C4.5. In the experiments reported below, we set minsup to 1%. We also set a limit of 80,000 on the total number of candidate rules in memory (including both the CARs and those dropped-off rules that either do not satisfy minsup or minconf). 16 (marked with a * in Table 1) of the 26 datasets reported below cannot be completed within this limit. This shows that classification data often contains a huge number of associations.Discretization of continuous attributes is done using the Entropy method in (Fayyad and Irani 1993). The code is taken from MLC++ machine learning library (Kohavi et al 1994). In the experiments, all C4.5 parameters had their default values. All the error rates on each dataset are obtained from 10-fold cross-validations. The experimental results are shown in Table 1. The execution times here are based on datasets that reside in the main memory. Column 1: It lists the names of the 26 datasets. See (Liu, Hsu and Ma 1998) for the description of the datasets. Column 2: It shows C4.5rules’ mean error rates over ten complete 10-fold cross-validations using the original datasets (i.e., without discretization). We do not show C4.5 tree’s detailed results because its average error rate over the 26 datasets is higher (17.3).Column 3: It shows C4.5rules’ mean error rate after discretization. 
The error rates of C4.5 tree are not used here as its average error rate (17.6) is higher.Column 4: It gives the mean error rates of the classifiers built using our algorithm with minsup = 1% over the ten cross-validations, using both CARs and infrequent rules (dropped off rules that satisfy minconf). We use infrequent rules because we want to see whether they affect the classification accuracy. The first value is the error rate of the classifier built with rules that are not subjected to pruning in rule generation, and the second value is the error rate of the classifier built with rules that are subjected to pruning in rule generation. Column 5: It shows the error rates using only CARs in our classifier construction without or with rule pruning (i.e., prCARs) in rule generation.It is clear from these 26 datasets that CBA produces more accurate classifiers. On average, the error rate decreases from 16.7% for C4.5rules (without discretization) to 15.6-15.8% for CBA. Furthermore, our system is superior to。
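The total precedence order on rules used by CBA-CB (higher confidence first, then higher support, then earlier generation) can be expressed as a simple sort key; this is my own sketch, not the authors' code:

# Each rule is a tuple (confidence, support, generation_index, rule_body).
rules = [
    (0.90, 0.05, 2, "A=1, B=1 -> class 1"),
    (0.90, 0.08, 3, "A=1      -> class 2"),
    (0.95, 0.02, 1, "B=1, C=0 -> class 1"),
]

# Sort so that r_i precedes r_j exactly under the paper's definition:
# greater confidence, then greater support, then earlier generation.
ranked = sorted(rules, key=lambda r: (-r[0], -r[1], r[2]))
for conf, sup, idx, body in ranked:
    print(f"{body}   conf={conf} sup={sup} generated at {idx}")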

Data Mining Tutorial (Zhi-Hua Zhou)


Two-step process of association rule mining
• step 1: find all frequent itemsets
• step 2: generate strong association rules from the frequent itemsets

step 2 is easier; therefore, most work on association rule mining focuses on step 1
Introduction to Data Mining
Chapter 4 Association Spring 2011
Zhi-Hua Zhou Department of Computer Science & Technology Nanjing University
What is an association rule?
an association rule is of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅; the rule holds in DB with support(A ⇒ B) = P(A ∪ B) and confidence(A ⇒ B) = P(B | A)
example: buys(X, "diapers") ⇒ buys(X, "beers") [0.5%, 60%]; major(X, "CS") ∧ takes(X, "DB") ⇒ grade(X, "A") [1%, 75%]
Formal definition of association rule
for (k = 1; Lk ≠ ∅; k++) {
    Ck+1 = apriori_gen(Lk, min_sup);   // generate candidate frequent (k+1)-itemsets
    for each transaction t ∈ DB {      // scan DB for counting
        Ct = subset(Ck+1, t);          // get the subsets of t that are candidates
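A runnable Python sketch that completes the level-wise loop shown in the fragment above (my own completion; apriori_gen and the subset test are re-implemented naively, and min_sup is an absolute count):

from itertools import combinations

def apriori(DB, min_sup):
    """DB: list of transactions (item sets); min_sup: absolute support count."""
    items = {i for t in DB for i in t}
    # L1: frequent 1-itemsets
    L = [{frozenset([i]) for i in items
          if sum(i in t for t in DB) >= min_sup}]
    k = 1
    while L[-1]:
        prev = L[-1]
        # apriori_gen: join Lk with itself, then prune candidates
        # that have an infrequent k-subset
        Ck1 = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in prev for s in combinations(c, k))}
        # scan DB for counting
        counts = {c: sum(c <= t for t in DB) for c in Ck1}
        L.append({c for c, n in counts.items() if n >= min_sup})
        k += 1
    return set().union(*L)

DB = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(DB, min_sup=3))   # all 1- and 2-itemsets; {a,b,c} fails min_sup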

Mining Association Rules with Weighted Items

TID   Bar codes
1     1 2 4 5
2     1 4 5
3     2 4 5
4     1 2 4 5
5     1 3 5
6     2 4 5
7     2 3 4 5
2 Weighted Association Rules
Similar to [1] and [5], we consider a database of transactions D, and a set of items I = {i1, i2, …, in}. Each transaction is a subset of I, and is assigned a transaction identifier ⟨TID⟩.
Definition 2: The support of the association rule X ⇒ Y is the probability that X ∪ Y exists in a transaction in the database D. Definition 3: The confidence of the association rule X ⇒ Y is the probability that Y exists given that a transaction contains X.
1 Introduction
In this paper, we introduce the notion of weighted items to represent the importance of individual items. In a retailing business, a marketing manager may want to mine the association rules with more emphasis on some particular products in mind, and less emphasis on other products. For example, some products may be under promotion and hence are more interesting, or some products are more profitable and hence rules concerning them are of greater value. This results in a generalized version of the association rule mining problem, which we call weighted association rule mining. For example, if the profit of the sofa is much higher than that of the bed, then the rule Buys Pillow ⇒ Buys Sofa is more interesting than Buys Pillow ⇒ Buys Bed. When we compute the weighted support of a rule, we can consider both the support and the importance ratio (weight) factors. A simple attempt to solve the problem is to eliminate the entries of items with small weights. However, a rule for a heavily weighted item may also consist of low-weighted items. For example, we may be promoting a product A, and find that it is affected by another product B, for which we initially have no interest. Hence the simple approach does not work in this case. Another approach is adopting the existing fast algorithms for finding binary association rules, such as the Apriori Gen algorithm [1]. Such algorithms depend on the downward closure property, which states that subsets of a large itemset are also large. However, this is not true for the weighted case in our definition, and the Apriori algorithm cannot be applied. In this paper, we propose new algorithms to mine weighted binary association rules. Two algorithms, MINWAL(O) and MINWAL(W), are designed for this purpose. Experimental results show…
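To illustrate the idea of weighted support on the bar-code transactions above, here is a small sketch; the item weights and the exact formula (sum of the item weights times the ordinary support) are assumptions for illustration, not necessarily the paper's definitions:

transactions = [{1, 2, 4, 5}, {1, 4, 5}, {2, 4, 5}, {1, 2, 4, 5},
                {1, 3, 5}, {2, 4, 5}, {2, 3, 4, 5}]          # the 7 TIDs above
weights = {1: 0.2, 2: 0.9, 3: 0.1, 4: 0.4, 5: 0.3}           # hypothetical item weights

def weighted_support(itemset):
    sup = sum(itemset <= t for t in transactions) / len(transactions)
    return sum(weights[i] for i in itemset) * sup

print(weighted_support({2, 4}))   # heavy items, present in 5 of 7 transactions
print(weighted_support({1, 3}))   # light items: low weighted support even when present

Note that the downward closure property no longer holds once weights enter the measure, which is exactly why the paper proposes new algorithms instead of reusing Apriori.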

AssociationRuleMining: Mining Association Rules

• Apriori pruning principle: If there is any pattern which is infrequent, its superset should not be generated/tested!
• Method (level-wise search):
– Initially, scan DB once to get the frequent 1-itemsets
– For each level k: generate length (k+1) candidates from length-k frequent patterns; scan DB and remove the infrequent candidates
– Terminate when no candidate set can be generated
Iyad Batal
Apriori
• The Apriori property: – Any subset of a frequent pattern must be frequent. – If {beer, chips, nuts} is frequent, so is {beer, chips}, i.e., every transaction having {beer, chips, nuts} also contains {beer, chips}.
Association Rules
• A Frequent pattern is a pattern (a set of items, subsequences, subgraphs, etc.) that occurs frequently in a data set.

Journal of Computational Information Systems4:2(2008) 609-616Available at Mining Quantitative Association Rules by Interval ClusteringYunfei YIN1†, Zhi ZHONG2, Yingxun WANG11 Department of Automatic Control, Beijing University of Aeronautics and Astronautics, Beijing 100083, China2 Department of Mathematics and Computer Science, Guangxi Teacher’s College, Nanning 530001, ChinaAbstractFor complex information processing, three novel algorithms related with quantitative association rules are proposed, which are value-interval clustering mining, interval-interval clustering mining and matrix-interval clustering mining. By comparison with the Apriori association rule mining, the interval approach has more practical interesting, especially to the inaccurate information. The new approach is a valuable attempt for solving the complex information processing issue, and provides many possible techniques to improve it. After introducing the concepts of value-interval clustering, interval-interval clustering, and matrix-interval clustering, the general interval modeling approach to data mining is stated with some examples added. Finally, some classical datasets are tested and the experimental results show the feasibility of the new approach.Keywords: Interval Clustering; Quantitative Association Rule; Data Mining1.IntroductionSince Agrawal et al. (1993) proposed the problem of mining association rules [1], it has been a fairly active branch in data mining. The motivation of association rule mining is to find how the items bought in a consumer basket related to each other. Given a set of transactions, where each transaction is a set of items, an association rule is an expression of the form X → Y, where X and Y are sets of items. An example of an association rule is: "30% of transactions that contain beer also contain diapers; 2% of all transactions contain both of these items". Here 30% is called the confidence of the rule, and 2% the support of the rule. The problem is to find all association rules that satisfy user-specified minimum support and minimum confidence constraints.The motivation of our interval approach to data mining is to find the interval relationships between items,e.g., 50 percent of teachers in college aged 40 to 65 and earned 8,000 to 100,000 dollars each year own 2 to4 cars. Because the half-baked and inaccurate information always exists in real-world, which makes many similar cases can be seen in our surroundings.Related work. For the traditional data mining, there are many extensions to enhance its efficiency and find more useful patterns, such as references [2-8]. A fast algorithm for mining Boolean association rules† Corresponding author.Email addresses: yinyunfei@ (Yunfei YIN), zhong8662@ (Zhi ZHONG).1553-9105/ Copyright © 2008 Binary Information PressFebruary, 2008610 Y.Yin et al. /Journal of Computational Information Systems 4:2 (2008) 609-616was proposed by Agrawal et al. which can be used to solve commodities arrangement in supermarket. Savasere et al. (1998, [9]) also discussed how to find strong negative association rules in the context of taxonomies of items, which is focused on finding rules which negatively correlate with rules.In order to find more valuable association rules, Lent et al. have mined out clustering association rules by clustering way (Miller&Yang 1997, [10]). However, because of the half-baked information and the ambiguity between different objects always exist, which make it possible for using some intervals to represent an attribute. 
Under such context, we propose the interval approach to handle the inaccurate information. Furthermore, a modeling method is introduced and formalized, which is suitable for mining inaccurate database information.The rest of the paper is organized as follows: in the following section, we will introduce some concepts and problems of quantitative association rules mining, and then three interval clustering methods are described in detail. In section 3, two interval mining ways are offered to handle normal database and interval database respectively. In section 4, some experimental results about some classical datasets are provided. A brief conclusion about the research will be given in section 5.2. Interval Clustering MethodsInterval theory gets its origin from computational mathematics, and has a widely use in many fields such as control engineering, electronic commerce (Hu&Xu&Yang 2003, [11]). The interval-based data mining is also a very important application about interval theory. In this section we propose three methods about interval data processing.2.1. Value-interval Clustering MethodSuppose n x x x ,...,,21 is n objects, whose actions are characterized by some interval values. According to the traditional clustering similarity formula, we can get correlative similarity matrix:⎥⎥⎥⎥⎦⎤⎢⎢⎢⎢⎣⎡=+−+−+−1...],[],[.........1],[1)(22112121,n n n n j i t t t t t t r R , Where matrix R is a symmetry matrix, and ],[+−=ij ij ij t t r is the similarity between i x and j x , i, j = 1, 2, …,n.We propose a useful value-interval clustering model as follows:(1) Netting: in R, if 0λ>−ij t (the threshold 0λ is offered by domain experts or the users), the element],[+−ij ij t t is replaced by “×”; if 0λ<+ij t , the element ],[+−ij ij t t is replaced by space; if ],[0+−∈ij ij t t λ theelement is replaced by “#”.Call “×” as a node, and “#” as a similar node. We firstly drag a longitude and latitude from the node toY.Yin et al. /Journal of Computational Information Systems 4:2 (2008) 609-616 611diagonal, and use dash lines to draw longitude and latitude from the similar node to diagonal as described in figure 1.(2) Relatively certain classification: For each node denoted by ×, find the elements in the diagonal which is related with the node. And then we can classify these elements into a set, which is called a relatively certain classification(3) Similar fuzzy classification: For each similar node denoted by #, find also the elements in the diagonal which related with the similar node. And then we can classify them into a set, which is called a similar fuzzy classification.Note: relatively certain classification can clearly classify the objects, while similar fuzzy classification cannot clearly classify the objects. For example in figure 1, the objects set U = {1, 2, 3, 4, 5} can be classified into two sets: A = {1, 2, [5]}, B = {3, 4, [5]}. However, the object “5” is not decided to belong to accurately.Definition 1(Similar Degree). Suppose A = {][,,...,21x x x x n }, similar degree α satisfies the following two conditions:(1) ],[+−i i t t is the similar coefficient of x and i x ; (2) },...,,min{022021101−++−++−++−−−−−−=nn n t t t t t t t t t λλλα So, if x similarly belongs to s A A A ,...,,21 at the same time, and the similar degree are s ααα,...,,21 respectively. Take },...,,max{21s j αααα=, if j α5.0≥, we believe that x should belong to set j A ; if 5.0<j α x should be formed a separate set.2.2. 
Interval-interval Clustering MethodInterval-Interval clustering method we proposed is an extension to value-interval clustering model; It replaces the threshold λ with interval.Interval-Interval clustering model needs netting, relatively certain classification and similarly fuzzy classification. In order to ensure the undecided element in the matrix to belong only one set, the concept of similar degree is extended.612 Y.Yin et al. /Journal of Computational Information Systems 4:2 (2008) 609-616Suppose λ is [+−00,λλ], A=]}[,,...,,{21x x x x n , ],[+−i i t t is the similar coefficient between x and i x .According to the following information formula:⎪⎩⎪⎨⎧<≤≤−−+=⎪⎩⎪⎨⎧<≤≤−−+=−−+−−−−−++++++−−−++−−+−+−+++002000201log 10log 100λλλαλλλαλλi i i t t t i i i i i i i t t t i i i i t if t t if t t t t if t t if t t t i i i i i i ],[+−i i αα can be worked out.Let ]},[],...,,[],,min{[],[2211+−+−+−+−=n n αααααααα, and call it as the similar degree of x similarly belonging to set A.If s βββ,...,,21 are also similar degrees belonged to set B, take },...,,max{21s j ββββ=.⎩⎨⎧<∈≥∈5.0)(_5.0)(j j j Center set new x Center A x ββ where center(j β) represents the center of j β, and new_set represents a new different set.2.3. Matrix-interval Clustering MethodMatrix-Interval Clustering Method is an extension to Interval-Interval Clustering Model, which make the λ take matrix value.For each ],[+−ij ij t t , it can be equal to 10,)(≤≤−+−+−u u t t t ij ij ij . Given an 0u called attitude factor, theinterval can be thus expressed by 0)(u t t t r ij ij ijij −+−−+=. So, we transform similar interval matrix R into two normal matrixes:⎥⎥⎥⎥⎥⎦⎤⎢⎢⎢⎢⎢⎣⎡=1............112,1,1,2n n r r r M and ⎥⎥⎥⎥⎥⎦⎤⎢⎢⎢⎢⎢⎣⎡=1............112,1,1,2n n u u u U , where the two normal matrix U is made up with different ij u which is related to different interval.After having a composition calculation to M and U respectively, we get their fuzzy equality relationships;Y.Yin et al. /Journal of Computational Information Systems 4:2 (2008) 609-616 613if take different λ value, we can get different classification results, where classification result is the intersection set of equality relationship M and U, Finally, the reasonable class according to the fact situation is produced. Matrix-Interval clustering model is to transform interval value matrix into two normal matrixes, and then, have a cluster by fuzzy equality relationship clustering method.Since matrix-interval clustering is different as the change of the value of ij u (all the values of ij u consist of matrix U), while the value of ij u is fairly influenced by the field knowledge, it needs the field experts to give special directions and then can get satisfied clustering results. But, since this kind of way changes the interval-value to normal one, the efficiency will be a fairly increase.Since there exist much more intervals in reality, while these intervals cannot be processed by the apriori method, three kinds of interval models are used to handle the issue, which is our motivation.3. Interval-based Data Mining3.1. Mining in Traditional DatabaseIn a traditional database, the records are often made up with a batch of numbers, which get their values in a certain field. The values of each field are changed corresponding to their own areas, but in certain area all the values are attributed to same class. A popular processing method is to divide them into several intervals according to the actual needs. 
However, a hard dividing boundary issue will be appeared, so the motivation we proposed interval clustering method is to solve the problem. By using our approach, the classification will be more reasonable and the boundary will be softened for the thresholds (for detail in section 2), and the user satisfied threshold can be changed according to the actual conditions, i.e., it is a dynamical process. It is more important for us to introduce a mechanism to provide an interface to user to change the threshold dynamically. That is to say, the real-world problem fields are abstracted to produce a general processing procedure to handle all kinds of information.Our approach to handle the traditional database is described as following:For the inaccurate field such as age, we can not give an accurate value for a man’ age, but we can give an interval for a man’s accurate age: [39, 40] for the man aged 39.4. So we introduce intervals in traditional database.For the accurate value, we can regard it as a special interval, e.g. [1.82, 1.82] for a man with 1.82 height. For other cases, we can use the traditional way to discretize them firstly and then handle them.In addition, we can divide the real problem area, and make all data of each field clustered automatically according to the user satisfied threshold. The algorithm is stated as follow.Algorithm 1 (Data Clustering Algorithm).Input: DB denotes database; Attr_List denotes attribute set; Thresh_Set denotes the threshold used to clusterOutput: the clustering results614 Y.Yin et al. /Journal of Computational Information Systems 4:2 (2008) 609-616Step 1: for each ∈i a Attr_List set Uni _= set Uni _+Transfer_ComparableType(i a ); //Transform all the attributes into comparative types, and save to i a Uni _.Step 2: for each i b ∈set Uni _ and i b ∈φ {//work out the similarities of all the values of each attributeStep 3: for any k, j ∈DB and k,j i b ∈{Step 4: Compute_Similiarity(k,j);}//Calculate similarity, for detail to see section 2Step 5: Gen_SimilarMatrix(i i M a ,);//Generate similarity matrix of values of certain attribute Step 6: C ←i M ;}//C is the array of similarity matrixStep 7: for each C m i ∈Step 8: G=GetValue(i m );//circle for interval clusteringStep 9: Gen_IntervalCluster(Attr_List,G);//ClusteringStep 10: S=statistic(G);//count the support of item setStep 11: Arrange_Matrix(DB,C);//Merge and arrange the last resultsIn the above steps, after getting the final clustering results, we can conduct a data mining procedure, and the results are certain to quantitative association rules (Srikant&Agrawal 1996, [12]), which describe the quantitative relationships among items.3.2. Mining in Interval DatabaseInterval Database Data Mining is quite difference with traditional database Data Mining in that it introduces the concept of interval database.Definition 2(Interval Database). Suppose n D D D ,...,,21are n fields, and )(),...,(),(21n D F D F D F are some sets respectively constructed by some intervals in n D D D ,...,,21. Regard them as value fields of attributes in which some relations will be defined. Make a Decare Product: )(...)()(21n D F D F D F ×××, and call one of this Decare set’s subsets as interval relations owned by record attributes, and now, the database is called interval database. A record of the interval database can be expressed by t = (n x x x ,...,21), where )(i i D F x ∈ (i = 1, 2, …, n).Definition 3(Interval Distance). Suppose [a, b], [c, d] are any two closed intervals, and the distanceY.Yin et al. 
/Journal of Computational Information Systems 4:2 (2008) 609-616 615between the two interval-values is defined as: d([a, b], [c, d]) = [M, N], where M= Min{ ||i x - i y ||}, N=Max{ ||i x - i y ||}, for any i x ∈[a, b], any i y ∈[c, d].E.g. [1, 3], [7, 8] are two intervals, and the distance between them is: d([1, 3], [7, 8])= [4, 7].Definition 4(Quantitative Association Rule). A quantitative association rule obey the form as follow: x ∈A ⇒y ∈B (Support, Confidence), where x and y stand for two attributes; A and B are two interval-values. The antecedent of the quantitative association describes that if x is attributed to the interval-value A then y is attributed to the interval-value B. Support (in the bracket) stands for the frequency of the rue appeared in the database, and confidence denotes the convincing degree of the rule. For example, “if a teacher can earn 5,000 - 9,000 dollars per year, he can purchase 1 - 3 cars in 5 years”. Interval data mining is to classify )(i D F by “interval clustering method”, and finally merge the database to reduce the verbose attributes, and transform to common quantitative database for mining. The algorithm is as follow.Step 1: Transform )(i D F related to attribute i D in the database into comparative type by generalizing and abstracting;Step 2: In the processed database, work out the interval distance between two figures for each ()i D F , and the distance is regarded as their similar measurement. So a similar matrix is generated;Step 3: Cluster according to one of the three interval models;Step 4: Decide whether the value is fit to the threshold, after labeling the attribute again, get quantitative attributes;Step 5: Make a data mining about quantitative attributes;Step 6: Repeat step 3 and step 4;Step 7: Arrange and merge the results of data mining.Step 8: Get the quantitative association rules.In the above algorithm, step 1 finishes a data transformation, because all the data we will handle are required to be comparative.Step 2 firstly computes the distance among interval-values, and then arranges them as a matrix. Step 3 uses any one of our introduced interval-value clustering methods to classify all the intervals.In step 4, we replace all the interval-values in the same class with a new identifying sign. Thus we get a quantitative dataset based on the original interval database.In step 5, we use the traditional data mining ways to handle the data.Step 6 – step 8 can help us to get association rules.3.3. ExperimentsIn order to validate the interval-based data mining, some classical datasets are used to test the efficiency and effectiveness. The experimental results show mining quantitative association rules by interval clustering is a much promising investigation. For publishing pages limit, the experimental results are616 Y.Yin et al. /Journal of Computational Information Systems 4:2 (2008) 609-616 arranged in ftp:///.4.Conclusions and Future WorkThe application of interval clustering in data mining has been discussed in this article. 
Firstly three kinds of interval clustering methods are proposed, and validated by examples; then interval clustering mining approaches in traditional database and interval database are provided respectively, by conducting some experiments about the classical datasets, mining quantitative association rules by interval clustering is proved to be a promising researchable direction.The future work should be done on the issues of handling non-numeric or non-comparable data, since interval approach to data mining is mainly suitable for processing numeric data or comparable data, especially for the interval database. It is also the future work for making the prototype software more practical and more useful.References[1]R. Agrawal,T. Imieliski, A. Swami. Mining association rules between sets of items in large databases. In:Proceedings of ACM SIGMOD, 1993: 207-216.[2]T. Jiang, A. Tan, K. Wang. Mining Generalized Associations of Semantic Relations from Textual Web Content.IEEE Transactions on Knowledge and Data Engineering, 19(2): 164 - 179, 2007.[3]P. Laxminarayan, S. A. Alvarez, C. Ruiz et al. Mining Statistically Significant Associations for ExploratoryAnalysis of Human Sleep Data. IEEE Transactions on Information Technology in Biomedicine, 10(3): 440 - 450, 2006.[4]Y. Takama. S. Hattori. Mining Association Rules for Adaptive Search Engine Based on RDF Technology; IEEETransactions on Industrial Electronics, 54(2): 790 - 796, 2007.[5]M. Song, S. Rajasekaran. A transaction mapping algorithm for frequent itemsets mining. IEEE Transactions onKnowledge and Data Engineering, 18(4): 472 - 481, 2006.[6]H. Lee, W. Park, and D. Park, An Efficient Method for Quantitative Association Rules to Raise Reliance of Data.In: APWeb 2004, 2004: 506-512.[7]Q. Song; M. Shepperd, M. Cartwright et al. Software defect association mining and defect correction effortprediction. IEEE Transactions on Software Engineering, 32(2): 69 - 82, 2006.[8]Shichao Zhang and Chengqi Zhang, Discovering Causality in Large Databases, Applied Artificial Intelligence,2002: 333-358.[9] A. Savasere, E. Omiecinski, and S. Navathe, Mining for strong negative associations in a large database ofcustomer transactions. In: Proceedings of ICDE. 1998: 494-502.[10]R. Miller, Y. Yang, Association Rules over Interval Data. In: Proceedings ACM SIGMOD97, 1997: 452-461[11] C. Hu, S. Xu, and X. Yang, An introduction to interval value algorithm, Systems Engineering - Theory &Practice, 2003 (4): 59-62.[12]R. Srikant and R. Agrawal, Mining Quantitative Association Rules in Large Tables. In: Proceedings of ACMSIGMOD, 1996: 1-12.。
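A small Python sketch of Definition 3 (interval distance) together with a much-simplified grouping step in the spirit of the value-interval "netting" procedure; the grouping rule and the threshold value are my own simplification, not the paper's full model:

def interval_distance(a, b):
    """d([a1,a2],[b1,b2]) = [min |x - y|, max |x - y|] over x in a, y in b."""
    (a1, a2), (b1, b2) = a, b
    overlap = not (a2 < b1 or b2 < a1)
    lo = 0.0 if overlap else min(abs(b1 - a2), abs(a1 - b2))
    hi = max(abs(b2 - a1), abs(a2 - b1))
    return (lo, hi)

print(interval_distance((1, 3), (7, 8)))   # (4, 7), as in the paper's example

# Simplified netting: put two intervals in the same group when the upper
# bound of their distance stays below a user-chosen threshold.
def group(intervals, threshold):
    clusters = []
    for iv in intervals:
        for c in clusters:
            if all(interval_distance(iv, other)[1] <= threshold for other in c):
                c.append(iv)
                break
        else:
            clusters.append([iv])
    return clusters

print(group([(39, 40), (38, 41), (60, 65), (63, 66)], threshold=10))
# -> [[(39, 40), (38, 41)], [(60, 65), (63, 66)]]

Groups of intervals produced this way can then be relabeled as categorical values and fed to a conventional quantitative association rule miner, which is the overall pipeline the paper describes.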
