Mining Quantitative Association Rules by Interval Clustering




Journal of Computational Information Systems 4:2 (2008) 609-616

Yunfei YIN 1†, Zhi ZHONG 2, Yingxun WANG 1
1 Department of Automatic Control, Beijing University of Aeronautics and Astronautics, Beijing 100083, China
2 Department of Mathematics and Computer Science, Guangxi Teacher's College, Nanning 530001, China

Abstract

For complex information processing, three novel algorithms for quantitative association rules are proposed: value-interval clustering mining, interval-interval clustering mining, and matrix-interval clustering mining. Compared with Apriori association rule mining, the interval approach is of more practical interest, especially for inaccurate information. The new approach is a valuable attempt at solving the complex information processing problem and suggests many possible techniques for improving it. After introducing the concepts of value-interval clustering, interval-interval clustering, and matrix-interval clustering, the general interval modeling approach to data mining is stated, with some examples. Finally, some classical datasets are tested, and the experimental results show the feasibility of the new approach.

Keywords: Interval Clustering; Quantitative Association Rule; Data Mining

1. Introduction

Since Agrawal et al. (1993) posed the problem of mining association rules [1], it has been a fairly active branch of data mining. The motivation of association rule mining is to find how the items bought in a consumer basket relate to each other. Given a set of transactions, where each transaction is a set of items, an association rule is an expression of the form X → Y, where X and Y are sets of items. An example of an association rule is: "30% of transactions that contain beer also contain diapers; 2% of all transactions contain both of these items". Here 30% is called the confidence of the rule, and 2% the support of the rule.
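As a small illustration of the two measures just defined, the sketch below computes the support and confidence of a rule X → Y over a toy list of transactions. The transactions and the helper names `support` and `confidence` are made up for this example and do not come from the paper.

```python
from typing import FrozenSet, List, Set


def support(transactions: List[FrozenSet[str]], itemset: Set[str]) -> float:
    """Fraction of all transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]], x: Set[str], y: Set[str]) -> float:
    """Among the transactions that contain X, the fraction that also contain Y."""
    with_x = sum(1 for t in transactions if x <= t)
    with_both = sum(1 for t in transactions if (x | y) <= t)
    return with_both / with_x


# Hypothetical toy data, not taken from the paper.
transactions = [
    frozenset({"beer", "diapers"}),
    frozenset({"beer"}),
    frozenset({"beer", "chips"}),
    frozenset({"diapers", "milk"}),
    frozenset({"milk"}),
]

# Rule {beer} -> {diapers}: 1 of 5 transactions holds both items,
# and 1 of the 3 beer transactions also holds diapers.
print(support(transactions, {"beer", "diapers"}))
print(confidence(transactions, {"beer"}, {"diapers"}))
```

On this toy data the rule {beer} → {diapers} has support 1/5 and confidence 1/3.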
The problem is to find all association rules that satisfy user-specified minimum support and minimum confidence constraints.

The motivation of our interval approach to data mining is to find interval relationships between items, e.g., "50 percent of college teachers aged 40 to 65 who earn 8,000 to 100,000 dollars each year own 2 to 4 cars". Because incomplete and inaccurate information always exists in the real world, many similar cases can be seen around us.

† Corresponding author. Email addresses: yinyunfei@ (Yunfei YIN), zhong8662@ (Zhi ZHONG).
1553-9105/ Copyright © 2008 Binary Information Press, February, 2008

Related work. For traditional data mining, there are many extensions that enhance its efficiency and find more useful patterns, such as references [2-8]. A fast algorithm for mining Boolean association rules was proposed by Agrawal et al.; it can be used to solve the arrangement of commodities in a supermarket. Savasere et al. (1998, [9]) also discussed how to find strong negative association rules in the context of taxonomies of items, with a focus on finding rules that are negatively correlated with other rules. In order to find more valuable association rules, Lent et al. mined clustered association rules by means of clustering (Miller & Yang 1997, [10]). However, incomplete information and ambiguity between different objects always exist, which makes it reasonable to represent an attribute by intervals. In this context, we propose the interval approach to handle inaccurate information.
Furthermore, a modeling method is introduced and formalized, which is suitable for mining inaccurate database information.

The rest of the paper is organized as follows. Section 2 introduces some concepts and problems of quantitative association rule mining and then describes three interval clustering methods in detail. Section 3 offers two interval mining methods, for normal databases and for interval databases respectively, and reports experimental results on some classical datasets. Section 4 gives a brief conclusion.

2. Interval Clustering Methods

Interval theory originates from computational mathematics and is widely used in many fields, such as control engineering and electronic commerce (Hu & Xu & Yang 2003, [11]). Interval-based data mining is also a very important application of interval theory. In this section we propose three methods for interval data processing.

2.1. Value-interval Clustering Method

Suppose $x_1, x_2, \ldots, x_n$ are n objects whose behaviors are characterized by interval values. According to the traditional clustering similarity formula, we can obtain the corresponding similarity matrix:

\[
R = (r_{ij}) =
\begin{bmatrix}
1 & & & \\
[t_{21}^-, t_{21}^+] & 1 & & \\
\vdots & \vdots & \ddots & \\
[t_{n1}^-, t_{n1}^+] & [t_{n2}^-, t_{n2}^+] & \cdots & 1
\end{bmatrix},
\]

where R is a symmetric matrix and $r_{ij} = [t_{ij}^-, t_{ij}^+]$ is the similarity between $x_i$ and $x_j$, i, j = 1, 2, …, n.

We propose a useful value-interval clustering model as follows:

(1) Netting: in R, if $t_{ij}^- > \lambda_0$ (the threshold $\lambda_0$ is given by domain experts or by the users), the element $[t_{ij}^-, t_{ij}^+]$ is replaced by "×"; if $t_{ij}^+ < \lambda_0$, the element is replaced by a blank; if $\lambda_0 \in [t_{ij}^-, t_{ij}^+]$, the element is replaced by "#". We call "×" a node and "#" a similar node. We first draw a longitude and a latitude line from each node to the diagonal, and use dashed lines to draw the longitude and latitude from each similar node to the diagonal, as described in Figure 1.

(2) Relatively certain classification: for each node denoted by ×, find the elements on the diagonal that are related to the node. These elements are grouped into a set, which is called a relatively certain classification.

(3) Similar fuzzy classification: for each similar node denoted by #, likewise find the elements on the diagonal that are related to the similar node. These elements are grouped into a set, which is called a similar fuzzy classification.

Note: a relatively certain classification classifies the objects clearly, while a similar fuzzy classification does not. For example, in Figure 1 the object set U = {1, 2, 3, 4, 5} can be classified into two sets: A = {1, 2, [5]}, B = {3, 4, [5]}. However, it is not yet decided which of the two sets object 5 belongs to.

Definition 1 (Similar Degree). Suppose $A = \{x_1, x_2, \ldots, x_n, [x]\}$. The similar degree $\alpha$ satisfies the following two conditions:

(1) $[t_i^-, t_i^+]$ is the similar coefficient of $x$ and $x_i$;

(2) \[ \alpha = \min\left\{ \frac{t_1^+ - \lambda_0}{t_1^+ - t_1^-},\ \frac{t_2^+ - \lambda_0}{t_2^+ - t_2^-},\ \ldots,\ \frac{t_n^+ - \lambda_0}{t_n^+ - t_n^-} \right\}. \]

So, if x similarly belongs to $A_1, A_2, \ldots, A_s$ at the same time, with similar degrees $\alpha_1, \alpha_2, \ldots, \alpha_s$ respectively, take $\alpha_j = \max\{\alpha_1, \alpha_2, \ldots, \alpha_s\}$. If $\alpha_j \ge 0.5$, we consider that x should belong to set $A_j$; if $\alpha_j < 0.5$, x should form a separate set.

2.2. Interval-interval Clustering Method

The interval-interval clustering method we propose is an extension of the value-interval clustering model; it replaces the threshold $\lambda$ with an interval.

The interval-interval clustering model also needs netting, relatively certain classification and similar fuzzy classification. In order to ensure that an undecided element of the matrix belongs to only one set, the concept of similar degree is extended.
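The netting step and the similar-degree rule of Definition 1, on which both the value-interval and the interval-interval models rely, can be sketched as follows. This is a minimal sketch rather than the authors' implementation: the function names, the handling of degenerate intervals, the ASCII "x" used in place of "×", and all numeric values are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]


def netting(R: List[List[Interval]], lam0: float) -> List[List[str]]:
    """Mark the lower triangle of the interval similarity matrix R.

    'x' marks a node (t- > lam0), '#' a similar node (lam0 lies in
    [t-, t+]) and ' ' a blank (t+ < lam0). The diagonal is marked '1'.
    """
    n = len(R)
    grid = [[" "] * n for _ in range(n)]
    for i in range(n):
        grid[i][i] = "1"
        for j in range(i):
            lo, hi = R[i][j]
            if lo > lam0:
                grid[i][j] = "x"
            elif hi >= lam0:  # here lo <= lam0 <= hi
                grid[i][j] = "#"
    return grid


def similar_degree(coeffs: List[Interval], lam0: float) -> float:
    """alpha = min_i (t_i+ - lam0) / (t_i+ - t_i-), as in Definition 1."""
    degrees = []
    for lo, hi in coeffs:
        if hi == lo:
            # Assumption (not in the paper): a degenerate interval
            # counts fully if it reaches the threshold, else not at all.
            degrees.append(1.0 if lo >= lam0 else 0.0)
        else:
            degrees.append((hi - lam0) / (hi - lo))
    return min(degrees)


def assign(candidates: Dict[str, List[Interval]], lam0: float) -> str:
    """Put x into the set with the largest alpha, or into a new set."""
    alphas = {name: similar_degree(c, lam0) for name, c in candidates.items()}
    best = max(alphas, key=alphas.get)
    return best if alphas[best] >= 0.5 else "new_set"


# Hypothetical 3-object similarity matrix (only the lower triangle is read).
one = (1.0, 1.0)
R = [
    [one, one, one],
    [(0.7, 0.9), one, one],
    [(0.5, 0.8), (0.1, 0.3), one],
]
grid = netting(R, lam0=0.6)

# Hypothetical similar coefficients of an undecided object x with the
# members of two candidate sets A and B.
candidates = {
    "A": [(0.5, 0.9), (0.55, 0.95)],
    "B": [(0.4, 0.7), (0.5, 0.8)],
}
print(grid)
print(assign(candidates, lam0=0.6))
```

With these numbers the undecided object is placed in A, because its similar degree with respect to A (about 0.75) is the larger one and is at least 0.5.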
Suppose $\lambda$ is $[\lambda_0^-, \lambda_0^+]$, $A = \{x_1, x_2, \ldots, x_n, [x]\}$, and $[t_i^-, t_i^+]$ is the similar coefficient between $x$ and $x_i$. According to the following information formula, $[\alpha_i^-, \alpha_i^+]$ can be worked out.

[Equation not legible in the source: two piecewise definitions, one for $\alpha_i^-$ and one for $\alpha_i^+$, each expressed through $t_i^-$, $t_i^+$, $\lambda_0^-$, $\lambda_0^+$ and a logarithmic term.]

Let

\[ [\alpha^-, \alpha^+] = \min\{[\alpha_1^-, \alpha_1^+],\ [\alpha_2^-, \alpha_2^+],\ \ldots,\ [\alpha_n^-, \alpha_n^+]\}, \]

and call it the similar degree of x similarly belonging to set A.

If $\beta_1, \beta_2, \ldots, \beta_s$ are likewise similar degrees, belonging to a set B, take $\beta_j = \max\{\beta_1, \beta_2, \ldots, \beta_s\}$. Then

\[
\begin{cases}
x \in A_j, & \mathrm{Center}(\beta_j) \ge 0.5 \\
x \in \mathrm{new\_set}, & \mathrm{Center}(\beta_j) < 0.5
\end{cases}
\]

where $\mathrm{Center}(\beta_j)$ represents the center of $\beta_j$, and new_set represents a new, different set.

2.3. Matrix-interval Clustering Method

The matrix-interval clustering method is an extension of the interval-interval clustering model in which $\lambda$ takes a matrix value.

Every value in $[t_{ij}^-, t_{ij}^+]$ can be written as $t_{ij}^- + (t_{ij}^+ - t_{ij}^-)u$, $0 \le u \le 1$. Given a $u_0$, called the attitude factor, the interval can thus be expressed by $r_{ij} = t_{ij}^- + (t_{ij}^+ - t_{ij}^-)u_0$. So we transform the similar interval matrix R into two normal matrices:

\[
M =
\begin{bmatrix}
1 & & & \\
r_{2,1} & 1 & & \\
\vdots & \vdots & \ddots & \\
r_{n,1} & r_{n,2} & \cdots & 1
\end{bmatrix}
\quad\text{and}\quad
U =
\begin{bmatrix}
1 & & & \\
u_{2,1} & 1 & & \\
\vdots & \vdots & \ddots & \\
u_{n,1} & u_{n,2} & \cdots & 1
\end{bmatrix},
\]

where the normal matrix U is made up of the different $u_{ij}$, each related to a different interval.

After applying the composition calculation to M and U respectively, we obtain their fuzzy equivalence relations. Taking different values of $\lambda$ gives different classification results, where a classification result is the intersection of the equivalence relations of M and U. Finally, a reasonable classification is chosen according to the actual situation.
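The transformation and the composition step described above can be sketched as follows. The sketch is a simplification under stated assumptions: it collapses each interval with an attitude factor $u_{ij}$, computes the max-min transitive closure of M only (the separate composition on U and the intersection of the two relations are omitted), and reads off the classes of a $\lambda$-cut. The function names and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]
Matrix = List[List[float]]


def collapse(R: List[List[Interval]], U: Matrix) -> Matrix:
    """r_ij = t_ij^- + (t_ij^+ - t_ij^-) * u_ij for every entry of R."""
    n = len(R)
    return [[R[i][j][0] + (R[i][j][1] - R[i][j][0]) * U[i][j]
             for j in range(n)] for i in range(n)]


def max_min_compose(A: Matrix, B: Matrix) -> Matrix:
    """Max-min composition of two fuzzy relations."""
    n = len(A)
    return [[max(min(A[i][k], B[k][j]) for k in range(n))
             for j in range(n)] for i in range(n)]


def transitive_closure(M: Matrix) -> Matrix:
    """Square M under max-min composition until it stops changing."""
    while True:
        M2 = max_min_compose(M, M)
        if M2 == M:
            return M
        M = M2


def lambda_cut_classes(E: Matrix, lam: float) -> List[List[int]]:
    """Classes of the crisp relation E[i][j] >= lam (E an equivalence)."""
    n = len(E)
    seen = [False] * n
    classes = []
    for i in range(n):
        if seen[i]:
            continue
        cls = [j for j in range(n) if E[i][j] >= lam]
        for j in cls:
            seen[j] = True
        classes.append(cls)
    return classes


def symmetric(n: int, pairs: Dict[Tuple[int, int], Interval]) -> List[List[Interval]]:
    """Build a symmetric interval matrix with [1, 1] on the diagonal."""
    R = [[(1.0, 1.0)] * n for _ in range(n)]
    for (i, j), iv in pairs.items():
        R[i][j] = R[j][i] = iv
    return R


# Hypothetical similarity intervals for four objects.
R = symmetric(4, {
    (0, 1): (0.7, 0.9), (0, 2): (0.1, 0.3), (0, 3): (0.2, 0.4),
    (1, 2): (0.2, 0.4), (1, 3): (0.1, 0.3), (2, 3): (0.6, 0.8),
})
U = [[0.5] * 4 for _ in range(4)]  # assumed attitude factors u_ij

E = transitive_closure(collapse(R, U))
print(lambda_cut_classes(E, 0.5))
print(lambda_cut_classes(E, 0.25))
```

With these numbers a threshold of 0.5 separates objects {0, 1} from {2, 3}, while a threshold of 0.25 merges all four objects into one class, which illustrates how different values of $\lambda$ lead to different classification results.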
The matrix-interval clustering model thus transforms the interval-valued matrix into two normal matrices and then clusters them by the fuzzy equivalence relation clustering method.

The result of matrix-interval clustering changes with the values of $u_{ij}$ (all the values of $u_{ij}$ together make up the matrix U), and these values depend heavily on domain knowledge, so domain experts must give specific guidance before satisfactory clustering results can be obtained. However, because this approach converts interval values into normal values, efficiency improves considerably.

Many intervals exist in reality, and these intervals cannot be processed by the Apriori method; the three interval models are used to handle this issue, which is our motivation.

3. Interval-based Data Mining

3.1. Mining in Traditional Database

In a traditional database, the records are often made up of a batch of numbers, each taking its value in a certain field. The values of each field vary over their own ranges, but within a certain range all values belong to the same class. A popular processing method is to divide the values into several intervals according to actual needs. However, this raises the problem of hard dividing boundaries, and our motivation for proposing the interval clustering method is to solve that problem. With our approach, the classification is more reasonable and the boundary is softened by the thresholds (for details see Section 2), and the user-satisfied threshold can be changed according to actual conditions, i.e., it is a dynamic process. It is all the more important to introduce a mechanism that provides an interface through which the user can change the threshold dynamically.
That is to say, the real-world problem domains are abstracted to produce a general procedure for handling all kinds of information.

Our approach to handling a traditional database is described as follows.

For an inaccurate field such as age, we cannot give an accurate value for a man's age, but we can give an interval that contains his accurate age: [39, 40] for a man aged 39.4. So we introduce intervals into the traditional database.

An accurate value can be regarded as a special interval, e.g., [1.82, 1.82] for a man of height 1.82. For other cases, we can first discretize the values in the traditional way and then handle them.

In addition, we can divide the real problem area and let all the data of each field be clustered automatically according to the user-satisfied threshold. The algorithm is stated as follows.

Algorithm 1 (Data Clustering Algorithm).
Input: DB denotes the database; Attr_List denotes the attribute set; Thresh_Set denotes the thresholds used for clustering
Output: the clustering results
Step 1: for each $a_i \in$ Attr_List: Uni_set = Uni_set + Transfer_ComparableType($a_i$); // Transform all the attributes into comparable types and save them to Uni_$a_i$
Step 2: for each $b_i \in$ Uni_set and $b_i \neq \phi$ { // Work out the similarities of all the values of each attribute
Step 3: for any k, j ∈ DB with k, j ∈ $b_i$ {
Step 4: Compute_Similiarity(k, j); } // Calculate the similarity; for details see Section 2
Step 5: Gen_SimilarMatrix($a_i$, $M_i$); // Generate the similarity matrix of the values of a given attribute
Step 6: C ← $M_i$; } // C is the array of similarity matrices
Step 7: for each $m_i \in C$
Step 8: G = GetValue($m_i$); // Loop for interval clustering
Step 9: Gen_IntervalCluster(Attr_List, G); // Clustering
Step 10: S = statistic(G); // Count the support of the item sets
Step 11: Arrange_Matrix(DB, C); // Merge and arrange the final results

After the final clustering results have been obtained by the above steps, we can conduct a data mining procedure. The results are necessarily quantitative association rules (Srikant & Agrawal 1996, [12]), which describe the quantitative relationships among items.

3.2. Mining in Interval Database

Interval database mining differs considerably from traditional database mining in that it introduces the concept of an interval database.

Definition 2 (Interval Database). Suppose $D_1, D_2, \ldots, D_n$ are n fields, and $F(D_1), F(D_2), \ldots, F(D_n)$ are sets of intervals in $D_1, D_2, \ldots, D_n$ respectively. Regard them as the value domains of attributes on which relations will be defined. Form the Cartesian product $F(D_1) \times F(D_2) \times \cdots \times F(D_n)$, and call a subset of this product an interval relation over the record attributes; the database is then called an interval database. A record of the interval database can be expressed as $t = (x_1, x_2, \ldots, x_n)$, where $x_i \in F(D_i)$ (i = 1, 2, …, n).

Definition 3 (Interval Distance). Suppose [a, b] and [c, d] are any two closed intervals. The distance between the two interval values is defined as d([a, b], [c, d]) = [M, N], where $M = \min\{\|x_i - y_i\|\}$ and $N = \max\{\|x_i - y_i\|\}$ over all $x_i \in [a, b]$ and $y_i \in [c, d]$.

E.g., for the two intervals [1, 3] and [7, 8], the distance between them is d([1, 3], [7, 8]) = [4, 7].

Definition 4 (Quantitative Association Rule). A quantitative association rule has the following form: x ∈ A ⇒ y ∈ B (Support, Confidence), where x and y stand for two attributes, and A and B are two interval values. The rule states that if x falls in the interval value A, then y falls in the interval value B. Support (in the brackets) stands for the frequency with which the rule appears in the database, and confidence denotes the degree to which the rule can be trusted. For example, "if a teacher earns 5,000 - 9,000 dollars per year, he can purchase 1 - 3 cars in 5 years".

Interval data mining classifies $F(D_i)$ by the interval clustering method, then merges the database to reduce redundant attributes and transforms it into a common quantitative database for mining. The algorithm is as follows.

Step 1: Transform the $F(D_i)$ related to attribute $D_i$ in the database into a comparable type by generalizing and abstracting;
Step 2: In the processed database, work out the interval distance between every two values of each $F(D_i)$; the distance is regarded as their similarity measure.
A similarity matrix is thus generated;
Step 3: Cluster according to one of the three interval models;
Step 4: Decide whether each value fits the threshold; after relabeling the attribute, obtain quantitative attributes;
Step 5: Perform data mining on the quantitative attributes;
Step 6: Repeat step 3 and step 4;
Step 7: Arrange and merge the results of data mining;
Step 8: Obtain the quantitative association rules.

In the above algorithm, step 1 performs a data transformation, because all the data to be handled are required to be comparable.
Step 2 first computes the distances among the interval values and then arranges them as a matrix.
Step 3 uses any one of the interval-value clustering methods we have introduced to classify all the intervals.
In step 4, we replace all the interval values in the same class with a new identifying label. Thus we obtain a quantitative dataset based on the original interval database.
In step 5, we use traditional data mining methods to handle the data.
Steps 6 to 8 help us obtain the association rules.

3.3. Experiments

In order to validate interval-based data mining, some classical datasets are used to test its efficiency and effectiveness. The experimental results show that mining quantitative association rules by interval clustering is a promising line of investigation. Owing to the page limit, the experimental results are arranged in ftp:///.

4. Conclusions and Future Work

The application of interval clustering in data mining has been discussed in this article.
First, three kinds of interval clustering methods were proposed and validated by examples; then interval clustering mining approaches for traditional databases and for interval databases were provided. Experiments on classical datasets indicate that mining quantitative association rules by interval clustering is a promising research direction.

Future work should address the handling of non-numeric or non-comparable data, since the interval approach to data mining is mainly suitable for processing numeric or comparable data, especially in an interval database. Making the prototype software more practical and more useful is also left to future work.

References

[1] R. Agrawal, T. Imieliński, A. Swami. Mining association rules between sets of items in large databases. In: Proceedings of ACM SIGMOD, 1993: 207-216.
[2] T. Jiang, A. Tan, K. Wang. Mining Generalized Associations of Semantic Relations from Textual Web Content. IEEE Transactions on Knowledge and Data Engineering, 19(2): 164-179, 2007.
[3] P. Laxminarayan, S. A. Alvarez, C. Ruiz et al. Mining Statistically Significant Associations for Exploratory Analysis of Human Sleep Data. IEEE Transactions on Information Technology in Biomedicine, 10(3): 440-450, 2006.
[4] Y. Takama, S. Hattori. Mining Association Rules for Adaptive Search Engine Based on RDF Technology. IEEE Transactions on Industrial Electronics, 54(2): 790-796, 2007.
[5] M. Song, S. Rajasekaran. A transaction mapping algorithm for frequent itemsets mining. IEEE Transactions on Knowledge and Data Engineering, 18(4): 472-481, 2006.
[6] H. Lee, W. Park, D. Park. An Efficient Method for Quantitative Association Rules to Raise Reliance of Data. In: APWeb 2004, 2004: 506-512.
[7] Q. Song, M. Shepperd, M. Cartwright et al. Software defect association mining and defect correction effort prediction. IEEE Transactions on Software Engineering, 32(2): 69-82, 2006.
[8] S. Zhang, C. Zhang. Discovering Causality in Large Databases. Applied Artificial Intelligence, 2002: 333-358.
[9] A. Savasere, E. Omiecinski, S. Navathe. Mining for strong negative associations in a large database of customer transactions. In: Proceedings of ICDE, 1998: 494-502.
[10] R. Miller, Y. Yang. Association Rules over Interval Data. In: Proceedings ACM SIGMOD97, 1997: 452-461.
[11] C. Hu, S. Xu, X. Yang. An introduction to interval value algorithm. Systems Engineering - Theory & Practice, 2003 (4): 59-62.
[12] R. Srikant, R. Agrawal. Mining Quantitative Association Rules in Large Tables. In: Proceedings of ACM SIGMOD, 1996: 1-12.
