
MATRIX APRIORI:

SPEEDING UP THE SEARCH FOR FREQUENT PATTERNS

Judith Pavón

Department of Computer Science

Anhembi Morumbi University

São Paulo, Brazil

jvmendoza@anhembi.br

Sidney Viana

Department of Computer Engineering

São Paulo University

São Paulo, Brazil

sidney.viana@p.br

Santiago Gómez

School of Computer Science and

Information Technology

RMIT University

Bundoora, Australia

sgomez@.au

ABSTRACT

This work discusses the problem of generating association rules from a set of transactions in a relational database, taking performance and the accuracy of the results as the essential criteria for comparing association mining algorithms. We critically analyze two existing methods, Apriori and FP-growth, emphasizing their strengths and weaknesses; based on this analysis, we propose an algorithm called Matrix Apriori that combines the best features of both.

Matrix Apriori uses simple structures such as matrices and vectors to generate frequent patterns, and it also minimizes the number of candidate sets, thus achieving a more efficient computation than Apriori and FP-growth. The proposed algorithm can be easily extended to incorporate multiple user-defined minimum supports, with the aim of improving the method's efficacy.

KEYWORDS

Data Mining, Databases, Association Rules, Frequent Patterns.

1. Introduction

Data mining, one of the steps in knowledge discovery from databases, has been recognized in the literature as one of the most important areas of current database research. The term “mining” characterizes the process of finding a set of really interesting patterns amid large quantities of data. One of the leading techniques in data mining is that of association rules.

Given a collection of items, association rules describe how combinations of items appear jointly in the same sets. A typical application of association rules is so-called market basket analysis, which infers customers' buying habits by finding associations between the different items placed in a market basket. The discovery of such associations can help develop marketing strategies for items that are frequently bought together [7]. An example of a rule is: “30% of transactions that contain beer also contain nappies; and 2% of all transactions contain both”.

Here is a formal description for the problem of mining association rules [4]:

Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. An association rule is an implication of the form X ⇒ Y, where X and Y are statements regarding the value of attributes, and at the same time, X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.

Each association rule has two associated measures, known as support and confidence [2]. Support is the probability that a record satisfies both X and Y. Confidence is the probability that a record satisfies Y given that it satisfies X. In the previous example, 30% is the confidence and 2% is the support of the rule. Thus, rule X ⇒ Y has support s in the set of transactions D if s percent of the transactions in D contain X ∪ Y; and rule X ⇒ Y has confidence c if c percent of the transactions in D that contain X also contain Y.
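The two measures can be illustrated with a minimal Python sketch. The function names `support` and `confidence` and the toy transactions are assumptions introduced here for illustration; they are not taken from the paper, and the values are expressed as fractions rather than percentages.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """support(X ∪ Y) / support(X); returns 0.0 when X never occurs."""
    x_set, y_set = frozenset(x), frozenset(y)
    sup_x = support(x_set, transactions)
    if sup_x == 0:
        return 0.0
    return support(x_set | y_set, transactions) / sup_x


# Hypothetical toy database D with five transactions.
transactions = [
    frozenset({"beer", "nappies", "crisps"}),
    frozenset({"beer", "nappies"}),
    frozenset({"beer", "bread"}),
    frozenset({"beer", "crisps"}),
    frozenset({"bread", "milk"}),
]

sup_rule = support({"beer", "nappies"}, transactions)
conf_rule = confidence({"beer"}, {"nappies"}, transactions)
```

In this toy data, two of the five transactions contain both beer and nappies, and four contain beer, so the rule beer ⇒ nappies has support 2/5 and confidence 2/4.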

The problem consists of finding all association rules that satisfy user-specified thresholds of minimum support and minimum confidence, called minsup and minconf. The goal is to find regularities in customers' behavior among product combinations that are often bought together. Since the inception of association rule mining [2], a number of authors have investigated and proposed algorithms that perform this task efficiently. Most of these algorithms follow the logic of the Apriori algorithm [3], which is therefore considered one of the most influential in the field of rule mining. Another significant algorithm in this setting is FP-growth [8], whose main objective is to minimize the candidate sets so as to obtain better performance than previous algorithms. The logic used by FP-growth differs completely from that of Apriori: it grows patterns rather than generating candidates.
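To make the problem statement concrete, the following is a deliberately naive Python sketch that enumerates every itemset and every split of it into X ⇒ Y, keeping the rules that meet minsup and minconf. It is not Apriori, FP-growth, or Matrix Apriori; its exponential cost is the kind of overhead such algorithms aim to reduce. The function name `mine_rules_naive` and the toy data are assumptions made for illustration.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float, float]


def mine_rules_naive(transactions: List[FrozenSet[str]],
                     minsup: float, minconf: float) -> List[Rule]:
    """Brute-force search for rules X => Y with support >= minsup
    and confidence >= minconf (both given as fractions)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def sup(s: FrozenSet[str]) -> float:
        return sum(1 for t in transactions if s <= t) / n

    rules: List[Rule] = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = sup(itemset)
            if s == 0 or s < minsup:
                continue  # itemset is not frequent
            # Try every non-empty proper subset as the antecedent X.
            for k in range(1, size):
                for ante in combinations(combo, k):
                    x = frozenset(ante)
                    y = itemset - x
                    c = s / sup(x)  # sup(x) >= s > 0
                    if c >= minconf:
                        rules.append((x, y, s, c))
    return rules


# Hypothetical toy database with four transactions.
transactions = [
    frozenset({"beer", "nappies"}),
    frozenset({"beer", "nappies", "bread"}),
    frozenset({"beer", "bread"}),
    frozenset({"milk", "bread"}),
]

rules = mine_rules_naive(transactions, minsup=0.5, minconf=0.7)
```

With these thresholds, only the rule nappies ⇒ beer survives in the toy data: it has support 2/4 and confidence 2/2, while beer ⇒ nappies reaches a confidence of only 2/3.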

There have been several extensions put forward for Apriori and for FP-growth [5, 6, 9, 11, 12, 13, 14, 16], and although they offer improvements in certain aspects,