数据挖掘作业1hw1
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
HW1
Due Date: Nov. 2
Submission requirements:
Please submit your solutions to our class website. Only hand in what is required below.
Upload the Clementine stream containing the assignment execution to our class website so that we may refer to it if necessary.
Part I: 书面作业
1. 假定数据仓库中包含4个维:date, product, vendor, location;和两个度量:sales_volume和
sales_cost。
(a) 画出该数据仓库的星形模式图
(b) 由基本方体[date, product, vendor, location]开始,列出每年在Los Angles的每个vendor的
sales_volume。
(c) 对于数据仓库,位图索引是有用的。以该立方体为例,简略讨论使用位图索引结构的优点和问
题。
2.Suppose a hospital tested the age and body fat data for 18 random selected adults with the following result:
(a) Calculate the mean, median, and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
(c) Draw a scatter plot based on these two variables.
(d) Normalize the two variables based on min-max normalization.
(e) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two variables
positively or negatively correlated?
3. 下面是一个超市某种商品连续20个月的销售数据(单位为百元)
21,16,19,24,27,23,22,21,20,17,16,20,23,22,18,24,26, 25,20,26。对以上数据进行深度为5的Equal-depth binning.
(a) 采用bin median方法进行平滑;
(b) 采用bin boundaries方法进行平滑。
4. A database has 5 transactions. Let min_sup = 40%.
(a) Find all frequent itemsets using Apriori. Please present the frequent itemsets in each iteration.
(b) Find all frequent itemsets using FP-growth. Please present all the FP-trees, conditional pattern bases, and frequent patterns in the entire process.
5. Consider the data set shown in Question 4. Assume min_sup = 40%, min_conf=40%.
(a) Please present the confidence for all the 3-frequent itemsets.
(b) List all of the strong association rules (with support s and confidence c) matching the
following metarule, where X is a variable representing customers, and item i denotes variables representing ite ms (e.g. “A”, “B”, etc.):
Part II: 上机作业:Recommendation Systems
The goal of this assignment is to learn the use of market basket analysis for the purpose of making product purchase recommendations to the customers.
The data set contains transactions from a large supermarket. Each transaction is made by someone holding the loyalty card. We limited the total number of categories in this supermarket data to 20 categories for simplicity. The field value for a certain product in the transaction basket is 1 if the customer has bought it and 0 if he/she has not. The file named “Transactions” has data for 46243 transactions.
The data are available from the class web page.
Your written submission should consist only of those deliverables marked indicated by “Hand-in”. Market basket analysis has the objective to discover individual products, or groups of products that tend to occur together in transactions. The knowledge obtained from a market basket analysis can be employed by a business to recognize products frequently sold together in order to determine recommendations and cross-sell and up-sell opportunities. It can also be used to improve the efficiency of a promotional campaign.
Run Apriori on “transaction” data set. Set the “Type” of “COD” as “Typeless”, set the “direction” of all the other 20 categories as “Both”, set their “Type” as “Flag”. Set “Minimum antecedent support” to be 5%, “Minimum confidence” to be 50%, and “Maximum number of antecedents” to be 5 in the modeling node (Apriori node). In general you should explore by trying different values of these parameters to see what type of rules you get.
∙Hand-in: The list of association rules generated by the model.
∙Sort the rules by lift, support, and confidence, respectively to see the rules identified.
Hand-in: For each case, choose top 5 rules (note: make sure no redundant rules in the 5 rules) and give 2-3 lines comments. Many of the rules will be logically redundant and therefore will have to be eliminated after you think carefully about them.