
Data Mining Mock Exam (《数据挖掘》模拟卷)
Part One. Fill in the blanks (1 point per blank, 20 points total)
1. In data mining, the commonly used clustering algorithms include: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
2. The multidimensional data model of a data warehouse can take three different forms, namely the star schema, the snowflake schema, and the fact constellation schema.
3. From the point of view of data analysis, data mining can be divided into two categories: descriptive data mining and predictive data mining.
4. Given a base cuboid, there are three choices for its materialization: no materialization, full materialization, and partial materialization.
5. The three main research areas underpinning current data mining research are: database technology, statistics, and machine learning.
6. There are four types of concept hierarchies, namely ____, ____, ____, and ____.
7. The two commonly used data generalization methods for large data sets are ____ and ____.
Part Two. Single-choice questions (choose the one correct answer and fill it in the brackets; 2 points each, 20 points total)
1. Which of the following classification methods is a neural network learning algorithm? ( )
A. decision tree induction
B. Bayesian classification
C. backpropagation classification
D. case-based reasoning
2. Confidence is an interestingness measure of ( ).
A. simplicity
B. certainty
C. utility
D. novelty
3. To which of the following situations does outlier mining apply? ( )
A. target market analysis
B. market basket analysis
C. pattern recognition
D. credit card fraud detection
4. The cuboid that holds the lowest level of summarization is called the ( ).
A. apex cuboid
B. lattice of cuboids
C. base cuboid
D. dimension
5. The purpose of data reduction is to ( ).
A. fill in missing data values
B. integrate data from multiple data sources
C. obtain a compressed representation of the data set
D. normalize the data
6. Which of the following data preprocessing techniques can be used to smooth data and remove noise? ( )
A. data cleaning
B. data integration
C. data transformation
D. data reduction
7. ( ) reduces the number of values of a given continuous attribute by dividing the range of the attribute into intervals.
A. concept hierarchy
B. discretization
C. binning
D. histogram
8. Among the following data operations, the ( ) operation is not an OLAP operation on the multidimensional data model.
A. roll-up
B. select
C. slice
D. pivot
9. Suppose the current data mining task is to describe the general characteristics of the customers in a database. The data mining function usually used is ( ).
A. association analysis
B. classification and prediction
C. outlier analysis
D. evolution analysis
E. concept description
10. Which of the following statements is true? ( )
A. Both classification and clustering are supervised learning.
B. Both classification and clustering are unsupervised learning.
C. Classification is supervised learning, and clustering is unsupervised learning.
D. Classification is unsupervised learning, and clustering is supervised learning.
Part Three. Multiple-choice questions (choose two or more correct answers and fill them in the brackets; 3 points each, 15 points total)
1. According to the number of data dimensions involved, association rules can be classified as: ( )
A. Boolean association rules
B. single-dimensional association rules
C. multidimensional association rules
D. multilevel association rules
2. Which of the following are possible tasks of data transformation? ( )
A. data compression
B. data generalization
C. dimensionality reduction
D. normalization
3. Task-relevant data refers to ( ).
A. the name of the database or data warehouse containing the relevant data
B. the conditions for selecting the relevant data
C. the relevant attributes or dimensions
D. instructions regarding the ordering or grouping of the retrieved data
4. From an architectural point of view, data warehouse models include the following categories: ( )
A. enterprise warehouse
B. data mart
C. virtual warehouse
D. information warehouse
5. The main features of a data warehouse include ( ).
A. subject-oriented
B. integrated
C. time-variant
D. nonvolatile
Part Four. Short-answer questions (25 points)
1. Briefly describe the basic idea of attribute-oriented induction, and explain when attribute removal is used and when attribute generalization is used. (7 points)
2. When performing OLAP, why do we need a separate data warehouse instead of working directly on the operational database? (6 points)
3. What search strategies are used when mining multilevel association rules with reduced support? What are the characteristics of each? (6 points)
4. Compared with other applications, what advantages does data mining have in e-commerce? (6 points)
Part Five. Algorithm questions (20 points)
1. The Apriori algorithm is a common algorithm for mining single-dimensional Boolean association rules from transaction databases. The algorithm uses prior knowledge of the properties of frequent itemsets to find frequent itemsets from candidate itemsets.
(1) What are the two basic steps of the Apriori algorithm? (2 points)
(2) The transaction database D (|D| = 4) is shown in the table below. Use diagrams and explanations to show how the Apriori algorithm finds the frequent itemsets in D. (Assume the minimum support count is 2.) (10 points)
TID     Item ID list
T100    A, C, D
T200    B, C, E
T300    A, B, C, E
T400    B, E
2. The decision tree induction algorithm is a commonly used classification algorithm.
(1) Briefly describe the basic strategy of the decision tree induction algorithm. (4 points)
(2) Using the decision tree algorithm, construct the decision tree buysPCGame that determines whether a customer will buy PC Game, based on the customer's age (three age groups: <18, 18-23, >23), income (values high, medium, low), student (values yes and no), and credit_rating (values fair and excellent). Suppose the data have already been partitioned once as shown in the results below, and the information gain of each remaining attribute has been computed for each partition:
For customers with age < 18: Gain(income) = 0.022, Gain(student) = 0.162, Gain(credit_rating) = 0.323
For customers with age > 23: Gain(income) = 0.042, Gain(student) = 0.462, Gain(credit_rating) = 0.155
Please draw the decision tree buysPCGame according to the above results. (4 points)
"Data mining" simulation volume answer
Part One. Fill in the blanks (1 point per blank, 20 points total)
1. Partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
2. The star schema, the snowflake schema, and the fact constellation schema.
3. Descriptive data mining and predictive data mining.
4. No materialization, full materialization, and partial materialization.
5. Database technology, statistics, and machine learning.
6. Schema hierarchies, set-grouping hierarchies, operation-derived hierarchies, and rule-based hierarchies.
7. The data cube (OLAP) approach and the attribute-oriented induction approach.
Part Two. Single-choice questions (2 points each, 20 points total)
1. C    2. B    3. D    4. C    5. C
6. A    7. B    8. B    9. E    10. C
Part Three. Multiple-choice questions (3 points each, 15 points total)
1. BD    2. BD    3. ABCD    4. ABC    5. ABCD
Part Four. Short-answer questions (25 points)
1. Briefly describe the basic idea of attribute-oriented induction, and explain when attribute removal is used and when attribute generalization is used. (7 points)
Answer: The basic idea is as follows. First, use a relational database query to collect the task-relevant data. Then, examine the number of distinct values of each attribute in the task-relevant data and perform generalization, either by attribute removal or by attribute generalization. Aggregation is performed by merging identical generalized tuples and accumulating their counts, which compresses the data set after generalization. The resulting generalized relation can then be mapped into different forms, such as charts or rules, for presentation to the user. (3 points)
Attribute removal is used in the following case: there is a large set of distinct values for an attribute of the initial working relation, but either (1) there is no generalization operator on the attribute, or (2) its higher-level concepts are expressed in terms of other attributes. (2 points)
Attribute generalization is used in the following case: there is a large set of distinct values for an attribute of the initial working relation, and there exists a generalization operator on the attribute. (2 points)
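To make this concrete, here is a minimal, illustrative Python sketch of attribute-oriented induction (not part of the original answer). It assumes a toy one-step concept hierarchy passed in as a dictionary; the names aoi, hierarchies, and threshold are hypothetical.

from collections import Counter

def aoi(rows, hierarchies, threshold=3):
    """Attribute-oriented induction sketch (illustrative assumption).
    rows: list of dicts -- the initial working relation.
    hierarchies: attr -> {value: higher-level concept}, one climb per call.
    threshold: maximum number of distinct values allowed per attribute."""
    generalized = [dict(r) for r in rows]
    for attr in list(generalized[0]):
        if len({r[attr] for r in generalized}) <= threshold:
            continue  # few distinct values: keep the attribute unchanged
        if attr in hierarchies:
            # Attribute generalization: a generalization operator exists,
            # so replace each value with its higher-level concept.
            for r in generalized:
                r[attr] = hierarchies[attr].get(r[attr], "other")
        else:
            # Attribute removal: many distinct values but no generalization
            # operator (or the concept is expressed by other attributes).
            for r in generalized:
                del r[attr]
    # Aggregation: merge identical generalized tuples, accumulate counts.
    counts = Counter(tuple(sorted(r.items())) for r in generalized)
    return [dict(items) | {"count": n} for items, n in counts.items()]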
2. When performing OLAP, why do we need a separate data warehouse instead of working directly on the operational database? (6 points)
Answer: A separate data warehouse is used for OLAP processing for the following reasons:
(1) To improve the performance of both systems. An operational database is designed and optimized for OLTP, not for OLAP; processing OLAP queries in an operational database would greatly degrade the performance of its operational tasks. A data warehouse, in contrast, is designed for OLAP and provides optimization for complex OLAP queries, multidimensional views, and summarization functions.
(2) The two systems have different functions. An operational database supports the concurrent processing of multiple transactions and therefore needs concurrency control and recovery mechanisms, whereas a data warehouse usually requires only read-only access to data records; applying transaction-processing concurrency and recovery mechanisms to OLAP operations would significantly degrade OLAP performance.
(3) The two systems hold different data. A data warehouse stores historical data, whereas the data stored in an operational database is usually only the most recent data.
3. What search strategies are used when mining multilevel association rules with reduced support? What are the characteristics of each? (6 points)
Answer: The search strategies used when mining multilevel association rules with reduced support include the following (a small code sketch contrasting them appears after the list):
Level-by-level independent: a full-breadth search in which no background knowledge of frequent itemsets is used for pruning; every node is examined, regardless of whether its parent node is frequent. Its characteristic is that the condition is loose, which may lead to examining a large number of infrequent items at lower levels and to finding associations of little importance. (2 points)
Level-cross filtering by k-itemset: a k-itemset at level i is examined if and only if its corresponding parent k-itemset at level (i-1) is frequent. Its characteristic is that the restriction is very strong, and some valuable patterns may be filtered out. (2 points)
Level-cross filtering by single item: an item at level i is examined if and only if its parent node at level (i-1) is frequent. This is a compromise between the two extreme strategies above. (2 points)
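The following minimal Python sketch (an illustration, not part of the original answer) contrasts the three strategies as examine/skip tests. The names parent and frequent_prev are assumptions: parent maps each node at level i to its parent concept at level (i-1), and frequent_prev holds what was found frequent at level (i-1).

def examine_level_by_level_independent(item, frequent_prev, parent):
    """Examine every node, regardless of whether its parent is frequent."""
    return True

def examine_level_cross_single_item(item, frequent_prev, parent):
    """Examine a level-i item only if its parent at level (i-1) is frequent."""
    return parent[item] in frequent_prev

def examine_level_cross_k_itemset(itemset, frequent_prev, parent):
    """Examine a level-i k-itemset only if the corresponding parent
    k-itemset at level (i-1) is frequent."""
    return frozenset(parent[i] for i in itemset) in frequent_prev

# Toy usage: "2% milk" is examined under single-item filtering only
# because its parent "milk" was frequent at the level above.
parent = {"2% milk": "milk", "wheat bread": "bread"}
frequent_prev = {"milk"}
print(examine_level_cross_single_item("2% milk", frequent_prev, parent))  # True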
4. Compared with other applications, what advantages does data mining have in e-commerce? (6 points)
Answer: Compared with other applications, the advantages of data mining in e-commerce include:
Huge amounts of data: e-commerce clickstreams (Clickstreams) generate large volumes of data for mining.
Rich record information: a well-designed web site captures rich information about goods, categories, visitors, and so on.
Clean data: the electronic data collected from e-commerce sites requires no manual input and no integration from legacy systems.
Results that are easy to put into practice: in e-commerce, much of the discovered knowledge can be applied directly.
Easily measured return on investment: since all the data are electronic, it is very convenient to generate various reports and to compute various kinds of revenue.
Part Five. Algorithm questions (20 points)
1. Answer:
(1) The two basic steps of the Apriori algorithm are the join step and the prune step. (2 points)
(2) Using the Apriori property, C3 is generated from L2 = {{A, C}, {B, C}, {B, E}, {C, E}} as follows:
1. Join: C3 = L2 ⋈ L2 = {{A, B, C}, {A, C, E}, {B, C, E}}.
2. Prune using the Apriori property: every subset of a frequent itemset must also be frequent, so any candidate in C3 with an infrequent subset can be deleted:
The 2-item subsets of {A, B, C} are {A, B}, {A, C}, and {B, C}; since {A, B} is not an element of L2, this candidate is deleted.
The 2-item subsets of {A, C, E} are {A, C}, {A, E}, and {C, E}; since {A, E} is not an element of L2, this candidate is deleted.
The 2-item subsets of {B, C, E} are {B, C}, {B, E}, and {C, E}; all of them are elements of L2, so this candidate is kept.
3. Thus, after pruning, C3 = {{B, C, E}}. Since {B, C, E} appears in two transactions (T200 and T300), its support count meets the minimum of 2, so L3 = {{B, C, E}}.
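For reference, below is a minimal runnable Python sketch (not part of the original answer) that reproduces the full level-wise search on the database D from the question, printing L1, L2, and L3. The helper names support_count and apriori_gen are illustrative, and MIN_SUPPORT_COUNT matches the question's minimum support count of 2.

from itertools import combinations

D = [
    {"A", "C", "D"},       # T100
    {"B", "C", "E"},       # T200
    {"A", "B", "C", "E"},  # T300
    {"B", "E"},            # T400
]
MIN_SUPPORT_COUNT = 2

def support_count(itemset, transactions):
    """Number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori_gen(prev_frequent, k):
    """Join step: merge (k-1)-itemsets whose union has k items.
    Prune step: drop candidates with an infrequent (k-1)-subset."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k:
                candidates.add(frozenset(union))
    return {c for c in candidates
            if all(frozenset(s) in prev_frequent
                   for s in combinations(c, k - 1))}

# Level-wise search: L1 -> C2 -> L2 -> C3 -> L3 -> ...
items = {frozenset([i]) for t in D for i in t}
L = {c for c in items if support_count(c, D) >= MIN_SUPPORT_COUNT}
k = 2
while L:
    print(f"L{k - 1}:", sorted(sorted(s) for s in L))
    C = apriori_gen(L, k)
    L = {c for c in C if support_count(c, D) >= MIN_SUPPORT_COUNT}
    k += 1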
2. Answer:
(1) The basic strategy of the decision tree induction algorithm is as follows (a minimal code sketch follows the list):
The tree starts as a single node representing the training samples.
If the samples are all of the same class, the node becomes a leaf and is labeled with that class.
Otherwise, the algorithm uses an entropy-based measure known as information gain as heuristic information for selecting the attribute that will best separate the samples into individual classes.
For each known value of the test attribute, a branch is created, and the samples are partitioned accordingly.
The algorithm recursively applies the same process to form a decision tree for the samples of each partition. Once an attribute has appeared at a node, it need not be considered in any of the node's descendants.
The recursive partitioning stops only when one of the following conditions holds:
(a) all samples of a given node belong to the same class;
(b) no remaining attributes can be used to further partition the samples; in this case, the node is converted into a leaf labeled with the class held by the majority of the samples (majority voting);
(c) a branch has no samples; in this case, a leaf is created and labeled with the majority class of the samples in the parent partition.
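Below is a minimal ID3-style Python sketch of this strategy (an illustration, not part of the original answer), using information gain as the attribute-selection measure; the names entropy, info_gain, and build_tree are assumptions.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Expected reduction in entropy from partitioning on attr."""
    n = len(rows)
    expected = 0.0
    for value in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        expected += len(subset) / n * entropy(subset)
    return entropy(labels) - expected

def build_tree(rows, labels, attrs):
    # Stop (a): all samples belong to the same class -> leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop (b): no attributes left -> majority-vote leaf.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        # Stop (c) holds implicitly: only observed values get branches.
        tree[best][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best])
    return tree

# Toy usage with hypothetical data (not the exam's data set):
rows = [{"student": "yes", "income": "high"},
        {"student": "no", "income": "high"}]
print(build_tree(rows, ["buys", "not"], ["student", "income"]))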
(2) The decision tree buysPCGame is as follows (the original figure is omitted here; based on the information gains above, the root node tests age; the age < 18 branch next tests credit_rating, which has the largest gain, 0.323; the age > 23 branch next tests student, which has the largest gain, 0.462; and the 18-23 partition, for which no further gains are given, forms a leaf).
