Lecture 4: Main Techniques of Data Mining


Maximize L.
MLE Example

Coin toss five times: {H,H,H,H,T}. Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is: $L = (0.5)^5 \approx 0.03$
Similarity Measures

Determine similarity between two objects. Similarity characteristics:

Alternatively, distance measures quantify how unlike or dissimilar objects are.

Popular technique for classification; each leaf node indicates the class to which the corresponding tuple belongs.
Decision Tree Example
Decision Trees

A Decision Tree Model is a computational model consisting of three parts:

• Decision tree
• Algorithm to create the tree
• Algorithm that applies the tree to data

Creation of the tree is the most difficult part. Processing is basically a search similar to that in a binary search tree (although DT may not be binary).
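Applying a tree to data can be sketched as a walk from the root to a leaf. This is only a minimal illustration, assuming the tree is stored as nested Python dictionaries; the attribute name `height`, the thresholds, and the class labels are hypothetical and not taken from the lecture. The root has three arcs, which shows that the tree need not be binary.

```python
# Hypothetical tree: an internal node asks a question about one attribute,
# each arc is a possible answer, and each leaf holds the predicted class.
tree = {
    "attribute": "height",
    "branches": [
        # (exclusive upper bound for this arc, subtree or leaf label)
        (1.7, "short"),
        (1.95, "medium"),
        (float("inf"), "tall"),
    ],
}


def classify(node, record):
    """Walk from the root to a leaf, following the arc that matches the record."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        for upper_bound, child in node["branches"]:
            if value < upper_bound:
                node = child
                break
        else:
            raise ValueError("no arc matches the record")
    return node


label = classify(tree, {"height": 1.8})
```

Creating the tree from training data is the harder part and is not shown here.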
Decision Tree Algorithm
DT Advantages/Disadvantages

Advantages:

Easy to understand. Easy to generate rules.

Disadvantages:

May suffer from overfitting. Classifies by rectangular partitioning. Does not easily handle nonnumeric data. Can be quite large; pruning is necessary.

Why square? Root Mean Square Error (RMSE)
Jackknife Estimate

Jackknife Estimate: an estimate of a parameter is obtained by omitting one value from the set of observed values. Ex: estimate of mean for X={x1, … , xn}

Solves estimation with incomplete data. Obtain initial estimates for parameters. Iteratively use estimates for missing data and continue until convergence.
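The iteration described above can be sketched for the simplest case, estimating a mean when some values are missing. The function name `em_mean`, the sample values, the number of missing items, and the initial guess are all assumptions chosen for illustration.

```python
def em_mean(observed, n_missing, initial_guess, tol=1e-9, max_iter=1000):
    """Estimate a mean when n_missing values are unobserved."""
    mu = initial_guess
    n = len(observed) + n_missing
    for _ in range(max_iter):
        # E-step: use the current estimate in place of each missing value.
        filled_total = sum(observed) + n_missing * mu
        # M-step: re-estimate the mean from the completed data.
        new_mu = filled_total / n
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Illustrative data: four observed values, two missing, initial guess 3.
mu_hat = em_mean([1, 5, 10, 4], n_missing=2, initial_guess=3.0)
```

In this simple setting the iteration converges to the mean of the observed values, whatever the initial guess.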

Training Data:
ID   Income  Credit     Class  xi
1    4       Excellent  h1     x4
2    3       Good       h1     x7
3    2       Excellent  h1     x2
4    3       Good       h1     x7
5    4       Good       h1     x8
6    2       Excellent  h1     x2
7    3       Bad        h2     x11
8    2       Bad        h2     x10
9    3       Bad        h3     x11
10   1       Bad        h4     x9

R contains 100 employees. 99 have salary information. The mean salary of these is $50,000. Use $50,000 as the value of the remaining employee’s salary. Is this a good idea?
Hypothesis Testing

Find model to explain behavior by creating and then testing a hypothesis about the data. Exact opposite of usual DM approach. H0 – Null hypothesis; Hypothesis to be tested. H1 – Alternative hypothesis
Regression

Predict future values based on past values. Linear Regression assumes a linear relationship exists: $y = c_0 + c_1 x_1 + \dots + c_n x_n$. Find the values $c_0, \dots, c_n$ that best fit the data.
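For a single predictor, "best fit" is commonly taken to mean least squares; the lecture text does not state the criterion, so that is an assumption here. The sketch below uses the closed-form least-squares solution, and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = c0 + c1 * x for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    c1 = sxy / sxx           # slope
    c0 = y_bar - c1 * x_bar  # intercept
    return c0, c1


# Illustrative past values lying exactly on y = 1 + 2x.
c0, c1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = c0 + c1 * 5  # predicted value at x = 5
```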
Bayes Theorem

Posterior Probability: P(h1|xi). Prior Probability: P(h1). Bayes Theorem: $P(h_1 \mid x_i) = \frac{P(x_i \mid h_1)\, P(h_1)}{P(x_i)}$

Assign probabilities of hypotheses given a data value.
Bayes Theorem Example

Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police. Assign twelve data values for all combinations of credit and income:
Linear Regression
Correlation

Examine the degree to which the values for two variables behave similarly. Correlation coefficient r:
• 1 = perfect correlation
• -1 = perfect but opposite correlation
• 0 = no correlation
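The formula for r did not survive in this text. The sketch below assumes the usual Pearson correlation coefficient, $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$, and checks the three reference cases on made-up data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_same = pearson_r([1, 2, 3], [2, 4, 6])      # values rise together
r_opposite = pearson_r([1, 2, 3], [6, 4, 2])  # one rises as the other falls
r_none = pearson_r([1, 2, 3], [1, 2, 1])      # no linear relationship
```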
Lecture 4
Data Mining Techniques Outline
Goal: Provide an overview of basic data mining techniques.

• Statistical
  • Point Estimation
  • Models Based on Summarization
  • Bayes Theorem
  • Hypothesis Testing
  • Regression and Correlation
• Similarity Measures
• Decision Trees
• Neural Networks
  • Activation Functions
• Genetic Algorithms
Credit \ Income   1    2    3    4
Excellent         x1   x2   x3   x4
Good              x5   x6   x7   x8
Bad               x9   x10  x11  x12

From training data: P(h1) = 60%; P(h2)=20%; P(h3)=10%; P(h4)=10%.
Bayes Example (cont’d)

Example

Given the set {1, 3, 9, 15, 20}, find the jackknife estimate of the mean and the jackknife estimate of the standard deviation of the mean.
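One way to work this example is sketched below. Each leave-one-out estimate is the mean of the sample with one value omitted, and the jackknife estimate of the mean is their average. For the standard deviation of the mean, the sketch assumes the standard jackknife standard-error formula $\sqrt{\frac{n-1}{n} \sum_i (\hat{\theta}_{(i)} - \bar{\theta})^2}$, which the text does not state; the function name is likewise an assumption.

```python
import math


def jackknife_mean(values):
    """Return leave-one-out means, their average, and the jackknife standard error."""
    n = len(values)
    total = sum(values)
    # Leave-one-out estimates: the mean of the sample with x_i omitted.
    loo = [(total - x) / (n - 1) for x in values]
    estimate = sum(loo) / n
    # Standard jackknife standard error (assumed formula, see the lead-in).
    se = math.sqrt((n - 1) / n * sum((t - estimate) ** 2 for t in loo))
    return loo, estimate, se


loo, estimate, se = jackknife_mean([1, 3, 9, 15, 20])
# loo is [11.75, 11.25, 9.75, 8.25, 7.0]; estimate is 9.6; se squared is 12.76
```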
Maximum Likelihood Estimate (MLE)

Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model. The joint probability of observing the sample data is found by multiplying the individual probabilities. Likelihood function: $L(\Theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \Theta)$
Similarity Measures
Distance Measures

Measure dissimilarity between objects
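The specific measures on this slide did not survive in the text. As an assumption, the sketch below shows two common choices, Euclidean and Manhattan distance, for objects represented as equal-length numeric tuples.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


d_euclid = euclidean((0, 0), (3, 4))  # 5.0
d_manhat = manhattan((0, 0), (3, 4))  # 7
```

Both return 0 for identical objects and grow as the objects become more dissimilar.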
Twenty Questions Game
Decision Trees

Decision Tree (DT):

Tree where the root and each internal node is labeled with a question. The arcs represent each possible answer to the associated question. Each leaf node represents a prediction of a solution to the problem.
Chi Squared Statistic

O: observed value. E: expected value based on the hypothesis.

$\chi^2 = \sum \frac{(O - E)^2}{E}$

Ex: O = {50, 93, 67, 78, 87}, E = 75, $\chi^2 = 15.55$ and therefore significant.
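The example value can be checked with a few lines of Python, summing $(O - E)^2 / E$ over the five observations. The comparison against 9.488 assumes 4 degrees of freedom and a 0.05 significance level, neither of which is stated in the text.

```python
def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over paired observed and expected values."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [50, 93, 67, 78, 87]
expected = [75] * len(observed)
chi2 = chi_squared(observed, expected)  # 1166 / 75

# Assumed threshold: critical value for 4 degrees of freedom at the 0.05 level.
CRITICAL_VALUE = 9.488
significant = chi2 > CRITICAL_VALUE
```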
Point Estimation

Point Estimate: estimate a population parameter. May be made by calculating the parameter for a sample. May be used to predict the value for missing data. Ex:

However, if the probability of an H is 0.8 then: $L = (0.8)^4 \times 0.2 \approx 0.08$
MLE Example (cont’d)

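General likelihood formula: $L(p \mid x_1, \ldots, x_5) = \prod_{i=1}^{5} p^{x_i} (1 - p)^{1 - x_i}$, where $x_i = 1$ for H and $x_i = 0$ for T.

Estimate for p is then 4/5 = 0.8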
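The coin-toss numbers can be reproduced with a short sketch. It evaluates the likelihood of {H,H,H,H,T} at p = 0.5 and p = 0.8, then searches a grid of candidate values for the maximizer; the grid search is an illustrative stand-in for maximizing L analytically, and the function name is an assumption.

```python
def likelihood(p, tosses):
    """Probability of the observed sequence when P(H) = p."""
    result = 1.0
    for t in tosses:
        result *= p if t == "H" else (1 - p)
    return result


tosses = ["H", "H", "H", "H", "T"]
fair = likelihood(0.5, tosses)    # 0.5 ** 5
biased = likelihood(0.8, tosses)  # 0.8 ** 4 * 0.2

# Grid search over p = 0.00, 0.01, ..., 1.00 for the value that maximizes L.
candidates = [i / 100 for i in range(101)]
p_hat = max(candidates, key=lambda p: likelihood(p, tosses))
```

The grid maximizer agrees with the closed-form estimate, the fraction of heads in the sample.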
Expectation-Maximization (EM)
Estimation Error

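Bias: difference between the expected value of the estimate and the actual value: $\mathrm{Bias} = E(\hat{\Theta}) - \Theta$

Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: $\mathrm{MSE}(\hat{\Theta}) = E\left[(\hat{\Theta} - \Theta)^2\right]$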
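These quantities can be approximated empirically when several estimates of a known value are available. In the sketch below the average over the estimates stands in for the expected value, which is an assumption, and the numbers are made up for illustration. RMSE is the square root of MSE, which puts the error back in the units of the data.

```python
import math


def bias(estimates, actual):
    """Average estimate minus the actual value."""
    return sum(estimates) / len(estimates) - actual


def mse(estimates, actual):
    """Average squared difference between each estimate and the actual value."""
    return sum((e - actual) ** 2 for e in estimates) / len(estimates)


estimates = [9.0, 10.0, 11.0, 12.0]  # illustrative estimates
actual = 10.0                        # illustrative true value

b = bias(estimates, actual)  # 0.5
m = mse(estimates, actual)   # 1.5
rmse = math.sqrt(m)
```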
EM Example
EM Algorithm
Models Based on Summarization

Visualization: Frequency distribution, mean, variance, median, mode, etc.

Box Plot:
Scatter Diagram
Bayes Example (cont’d)

Calculate P(xi|hj) and P(xi).
Ex: P(x7|h1)=2/6; P(x4|h1)=1/6; P(x2|h1)=2/6; P(x8|h1)=1/6; P(xi|h1)=0 for all other xi.
Predict the class for x4: calculate P(hj|x4) for all hj and place x4 in the class with the largest value.
Ex: P(h1|x4) = P(x4|h1) P(h1) / P(x4) = (1/6)(0.6)/0.1 = 1, so x4 is in class h1.
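The calculation can be reproduced from the ten training tuples. The sketch below estimates the prior, likelihood, and evidence by counting and then applies Bayes Theorem; exact fractions are used to avoid rounding, and the variable and function names are assumptions.

```python
from collections import Counter
from fractions import Fraction

# (class, data value) pairs for training IDs 1 to 10.
training = [
    ("h1", "x4"), ("h1", "x7"), ("h1", "x2"), ("h1", "x7"), ("h1", "x8"),
    ("h1", "x2"), ("h2", "x11"), ("h2", "x10"), ("h3", "x11"), ("h4", "x9"),
]

n = len(training)
class_counts = Counter(h for h, _ in training)
value_counts = Counter(x for _, x in training)
joint_counts = Counter(training)


def posterior(h, x):
    """P(h | x) = P(x | h) * P(h) / P(x), with each term estimated by counting."""
    prior = Fraction(class_counts[h], n)
    likelihood = Fraction(joint_counts[(h, x)], class_counts[h])
    evidence = Fraction(value_counts[x], n)
    return likelihood * prior / evidence


posteriors = {h: posterior(h, "x4") for h in class_counts}
predicted = max(posteriors, key=posteriors.get)  # class with the largest posterior
```

A value such as x11, which occurs under both h2 and h3, gives a posterior of 1/2 for each, so the prediction is less clear-cut there.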
