Data Mining Course Slides (PPT)

Witten & Eibe
In between, choose between A and B with appropriate probabilities
Cost Sensitive Learning
There are two types of errors, summarized in a confusion matrix of predicted class (Yes/No) against actual class (Yes/No).
CPH (Cumulative Pct Hits)
Definition: CPH(P,M) = % of all targets in the first P% of the list scored by model M. CPH is frequently called Gains.
Direct Marketing Paradigm
Find most likely prospects to contact
Not everybody needs to be contacted
Number of targets is usually much smaller than the number of prospects
Typical applications:
More in a later lesson
Evaluating numeric prediction
Same strategies: independent test set, cross-validation, significance tests, etc.
Difference: error measures
Different Costs
In practice, true positive and false negative errors often incur different costs
Examples:
Medical diagnostic tests: does X have leukemia?
Loan decisions: approve mortgage for X?
Goal: select the mailing with the highest profit. Evaluation: maximum actual profit from the selected list (with mailing cost = $0.68)
Sum of (actual donation - $0.68) for all records with predicted/expected donation > $0.68
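This evaluation rule can be sketched in a few lines. The record values below are hypothetical, for illustration only; the $0.68 mailing cost is the figure quoted above.

```python
MAILING_COST = 0.68

def mailing_profit(records):
    """Sum of (actual donation - mailing cost) over records whose
    predicted/expected donation exceeds the mailing cost."""
    return sum(actual - MAILING_COST
               for predicted, actual in records
               if predicted > MAILING_COST)

# hypothetical (predicted donation, actual donation) pairs
records = [(5.00, 10.00), (0.50, 2.00), (1.00, 0.00), (3.00, 5.00)]
print(round(mailing_profit(records), 2))
```

Note that the second record is skipped (predicted donation below cost) even though mailing it would have been profitable; the rule only sums over selected records.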
*ROC curves

ROC curves are similar to gains charts
Stands for "receiver operating characteristic". Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel

Differences from gains chart:

y axis shows percentage of true positives in sample rather than absolute number
x axis shows percentage of false positives in sample rather than sample size
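The ROC construction above can be sketched as follows: rank the test set by score and sweep the threshold down the ranking, recording the true-positive and false-positive percentages at each cut. The scored examples are illustrative.

```python
def roc_points(scored):
    """scored: (score, is_target) pairs.
    Returns (false-positive rate, true-positive rate) points,
    one per example, sweeping the threshold down the ranking."""
    ranked = sorted(scored, key=lambda s: -s[0])
    pos = sum(1 for _, y in ranked if y)
    neg = len(ranked) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in ranked:
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))  # x: % false pos, y: % true pos
    return points

scored = [(0.9, True), (0.8, False), (0.7, True), (0.3, False)]
print(roc_points(scored))
```

Every ROC curve starts at (0, 0) (predict nothing positive) and ends at (1, 1) (predict everything positive), which matches the percentage axes described above.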
Web mining: will X click on this link?
Promotional mailing: will X buy the product? …
Cost-sensitive learning
Most learning schemes do not perform cost-sensitive learning
Re-sampling of instances according to costs
Weighting of instances according to costs
Some schemes are inherently cost-sensitive, e.g. naïve Bayes
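The re-sampling idea can be sketched as below: replicate each training instance in proportion to the cost of misclassifying its class, so that a cost-blind learner trained on the result effectively minimizes total cost. The integer replication factors and data are illustrative assumptions, not from the source.

```python
import random

def resample_by_cost(instances, cost, seed=0):
    """instances: (features, label) pairs; cost: label -> integer
    replication factor reflecting misclassification cost."""
    out = []
    for x, y in instances:
        out.extend([(x, y)] * cost[y])  # replicate costly classes
    random.Random(seed).shuffle(out)
    return out

data = [((1.0,), "yes"), ((2.0,), "no"), ((3.0,), "no")]
costs = {"yes": 5, "no": 1}  # missing a "yes" assumed 5x as costly
resampled = resample_by_cost(data, costs)
print(len(resampled))
```

Weighting of instances works the same way but passes the factors to the learner as instance weights instead of physically duplicating records.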
*A sample ROC curve
Jagged curve—one set of test data
Smooth curve—use cross-validation
*ROC curves for two schemes
For a small, focused sample, use method A. For a larger one, use method B.
[Figure: two ROC curves, schemes A and B; % true positives vs. % false positives]
5% of a random list has 5% of the targets
Q: What is expected value for CPH(P,Random) ?
A: Expected value for CPH(P,Random) = P
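This expected value can be checked with a quick simulation; the list size, target rate, and trial count below are arbitrary illustrative choices.

```python
import random

def cph(ranked_targets, p):
    """ranked_targets: 0/1 target flags in list order.
    Fraction of all targets in the first p of the list."""
    k = int(len(ranked_targets) * p)
    return sum(ranked_targets[:k]) / sum(ranked_targets)

rng = random.Random(42)
labels = [1] * 50 + [0] * 950          # 1000 prospects, 5% targets
trials = 200
avg = 0.0
for _ in range(trials):
    rng.shuffle(labels)                # random ordering = random list
    avg += cph(labels, 0.10) / trials
print(round(avg, 2))                   # close to P = 0.10
```

Any single shuffle can deviate from P, but the average over many shuffles converges to P, matching the answer above.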
MAE = (|p1 - a1| + ... + |pn - an|) / n
Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500)
              Predicted: Yes       Predicted: No
Actual: Yes   TP (true positive)   FN (false negative)
Actual: No    FP (false positive)  TN (true negative)

Machine Learning methods usually minimize FP + FN.
Direct marketing maximizes TP.
Improvement on the mean
How much does the scheme improve on simply predicting the average?
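A common way to quantify this is the ratio of the model's squared error to that of the mean-predicting baseline (a ratio below 1 means the scheme beats the mean). The test-set values below are hypothetical, for illustration only.

```python
def mse(pred, actual):
    """Mean-squared error between predictions and actual values."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)

# hypothetical actual and predicted target values
actual = [10.0, 12.0, 8.0, 14.0, 6.0]
model_pred = [9.0, 13.0, 8.0, 13.0, 7.0]
baseline = [sum(actual) / len(actual)] * len(actual)  # always predict the mean

rse = mse(model_pred, actual) / mse(baseline, actual)
print(rse)   # < 1 means the model improves on the mean
```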
[Figure: Cumulative % Hits vs. Pct list, model vs. random]
Lift
Lift (at 5%)
= 21% / 5% = 4.2, i.e. 4.2 times better than random
Lift(P,M) = CPH(P,M) / P
[Figure: Lift curve; y axis 0 to 4.5]
Note: Some (including Witten & Eibe) use "Lift" for what we call CPH.
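The lift formula is a one-liner; the example reproduces the 21%-at-5% figure quoted above.

```python
def lift(cph_at_p, p):
    """Lift(P, M) = CPH(P, M) / P, both arguments as fractions."""
    return cph_at_p / p

print(lift(0.21, 0.05))   # about 4.2, i.e. 4.2x better than random
```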
Model-Sorted List
Use a model to assign a score to each customer.
Sort customers by decreasing score.
Expect more targets (hits) near the top of the list.

No   Score  Target  CustID  Age
1    0.97   Y       1746    …
2    0.95   N       1024    …
3    0.94   Y       2478    …
4    0.93   Y       3820    …
5    0.92   N       4897    …
…    …      …       …       …
99   0.11   N       2734    …

3 hits in the top 5% of the list. If there are 15 targets overall, then the top 5 has 3/15 = 20% of the targets.
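The sort-and-count procedure above can be sketched as follows. The 100 synthetic customers are constructed (illustratively, not from the slide's data) so that 3 of 15 targets fall in the top 5 scores, matching the 20% figure.

```python
def cph_at(customers, p):
    """customers: (score, is_target) pairs.
    Fraction of all targets captured in the top p of the ranked list."""
    ranked = sorted(customers, key=lambda c: -c[0])
    k = int(len(ranked) * p)
    hits = sum(1 for _, t in ranked[:k] if t)
    total = sum(1 for _, t in ranked if t)
    return hits / total

# 100 customers, 15 targets, 3 of which score in the top 5
customers = [(1.0 - i / 100, i in {0, 2, 3} or 20 <= i < 32)
             for i in range(100)]
print(cph_at(customers, 0.05))   # 3 of 15 targets = 0.2
```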
They generate the same classifier no matter what costs are assigned to the different classes. Example: standard decision tree learner.
Simple methods for cost-sensitive learning:
[Figure: x axis P -- percent of the list, 5 to 95]
Lift Properties
Q: Lift(P,Random) =
A: 1 (expected value, can vary)
Q: Lift(100%, M) =
A: 1 (for any model M)
Evaluation – next steps: Lift and Costs
Outline
Lift and Gains charts
*ROC
Cost-sensitive learning
Evaluation for numeric predictions
MDL principle and Occam's razor
Actual target values: a1, a2, …, an
Predicted target values: p1, p2, …, pn
Most popular measure: mean-squared error
MSE = [(p1 - a1)² + ... + (pn - an)²] / n
Easy to manipulate mathematically
Other measures
The root mean-squared error :
RMSE = sqrt( [(p1 - a1)² + ... + (pn - an)²] / n )
The mean absolute error is less sensitive to outliers than the mean-squared error:
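The three measures can be sketched directly from their definitions; the predicted/actual values are illustrative.

```python
from math import sqrt

def mse(p, a):
    """Mean-squared error."""
    return sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / len(a)

def rmse(p, a):
    """Root mean-squared error, in the same units as the target."""
    return sqrt(mse(p, a))

def mae(p, a):
    """Mean absolute error, less sensitive to outliers than MSE."""
    return sum(abs(pi - ai) for pi, ai in zip(p, a)) / len(a)

# illustrative predicted and actual target values
p = [2.0, 4.0, 6.0]
a = [1.0, 4.0, 9.0]
print(mse(p, a), rmse(p, a), mae(p, a))
```

The single error of 3 dominates MSE (it contributes 9 of the squared sum) but enters MAE only linearly, which is exactly the outlier-sensitivity difference noted above.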
Q: Can lift be less than 1?
A: yes, if the model is inverted (all the non-targets precede targets in the list)
Generally, a better model has higher lift
retailers, catalogues, direct mail (and e-mail)
customer acquisition, cross-sell, attrition prediction, ...
Direct Marketing Evaluation
Accuracy on the entire dataset is not the right measure: targets are rare, so a model that predicts "no" for everyone is highly accurate but useless
CPH: Random List vs Model-ranked List
[Figure: Cumulative % Hits vs. Pct list (5 to 95); random list vs. model-ranked list]
5% of a random list has 5% of the targets, but the top 5% of the model-ranked list has 21% of the targets: CPH(5%, model) = 21%.
Approach
develop a target model
score all prospects and rank them by decreasing score
select top P% of prospects for action
How to decide what is the best selection?
KDD Cup 98 – a Case Study
Cost-sensitive learning/data mining is widely used, but rarely published
Well known and public case study: KDD Cup 1998
Data from Paralyzed Veterans of America (charity)