University of Science and Technology of China (USTC): Data Mining, Slide 10
– Measuring the weights of oranges, but a few grapefruit are mixed in
Natural variation
– Unusually tall people
Data errors
– 200 pound 2 year old
Data Mining
Efficiency
Context
– Professional basketball team
03/06/2011
Variants of Anomaly Detection Problems
Given a data set D, find all data points x ∈ D with anomaly scores greater than some threshold t
Given a data set D, find all data points x ∈ D having the top-n largest anomaly scores
Given a data set D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
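The three problem variants above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the score used for the third variant (distance from the mean of D in standard deviations) is an assumption chosen for simplicity, not a score the slides prescribe.

```python
import statistics

def anomalies_above_threshold(scores, t):
    """Variant 1: indices of points whose anomaly score exceeds threshold t."""
    return [i for i, s in enumerate(scores) if s > t]

def top_n_anomalies(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]

def score_against(D, x):
    """Variant 3: score a test point x against a mostly normal data set D.

    Assumed score: distance of x from the mean of D, in standard deviations.
    """
    mean = statistics.fmean(D)
    stdev = statistics.stdev(D)
    return abs(x - mean) / stdev

# Example scores, made up for illustration
scores = [0.1, 0.2, 3.5, 0.3, 2.1]
print(anomalies_above_threshold(scores, 2.0))
print(top_n_anomalies(scores, 1))
print(score_against([9, 10, 11, 10, 10], 20))
```

The first two variants only need a list of scores, so they work with any scoring method; the third needs a reference data set to score against.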
Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)
Apply a statistical test that depends on
– Data distribution
– Parameters of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
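A minimal sketch of such a test, assuming the data are roughly normal and using a z-score cutoff of 3 as the confidence limit. Both the cutoff and the sample heights are assumptions for illustration; the slides do not fix a particular test.

```python
import statistics

def z_score_outliers(values, z_crit=3.0):
    """Return values lying more than z_crit standard deviations from the mean.

    Assumes a normal model whose mean and variance are estimated
    from the data itself.
    """
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > z_crit]

# Hypothetical heights in cm, with one implausible entry
heights_cm = [168, 170, 172, 169, 171, 170, 173, 167, 170, 171,
              169, 172, 170, 168, 171, 170, 169, 172, 171, 170, 250]
print(z_score_outliers(heights_cm))
```

Because the mean and standard deviation are estimated from the same data, a large outlier inflates both, which is one way several outliers can mask one another in small samples.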
– Supervised
Anomalies are regarded as a rare class
Need to have training data
Additional Anomaly Detection Techniques
Proximity-based
– One in a thousand occurs often if you have lots of data
– Context is important, e.g., freezing temps in July
Can be important or a nuisance
Pattern matching
– Create profiles or templates of atypical but important events or objects
– Algorithms to detect these patterns are usually simple and efficient
Issues
– Identifying the distribution of a data set
Heavy-tailed distribution
– Number of attributes
– Is the data a mixture of distributions?
Noise doesn’t necessarily produce unusual values or objects
Noise is not interesting
Anomalies may be interesting if they are not a result of noise
Noise and anomalies are related but distinct concepts
Graphical Approaches
Boxplots or scatter plots
Limitations
– Not automatic
– Subjective
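The whisker rule behind a boxplot can be applied without drawing anything, which addresses the "not automatic" limitation to some degree. The sketch below assumes the common convention of flagging points more than 1.5 interquartile ranges beyond the quartiles; that multiplier and the sample data are assumptions, not something stated in the slides.

```python
import statistics

def boxplot_outliers(values, whisker=1.5):
    """Flag values beyond the boxplot whiskers (quartile -/+ whisker * IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Made-up measurements with one unusually large value
measurements = [10, 12, 11, 13, 12, 11, 10, 12, 40]
print(boxplot_outliers(measurements))
```

The choice of multiplier is still a judgment call, so the subjectivity does not disappear; it moves into a parameter.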
Convex Hull Method
Other approaches assign a score to all points
– This score measures the degree to which an object is an anomaly
– This allows objects to be ranked
In the end, you often need a binary decision
Extreme points are assumed to be outliers
Use convex hull method to detect extreme values
What if the outlier occurs in the middle of the data?
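A short sketch makes this limitation concrete. It computes the 2-D convex hull with Andrew's monotone chain algorithm (a standard choice picked here for illustration, not necessarily the one the course had in mind) and treats hull vertices as the extreme points. The sample points are made up.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return the hull vertices of 2-D points in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:                      # build the lower chain left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build the upper chain right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]     # drop duplicated end points

# Four corner points plus two interior points
points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
extreme = convex_hull(points)
print(extreme)
print((2, 2) in extreme)
```

Only the four corners end up on the hull. A point such as (2, 2) can never be flagged by this method, however unusual it is relative to its neighbors.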
Distinction Between Noise and Anomalies
Noise is erroneous, perhaps random, values or contaminating objects
– Weight recorded incorrectly
– Grapefruit mixed in with the oranges
General Issues: Number of Attributes
Many anomalies are defined in terms of a single attribute
– Height
– Shape
– Color
Can be hard to find an anomaly using all attributes
– The set of data points that are considerably different than the remainder of the data
Natural implication is that anomalies are relatively rare
Anomaly Detection
Dr. Hui Xiong Rutgers University
Introduction to Data Mining
08/06/2006
Anomaly/Outlier Detection
What are anomalies/outliers?
General Issues: Anomaly Scoring
Many anomaly detection techniques provide only a binary categorization
– An object is an anomaly or it isn’t
– This is especially true of classification-based approaches
Statistical Approaches
Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.
– Should this credit card transaction be flagged?
– Still useful to have a score
How many anomalies are there?
Other Issues for Anomaly Detection
Find all anomalies at once or one at a time
– Swamping
– Masking
Evaluation
– How do you measure performance?
– Supervised vs. unsupervised situations
Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html
Causes of Anomalies
Data from different classes
– Noisy or irrelevant attributes
– Object is only anomalous with respect to some attributes
However, an object may not be anomalous in any one attribute
– 10 foot tall 2 year old
– Unusually high blood pressure
Importance of Anomaly Detection
Ozone Depletion History
In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels.
Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
– Anomalies are points far away from other points
– Can detect this graphically in some cases
Density-based
– Low density points are outliers
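Both ideas can be illustrated with one simple score: the distance from each point to its k-th nearest neighbor. A large distance means the point is far from the others (proximity view), and the inverse of that distance can serve as a rough density estimate (density view). The function name, the choice k=2, and the sample points are assumptions for illustration only.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# A tight cluster of four points and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = knn_distance_scores(points, k=2)
most_anomalous = max(range(len(points)), key=lambda i: scores[i])
print(points[most_anomalous])
```

This brute-force version compares every pair of points, so its cost grows quadratically with the size of the data set.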
Model-Based Anomaly Detection
Build a model for the data and see
– Unsupervised
Anomalies are those points that don’t fit well
Anomalies are those points that distort the model
Examples:
– Statistical distribution
– Clusters
– Regression
– Geometric
– Graph
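As a sketch of the regression case, one can fit a least-squares line and use each point's absolute residual as its anomaly score: the points that "don't fit well" have the largest residuals. The data and the function name below are made up for illustration.

```python
def residual_scores(xs, ys):
    """Fit y = slope * x + intercept by least squares; return absolute residuals."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]

# Points on y = 2x, except the fifth one
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 30, 12]
scores = residual_scores(xs, ys)
worst = max(range(len(xs)), key=lambda i: scores[i])
print(worst, scores)
```

The other points also get nonzero residuals even though they lie exactly on y = 2x, because the anomalous point pulls the fitted line toward itself. That is the "distort the model" effect in miniature.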