中国科大 数据挖掘slide10

合集下载
  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
– Measuring the weights of oranges, but a few grapefruit are mixed in

Natural variation
– Unusually tall people

Data errors
– 200 pound 2 year old
Data Mining

Efficiency Context
– Professional basketball team

Data Mining
03/06/2011
8
Variants of Anomaly Detection Problems

Given a data set D, find all data points x D with anomaly scores greater than some threshold t Given a data set D, find all data points x D having the top-n largest anomaly scores Given a data set D, containing mostly normal (but unlabeled) l b l d) d data t points, i t and dat test t point i t x, compute the anomaly score of x with respect to D

Usually assume a parametric model describing the distribution of the data (e.g., normal distribution) Apply a statistical test that depends on
– Data distribution – Parameters of distribution (e.g., mean, variance) – Number of expected outliers (confidence limit)
– Supervised

Anomalies are regarded as a rare class Need to have training data
Data Mining
03/06/2011
10
Additional Anomaly Detection Techniques

Proximity-based
– One in a thousand occurs often if you have lots of data – Context is important, important e e.g., g freezing temps in July

Can be important or a nuisance

Pattern matching
– Create profiles or templates of atypical but important events t or objects bj t – Algorithms to detect these patterns are usually simple and efficient


I Issues
– Identifying the distribution of a data set
Heavy
tailed distribution
– Number of attributes – Is the data a mixture of distributions?

Noise doesn’t necessarily produce unusual values or objects Noise is not interesting Anomalies may be interesting if they are not a result of noise Noise and anomalies are related but distinct concepts
Data Mining
03/06/2011
11
Graphical Approaches

Boxplots or scatter plots Limitations
– Not automatic – Subjective

Data Mining
03/06/2011
12
Convex Hull Method

Other approaches assign a score to all points
– This score measures the degree to which an object is an anomaly – This allows objects to be ranked

In the end, you often need a binary decision

Extreme points are assumed to be outliers Use convex hull method to detect extreme values

What if the outlier occurs in the middle of the data?
03/06/2011 13
03/06/2011
4
Distinction Between Noise and Anomalies

Noise is erroneous, perhaps random, values or contaminating objects
– Weight recorded incorrectly – Grapefruit mixed in with the oranges
03/06/2011 5


Data Mining
General Issues: Number of Attributes

Many anomalies are defined in terms of a single attribute
– Height – Shape p – Color

Can be hard to find an anomaly using all attributes
– The set of data points that are considerably different than the remainder i d of f the h d data

Natural implication is that anomalies are relatively rare
Anomaly Detection
Dr. Hui Xiong Rutgers University
Data Mining to Data Mining Introduction
03/06/2011
08/06/2006
1
1
Anomaly/Outlier Detection

What are anomalies/outliers?
03/06/2011 6
Data Mining
General Issues: Anomaly Scoring

Many anomaly detection techniques provide only a binary categorization
– An object j is an anomaly y or it isn’t – This is especially true of classification-based approaches
Data Mining
Statistical Approaches
Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.
– Should this credit card transaction be flagged? – Still useful to have a score

How many anomalies are there?
03/06/2011 7
Data Mining
Other Issues for Anomaly Detection

Find all anomalies at once or one at a time
– Swamping – Masking g

Evaluation
– How do you measure performance? – Supervised vs. unsupervised situations


Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html
Data Mining
03/06/2011
3
Causes of Anomalies

Data from differenFra Baidu bibliotek classes
– Noisy or irrelevant attributes – Object is only anomalous with respect to some attributes

However, an object may not be anomalous in any one attribute
– 10 foot tall 2 year old – Unusually high blood pressure
Data Mining
03/06/2011
2
Importance of Anomaly Detection
Ozone Depletion History

In 1985 three researchers (Farman, Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations? The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
– Anomalies are points far away from other points – Can C d detect t t thi this graphically hi ll i in some cases

Density-based
– Low density points are outliers


Data Mining
03/06/2011
9
Model-Based Anomaly Detection

Build a model for the data and see
– Unsupervised

Anomalies are those points that don’t fit well Anomalies are those points that distort the model Examples: – – – – – Statistical distribution Clusters Regression Geometric Graph
相关文档
最新文档