Data Mining: Anomaly Detection

Tid  SrcIP          Start time  Dest IP
1    206.135.38.95  11:07:20    160.94.179.223
2    206.163.37.95  11:13:56    160.94.179.219
3    206.163.37.95  11:14:29    160.94.179.217
4    206.163.37.95  11:14:30    160.94.179.255
5    206.163.37.95  11:14:32    160.94.179.254
6    206.163.37.95  11:14:35    160.94.179.253
7    206.163.37.95  11:14:36    160.94.179.252
8    206.163.37.95  11:14:38    160.94.179.251
9    206.163.37.95  11:14:41    160.94.179.250
10   206.163.37.95  11:14:44    160.94.179.249
What are Anomalies?
• An anomaly is a pattern in the data that does not conform to the expected behavior
• An anomaly is a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism
• Also referred to as outliers, exceptions, peculiarities, surprises, etc.
• Anomalies translate to significant (often critical) real-life entities
Types of Anomalies*
• Point Anomalies
• Contextual Anomalies
• Collective Anomalies
Point Anomalies
• An individual data instance is anomalous w.r.t. the data
• Output of anomaly detection
– Score vs label
• Evaluation of anomaly detection techniques
– What counts as good detection
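One common way to make "good detection" concrete is to compare predicted labels against ground truth and report precision and recall, because plain accuracy can look high when anomalies are rare. The sketch below is only an illustration: the function name `precision_recall` and both label vectors are assumptions, not taken from the slides.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall), treating label 1 as 'anomaly'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical ground truth and detector output (1 = anomaly, 0 = normal).
y_true = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
y_pred = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, accuracy)
```

In this made-up case the detector misses one of three anomalies, so recall drops to two thirds while accuracy still reads 0.9.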
Input Data
• Most common form of data handled by anomaly detection techniques is Record Data
– Univariate
– Multivariate
Input Data – Nature of Attributes
• Nature of attributes
– Binary
– Categorical
– Continuous
– Hybrid
Tid  SrcIP          Duration  Dest IP         Number of bytes  Internal
1    206.163.37.81  0.10      160.94.179.208  150              No
2    206.163.37.99  0.27      160.94.179.235  208              No
3    160.94.123.45  1.23      160.94.179.221  195              Yes
4    206.163.37.37  112.03    160.94.179.253  199              No
5    206.163.37.41  0.32      160.94.179.244  181              No
Key Challenges
• Defining a representative normal region is challenging
• The boundary between normal and outlying behavior is often not precise
• Availability of labeled data for training/validation
• The exact notion of an outlier is different for different application domains
• Data might contain noise
• Normal behavior keeps evolving
• Appropriate selection of relevant features
Data Labels
• Supervised Anomaly Detection
– Labels available for both normal data and anomalies
• Semi-supervised Anomaly Detection
– Labels available only for normal data
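The semi-supervised setting can be sketched as: fit a model of normal behavior on the normal-labeled data only, then flag test instances that the model does not explain. The example below assumes a simple Gaussian model with a k-sigma rule. The function names, the training readings, and `k=3.0` are all illustrative assumptions rather than a method prescribed by the slides.

```python
from statistics import mean, stdev


def fit_normal_model(normal_values):
    """Summarize normal-labeled training data by its mean and sample std."""
    return mean(normal_values), stdev(normal_values)


def is_anomaly(x, model, k=3.0):
    """Flag x when it lies more than k standard deviations from the mean."""
    mu, sigma = model
    return abs(x - mu) > k * sigma


# Hypothetical sensor readings, all labeled normal.
normal_training = [50.1, 49.8, 50.3, 50.0, 49.9, 50.2]
model = fit_normal_model(normal_training)

print(is_anomaly(50.4, model))  # close to the training data
print(is_anomaly(52.0, model))  # far from the training data
```

No anomalous examples are needed for training, which is the practical appeal of this setting.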
Engine Temperature: 192, 195, 180, 199, 19, 177, 172, 285, 195, 163
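As a rough illustration of point anomaly detection on univariate data, the engine temperature readings can be scored with a robust (median-based) z-score. The slides do not name a method. The modified z-score, the constant 0.6745, and the cutoff 3.5 are conventional choices assumed here, and `modified_z_scores` is a hypothetical helper name.

```python
from statistics import median


def modified_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


# Engine temperature readings from the slide.
temperatures = [192, 195, 180, 199, 19, 177, 172, 285, 195, 163]

scores = modified_z_scores(temperatures)
# Assumed cutoff: flag readings whose absolute score exceeds 3.5.
flagged = [t for t, s in zip(temperatures, scores) if abs(s) > 3.5]
print(flagged)
```

The median and MAD are used instead of the mean and standard deviation because the extreme readings would otherwise inflate the very statistics used to judge them.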
Tid  SrcIP          Duration
1    206.163.37.81
2    206.163.37.99
3    160.94.123.45
4    206.163.37.37  112.03
5    206.163.37.41  0.32
Input Data – Complex Data Types
• Relationship among data instances
[Figure: two-dimensional plot with axes X and Y; labels N1, N2, o1, o2, O3]
Contextual Anomalies
• An individual data instance is anomalous within a context
• Requires a notion of context
• Also referred to as conditional anomalies*
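A contextual anomaly can be illustrated by scoring each value only against other values that share its context. In the sketch below, the context is a season and the value is a temperature. The data, the function name `contextual_anomalies`, and the MAD-based rule with `threshold=5.0` are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def contextual_anomalies(records, threshold=5.0):
    """Flag (context, value) pairs that are unusual within their own context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    # Per-context median and median absolute deviation (MAD).
    stats = {}
    for context, values in by_context.items():
        med = median(values)
        mad = median([abs(v - med) for v in values])
        stats[context] = (med, mad)

    flagged = []
    for context, value in records:
        med, mad = stats[context]
        # Contexts with zero spread are skipped rather than dividing by zero.
        if mad > 0 and abs(value - med) / mad > threshold:
            flagged.append((context, value))
    return flagged


# Hypothetical temperature readings tagged with a season as context.
records = [
    ("winter", -2), ("winter", 0), ("winter", 1), ("winter", -1), ("winter", 25),
    ("summer", 24), ("summer", 26), ("summer", 25), ("summer", 27), ("summer", 25),
]
print(contextual_anomalies(records))
```

The value 25 appears in both seasons, but only the winter reading is flagged, which is exactly the sense in which the anomaly depends on context.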
• The individual instances within a collective anomaly are not anomalous by themselves
[Figure label: Anomalous Subsequence]
Output of Anomaly Detection
• Label
– Each test instance is given a normal or anomaly label
– This is especially true of classification-based approaches
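When a technique outputs scores rather than labels, labels can still be derived afterwards, either by applying a cutoff or by taking the top-ranked instances. The sketch below shows both options. The scores, the cutoff of 3.0, and the function names are hypothetical.

```python
def labels_by_threshold(scores, threshold):
    """Turn anomaly scores into labels using a fixed cutoff."""
    return ["anomaly" if s > threshold else "normal" for s in scores]


def top_k_indices(scores, k):
    """Return the indices of the k highest-scoring instances, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical anomaly scores for ten test instances (higher = more anomalous).
scores = [0.1, 0.3, 0.2, 0.4, 7.9, 0.5, 0.7, 5.2, 0.3, 1.1]

labels = labels_by_threshold(scores, threshold=3.0)
top3 = top_k_indices(scores, k=3)
print(labels)
print(top3)
```

A ranking lets an analyst decide how many instances to inspect, whereas a label fixes that decision inside the detector.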
– Sequential
  • Temporal
– Spatial
– Spatio-temporal
– Graph
GGTTCCGCCTTCAGCCCCGCGCC CGCAGGGCCCGCCCCGCGCCGTC GAGAAGGGCCCGCCTGGCGGGCG GGGGGAGGCGGGGCCGCCCGAGC CCAACCGAGTCCGACCAGGTGCC CCCTCTGCTCGGCCTAGACCTGA GCTCATTAGGCGGCAGCGGACAG GCCAAGTAGAACACGCGAAGCGC TGGGCTGCCTGCTGCGACCAGGG
– Noise should be removed before outlier detection
– Outliers are interesting: they violate the mechanism that generates the normal data
• Outlier detection vs. novelty detection: a novel pattern is treated as an outlier at an early stage, but is later merged into the model
Anomaly Detection: An Introduction
Source of slides:
– Tutorial at the American Statistical Association (ASA 2008)
– Jiawei Han, Data Mining: Concepts and Techniques
– Tutorial at the European Conference on Principles and Practice of Knowledge Discovery in Databases
Speaker: Wentao Li
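Tid  SrcIP          Start time  Dest IP         Dest Port  Number of bytes  Attack
1    206.135.38.95  11:07:20    160.94.179.223  139        192              No
2    206.163.37.95  11:13:56    160.94.179.219  139        195              No
3    206.163.37.95  11:14:29    160.94.179.217  139        180              No
4    206.163.37.95  11:14:30    160.94.179.255  139        199              No
5    206.163.37.95  11:14:32    160.94.179.254  139        19               Yes
6    206.163.37.95  11:14:35    160.94.179.253  139        177              No
7    206.163.37.95  11:14:36    160.94.179.252  139        172              No
8    206.163.37.95  11:14:38    160.94.179.251  139        285              Yes
9    206.163.37.95  11:14:41    160.94.179.250  139        195              No
10   206.163.37.95  11:14:44    160.94.179.249  139        163              Yes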
• Unsupervised Anomaly Detection
– No labels assumed
– Based on the assumption that anomalies are very rare compared to normal data
• Note: some materials describe this differently; we adopt the definition given here, although it differs slightly from the traditional definition
– Dangerous + theft condition = theft
– Money consumer: the poor and the rich
[Figure legend: Anomaly, Normal]
* Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka, Conditional Anomaly Detection, IEEE Transactions on Knowledge and Data Engineering, 2006.
Collective Anomalies
• A collection of related data instances is anomalous
• Requires a relationship among data instances
– Sequential Data
– Spatial Data
– Graph Data
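For sequential data, one simple way to surface a collective anomaly is to slide a fixed-width window over the sequence and flag windows whose pattern occurs rarely. This is a frequency-based sketch chosen for brevity, not a technique named in the slides. The sequence, the window width, and the `max_count` cutoff are assumptions.

```python
from collections import Counter


def rare_windows(sequence, width=3, max_count=2):
    """Return start indices of windows whose pattern occurs at most max_count times."""
    windows = [
        tuple(sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]
    counts = Counter(windows)
    return [i for i, w in enumerate(windows) if counts[w] <= max_count]


# Hypothetical symbol sequence: a regular "ab" rhythm interrupted by a run of "a".
sequence = "ab" * 5 + "aaa" + "ab" * 5

starts = rare_windows(sequence)
print(starts)
```

Both symbols are common on their own, so no single position would be flagged as a point anomaly. Only the windows covering the run of repeated "a" are rare.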
Outline
• Definition
• Application
• Methods
– Time is limited, so I only sketch the overall picture of anomaly detection; for more detail, please refer to the paper.
• Map
– Related areas (theory)
– Application (practice)
– Problem formulation
• Detection effect +
Aspects of Anomaly Detection Problem
• Nature of input data
– What are the characteristics of the input data
• Availability of supervision
– Number of labels
• Type of anomaly: point, contextual, structural
– Type of anomaly
– Cyber intrusions
– Credit card fraud
– Faults in mechanical systems
Related problems
• Outliers are different from noise
– Noise is random error or variance in a measured variable