
Part II Data Mining
Outline
  The Concept of Data Mining
  Architecture of a Typical Data Mining System
  What Can Be Mined?
  Major Issues in Data Mining
  Data Cleaning
What Is Data Mining?
Data mining is the process of discovering interesting knowledge from large amounts of data.
The main difference between information retrieval and data mining lies in their goals.
  Information retrieval helps users search for documents or data that satisfy their information needs.
    E.g., find customers who purchased more than $10,000 in the last month.
  Data mining discovers useful knowledge by analyzing correlations in the data using sophisticated data mining techniques.
    E.g., find all items that are frequently purchased together with milk.
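The contrast between the two example tasks can be sketched in a few lines of Python. This is only an illustration under assumed data: the records, the basket contents, and the function names `big_spenders` and `bought_with` are hypothetical and do not come from the slides. The first function answers a fixed query (retrieval); the second counts co-occurrences to surface a pattern nobody asked for by name (mining).

```python
from collections import Counter

# Hypothetical purchase records: (customer, amount spent last month).
purchases = [("alice", 12500), ("bob", 800), ("carol", 10400)]

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
]


def big_spenders(records, threshold=10_000):
    """Retrieval: return the customers whose spending exceeds the threshold."""
    return [name for name, amount in records if amount > threshold]


def bought_with(baskets, item, min_count=2):
    """Mining: count items that co-occur with `item`, keeping the frequent ones."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})
    return {other: n for other, n in counts.items() if n >= min_count}


print(big_spenders(purchases))
print(bought_with(baskets, "milk"))
```

The threshold `min_count` plays the role of a minimum frequency; real systems would use a support threshold over far larger data.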
A KDD Process (1)
Some people view data mining as synonymous
A KDD Process (2)
Learning the application domain:
  Relevant knowledge and goals of the application
Creating a target data set:
  Data selection
Data cleaning and preprocessing
Choosing the functions of data mining:
  Summarization, classification, association, clustering, etc.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation:
  Removing redundant patterns, visualization, transformation, etc.
  Present results to the user in a meaningful manner.
Use of discovered knowledge
What kinds of patterns can be mined? (1)
Concept/class description
  Characterization: provide a summarization of the given data set.
  Comparison: mine the distinguishing characteristics that differentiate a target class from comparable contrasting classes.
Association rules (correlation and causality)
  Association rules have the form X ⇒ Y. Examples:
    contains(T, “computer”) ⇒ contains(T, “software”) [support = 1%, confidence = 50%]
    age(X, “20..29”) ∧ income(X, “20..29K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%]
Classification and prediction
  Find models that describe and distinguish classes, for use in future prediction.
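The support and confidence figures attached to the association rules above can be computed directly from transaction data. For a rule X ⇒ Y, support is the fraction of all transactions that contain both X and Y, and confidence is the fraction of transactions containing X that also contain Y. The sketch below assumes a tiny hypothetical transaction list; the function names are illustrative only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # rule never applies; report 0 rather than divide by zero
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical transactions, not taken from the slides.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"printer"},
]

# Rule: contains "computer" => contains "software"
print(support(transactions, {"computer", "software"}))
print(confidence(transactions, {"computer"}, {"software"}))
```

On this toy data the rule holds in one of four transactions and in one of the two transactions that contain "computer".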
What kinds of patterns can be mined? (2)
Clustering
  Group data to form classes.
  Principle: maximize the intra-class similarity and minimize the inter-class similarity.
Outlier analysis
  Find objects that do not comply with the general behavior or data model.
Trend and evolution analysis
  Sequential pattern mining
  Regression analysis
  Periodicity analysis
  Similarity-based analysis
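The clustering principle above can be checked numerically on a toy example. The sketch below uses absolute difference between one-dimensional points as a stand-in for dissimilarity, which is an assumption made for illustration; the slides do not prescribe a particular measure. A good grouping shows a small average distance inside each cluster and a large average distance between clusters.

```python
from itertools import combinations, product


def mean_intra_distance(cluster):
    """Average pairwise distance inside one cluster (needs at least two points)."""
    pairs = list(combinations(cluster, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def mean_inter_distance(c1, c2):
    """Average distance between points drawn from two different clusters."""
    pairs = list(product(c1, c2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Hypothetical one-dimensional data split into two groups.
low = [1, 2, 3]
high = [10, 11, 12]

print(mean_intra_distance(low), mean_intra_distance(high))
print(mean_inter_distance(low, high))
```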
What kinds of patterns can be mined? (3)
In the context of text and Web mining, the discovered knowledge also includes:
  Word association
  Web resource discovery
  News events
  Browsing behavior
  Online communities
  Mining Web link structures to identify authoritative Web pages
  Finding spam sites
  Opinion mining
  …
Major Issues in Data Mining (1)
Mining methodology and user interaction
  Mining different kinds of knowledge in databases
  Interactive mining of knowledge at multiple levels of abstraction
  Incorporation of background knowledge
  Data mining query languages
  Presentation and visualization of data mining results
  Handling noise and incomplete data
  Pattern evaluation
Performance and scalability
  Efficiency and scalability of data mining algorithms
  Parallel, distributed, and incremental mining methods
©Wu Yangyang
Major Issues in Data Mining (2)
Issues relating to the diversity of data types
  Handling relational and complex types of data
  Mining information from heterogeneous databases and the WWW
Issues related to applications
  Application of discovered knowledge
    Domain-specific data mining tools
    Intelligent query answering
    Process control and decision making
  Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
  Protection of data security, integrity, and privacy
Cultures
Databases: concentrate on large-scale (non-main-memory) data.
  To a database person, data mining is an extreme form of analytic processing. The result is the data that answers the query.
AI (machine learning): concentrate on complex methods and small data.
Statistics: concentrate on models.
  To a statistician, data mining is the inference of models. The result is the parameters of the model.
  E.g., given a billion numbers, a statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation.
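The statistician's view can be illustrated with a minimal sketch: fitting a Gaussian to a list of numbers by maximum likelihood reduces to computing the mean and the population standard deviation. The helper name `fit_gaussian` and the sample values are assumptions for illustration; a real billion-point data set would need a streaming computation rather than an in-memory list.

```python
import statistics


def fit_gaussian(values):
    """Return (mean, std) of the maximum-likelihood Gaussian for `values`."""
    mu = statistics.fmean(values)
    # The maximum-likelihood estimate uses the population (not sample) deviation.
    sigma = statistics.pstdev(values, mu)
    return mu, sigma


# Small hypothetical sample standing in for the "billion numbers".
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mu, sigma = fit_gaussian(sample)
print(mu, sigma)
```

Whatever the size of the input, the reported result is just these two parameters, which is the point the slide makes about the statistical culture.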
Data Cleaning (1)
Data preprocessing:
  Cleaning, integration, transformation, reduction, discretization
Why data cleaning?
  No quality data, no quality mining results! Garbage in, garbage out!
Measures of data quality:
  Accuracy
  Completeness
  Consistency
  Timeliness
  Believability
  Interpretability
  Accessibility
Data Cleaning (2)
Data in the real world is dirty:
  Incomplete: lacking some attribute values, lacking certain attributes of interest, or containing only aggregate data
  Noisy: containing errors or outliers
  Inconsistent: containing discrepancies in codes or names
Major tasks in data cleaning:
  Fill in missing values
  Identify outliers and smooth out noisy data
  Correct inconsistent data
  Resolve redundancy caused by data integration
Data Cleaning (3)
Handle missing values:
  Ignore the tuple
  Fill in the missing value manually
  Use a global constant to fill in the missing value
  Use the attribute mean to fill in the missing value
  Use the attribute mean of all samples belonging to the same class to fill in the missing value
  Use the most probable value to fill in the missing value
Identify outliers and smooth out noisy data:
  Binning method:
    First sort the data and partition it into bins.
    Then smooth by bin means, bin medians, bin boundaries, etc.
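Two of the missing-value strategies listed above, filling with the attribute mean and filling with the mean of the same class, can be sketched as follows. This is a minimal illustration: missing values are represented by `None`, the function names are hypothetical, and the class-based version assumes every class has at least one known value.

```python
from collections import defaultdict
from statistics import fmean


def fill_with_mean(values):
    """Replace each None with the mean of the known values of the attribute."""
    known = [v for v in values if v is not None]
    mean = fmean(known)
    return [mean if v is None else v for v in values]


def fill_with_class_mean(rows):
    """rows: list of (class_label, value or None).

    Replace each None with the mean of the known values in the same class.
    Assumes each class has at least one known value.
    """
    by_class = defaultdict(list)
    for label, value in rows:
        if value is not None:
            by_class[label].append(value)
    means = {label: fmean(vals) for label, vals in by_class.items()}
    return [(label, means[label] if value is None else value)
            for label, value in rows]


print(fill_with_mean([10, None, 30]))
print(fill_with_class_mean([("a", 10), ("a", None), ("b", 100), ("b", None), ("a", 20)]))
```

The class-based fill usually gives a closer estimate when the attribute differs systematically between classes, at the cost of needing class labels.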
Data Cleaning (4)
Example
Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
-Bin 1: 4, 8, 9, 15
-Bin 2: 21, 21, 24, 25
-Bin 3: 26, 28, 29, 34
Smoothing by bin means (rounded to the nearest integer):
-Bin 1: 9, 9, 9, 9
-Bin 2: 23, 23, 23, 23
-Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries (each value is replaced by the closer boundary of its bin):
-Bin 1: 4, 4, 4, 15
-Bin 2: 21, 21, 25, 25
-Bin 3: 26, 26, 26, 34
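The worked example above can be reproduced with a short Python sketch. The function names are illustrative, and two details are assumptions not stated on the slide: the number of values is taken to be divisible by the number of bins, and a value exactly halfway between the two boundaries goes to the lower one.

```python
def equi_depth_bins(values, n_bins):
    """Sort the data and split it into n_bins bins of equal size.

    Assumes len(values) is divisible by n_bins.
    """
    data = sorted(values)
    depth = len(data) // n_bins
    return [data[i * depth:(i + 1) * depth] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value by its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum.

    Ties go to the lower boundary (an assumption).
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(data, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

The unrounded bin means are 9, 22.75, and 29.25, which round to the 9, 23, and 29 shown in the example.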
Clustering