数据挖掘——数据预处理

合集下载
  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。


Inconsistent data may come from
Different data sources Functional dependency violation (e.g., modify some linked data)

Duplicate records also need data cleaning

Noisy data (incorrect values) may come from
Faulty data collection instruments Human or computer error at data entry Errors in data transmission

2017/10/8
1 N
2
1 ( x ) i N i 1
2
n
xi 2
2 i 1
14
n
Standard deviation s (or σ) is the square root of variance s2 (or σ2)
Boxplot Analysis


Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
Integration of multiple databases, data cubes, or files
2017/10/8
16
Properties of Normal Distribution Curve
• e.g., occupation=“ ”
noisy: containing errors or outliers
• e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997” • e.g., Was rating “1,2,3”, now rating “A, B, C”
3
Data Preprocessing Overview

Why preprocess the data?


Descriptive data summarization
Data cleaning Data transformation Data integration Data reduction Discretization and concept hierarchy generation Summary
Selection and Transformation Pattern Evaluation
Data Mining
Data Warehouse Data Cleaning and Integration Databases
2017/10/8
Flat files
2
Review

Learning the application domain
6
2017/10/8
Why Is Data Preprocessing Important?

No quality data, no quality mining results!
Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics
Data Mining
Ying Liu, Prof., Ph.D
School of Computer and Control University of Chinese Academy of Sciences
Review
Data mining—core of knowledge discovery process
The median is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum, terminate at 1.5xIQR
2017/10/8 15
Boxplot Analysis
2017/10/8
5
Why Is Data Dirty?

Incomplete data may come from
“Not applicable” data value when collected Different considerations between the time when the data was collected and when it is analyzed Human/hardware/software problems

Numerical dimensions correspond to sorted intervals

Data dispersion: analyzed with multiple granularities of precision Boxplot or quantile analysis on sorted intervals

Data discretization
Data reduction for numerical data
8
2017/10/8
Forms of Data Preprocessing
2017/10/8
9
Data Preprocessing Overview

Why preprocess the data?
13
Measuring the Dispersion of Data

Quartiles, outliers and boxplots

Quartiles: Q1 (25th percentile), Q3 (75th percentile) Inter-quartile range: IQR = Q3 – Q1 Five number summary: min, Q1, M, Q3, max Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot outlier individually Outlier: usually, a value higher/lower than 1.5 x IQR


Descriptive data summarization
Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary
Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum

Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
11

2017/10/8
Measuring the Central Tendency

Mean (algebraic measure):

1 n Mean x xi n i 1 Weighted arithmetic mean:
Trimmed mean: chopping extreme values
Find useful features, dimensionality/variable reduction, invariant representation

Choosing the mining algorithm(s) to search for patterns of interest
relevant prior knowledge and goals of application
Creating a target data resource Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation
mean mode 3 (mean median)
12
2017/10/8
Symmetric vs. Skewed Data

Median, mean and mode of symmetric, positively and negatively skewed data
2017/10/8
10
2017/10/8
Data Descriptive Characteristics

Central tendency
Mean, weighted mean, etc.

Dispersion characteristics

median, max, min, quantiles, outliers, variance, etc.
4
2017/10/8
Why Data Preprocessing?

Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregated data

Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse
2017/10/8
7
Major Tasks in Data Preprocessing
x
wx
i 1 n i
n
i
Leabharlann Baiduw
i 1
i

Median: A holistic measure

Middle value if odd number of values, or average of the middle two values
otherwise

Mode

Value that occurs most frequently in the data Unimodal, bimodal, trimodal Empirical formula:


Variance and standard deviation (sample: s, population: σ)

Variance: (algebraic, scalable computation)
1 n 1 n 2 1 n 2 2 2 s ( xi x ) [ xi ( xi ) ] n 1 i 1 n 1 i 1 n i 1
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or similar analytical results

Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use 2017/10/8
of discovered knowledge
相关文档
最新文档