数据挖掘课件

  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。


Data warehouse needs consistent integration of quality data

Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse

Data dispersion characteristics
median,
2013年6月28日星期五
9
Measuring the Central Tendency

Mean (algebraic measure):


x
1 n

i 1
n
xi
x
Weighted arithmetic mean:
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
24
2013年6月28日星期五
Summary
8
2013年6月28日星期五
Mining Data Dispersion Characteristics

Motivation
To
better understand the data: central tendency, variation and spread
max, min, quantiles, outliers, variance, etc.
12
Histogram Analysis

Graph displays of basic statistical class descriptions Frequency histograms

A univariate graphical method Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
2013年6月28日星期五
4
Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view: Accuracy Completeness Consistency Timeliness Believability Interpretability Accessibility

Quartiles: Q1 (25th percentile), Q3 (75th percentile) Inter-quartile range: IQR = Q3 – Q1 Five number summary: min, Q1, M, Q3, max Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot outlier individually Outlier: usually, a value higher/lower than 1.5 x IQR Variance σ2: (algebraic, scalable computation)
Data Preprocessing

Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
1
2013年6月28日星期五
Why Data Preprocessing?

Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data


Fill in missing values


Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
25
2013年6月28日星期五
Why Is Data Preprocessing Important?

No quality data, no quality mining results!

Quality decisions must be based on quality data

e.g., duplicate or missing data may cause incorrect or even misleading statistics.
inconsistent with other recorded data and thus deleted data not entered due to misunderstanding

Missing data may be due to

certain data may not be considered important at the time of entry
2013年6月28日星期五
5
Major Tasks in Data Preprocessing

Data cleaning

Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
2013年6月28日星期五
16
Positively and Negatively Correlated Data
2013年6月28日星期五
21
Not Correlated Data
2013年6月28日星期五
22
Data Preprocessing

Why preprocess the data?
2
2013年6月28日星期五
Why Is Data Dirty?

Incomplete data comes from

n/a data value when collected
different consideration between the time when the data was collected and when it is analyzed. human/hardware/software problems collection entry
Part of data reduction but with particular importance, especially for numerical data
6

Data discretization

2013年6月28日星期五
Forms of data preprocessing
2013年6月28日星期五
Symmetric vs. Skewed Data

Median, mean and mode of symmetric, positively and negatively skewed data
2013年6月28日星期五
11
Measuring the Dispersion of Data

Quartiles, outliers and boxplots

e.g., occupation=―‖ e.g., Salary=―-10‖

noisy: containing errors or outliers


inconsistent: containing discrepancies in codes or names


e.g., Age=―42‖ Birthday=―03/07/1997‖ e.g., Was rating ―1,2,3‖, now rating ―A, B, C‖ e.g., discrepancy between duplicate records


Noisy data comes from the process of data


transmission
Different data sources

Inconsistent data comes from

Functional dependency violation
3
2013年6月28日星期五
Integration of multiple databases, data cubes, or files Normalization and aggregation

Data integration


Data transformation
ຫໍສະໝຸດ Baidu

Data reduction

Obtains reduced representation in volume but produces the same or similar analytical results

Mode


Value that occurs most frequently in the data
Unimodal, bimodal, trimodal Empirical formula:
mean mode 3 ( mean median )
2013年6月28日星期五
10
Data Cleaning

Importance ―Data cleaning is one of the three biggest problems in data warehousing‖—Ralph Kimball ―Data cleaning is the number one problem in data warehousing‖—DCI survey Data cleaning tasks

i 1
n
wi xi wi
Trimmed mean: chopping extreme values

i 1
n

Median: A holistic measure

Middle value if odd number of values, or average of the middle two values otherwise
7
Data Preprocessing

Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
2

Variance and standard deviation




1 n

i 1
n
( xi x )
2

1 n
[ xi
i 1
n
2

1 n
( xi )
i 1
n
2
]
Standard deviation σ is the square root of variance σ2
2013年6月28日星期五
Missing Data

Data is not always available

E.g., many tuples have no recorded value for several attributes, such as customer income in sales data equipment malfunction
相关文档
最新文档