Data Preprocessing Techniques
Zhejiang University undergraduate course "Introduction to Data Mining": lecture slides
Lecture 2: Data Preprocessing Techniques
Xu Congfu, Associate Professor, Institute of Artificial Intelligence, Zhejiang University

Outline
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

I. Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation=“”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
e.g., Age=“42” but Birthday=“03/07/1997”
e.g., was rating “1, 2, 3”, now rating “A, B, C”
e.g., discrepancy between duplicate records

Why Is Data Dirty?
Incomplete data comes from
n/a (not applicable) data values when collected
different considerations between the time when the data was collected and when it is analyzed
human/hardware/software problems
Noisy data comes from the process of data
collection
entry
transmission
Inconsistent data comes from
different data sources
functional dependency violation

Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics.
A data warehouse needs consistent integration of quality data
“Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse.” (Bill Inmon)

Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories: intrinsic, contextual, representational, and accessibility.

Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation in volume but produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing

II. Data Cleaning
Importance
“Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
“Data cleaning is the number one problem in data warehousing” (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
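
The three kinds of dirty data named in these slides (incomplete, noisy, inconsistent) can be flagged with simple per-record checks. The sketch below is only an illustration and is not part of the original lecture: the record fields, the `find_problems` and `age_on` helpers, the MM/DD/YYYY reading of the birthday, and the reference date are all assumptions made for the example.

```python
from datetime import date, datetime


def age_on(birthday, as_of):
    """Whole years elapsed between birthday and as_of."""
    years = as_of.year - birthday.year
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    if (as_of.month, as_of.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def find_problems(record, as_of):
    """Return a list of labels describing what is dirty in one record."""
    problems = []

    # Incomplete: an attribute of interest is missing or empty.
    if not record.get("occupation"):
        problems.append("incomplete: occupation is missing")

    # Noisy: a value outside its plausible range (a salary cannot be negative).
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: negative salary")

    # Inconsistent: the stored age disagrees with the age implied by the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday:
        # Assumption: birthdays are written as MM/DD/YYYY.
        born = datetime.strptime(birthday, "%m/%d/%Y").date()
        if age_on(born, as_of) != age:
            problems.append("inconsistent: age does not match birthday")

    return problems


# Hypothetical records modelled on the slide examples.
records = [
    {"occupation": "", "salary": 3000, "age": 30, "birthday": "03/07/1975"},
    {"occupation": "clerk", "salary": -10, "age": 42, "birthday": "03/07/1997"},
    {"occupation": "engineer", "salary": 5000, "age": 8, "birthday": "03/07/1997"},
]
as_of = date(2005, 9, 1)  # assumed reference date for computing ages

for record in records:
    print(find_problems(record, as_of))
```

Checks like these only detect problems. Deciding how to repair them (filling in a missing value, discarding or smoothing an outlier, choosing which of two conflicting fields to trust) is the data cleaning step itself and depends on the data set.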