哈工大数据挖掘课件-chapter_2
合集下载
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
• A collection of attributes describe an object
– Object is also known as record, point, case, sample, entity, or instance
2013/9/16
Divorced 220K Single Married Single 85K 75K 90K
– Nominal:类型或标称变量
• 无大小之分,又无等级或次序之分,仅是一种标称或类别。 • 取值离散,可以用字符串型变量表示 • Exp.:性别、部门单位或颜色等
– Ordinal:顺序变量/序数变量
• 离散值, 其值尽管大小没有特定意义,但按照顺序排列。变量值之间的次 序是有一定意义的,打乱定义将产生错误 • Exp.: 名次、级别、职务等
GGTTCCGCCTTCAGCCCCGCGCC CGCAGGGCCCGCCCCGCGCCGTC GAGAAGGGCCCGCCTGGCGGGCG GGGGGAGGCGGGGCCGCCCGAGC CCAACCGAGTCCGACCAGGTGCC CCCTCTGCTCGGCCTAGACCTGA GCTCATTAGGCGGCAGCGGACAG GCCAAGTAGAACACGCGAAGCGC TGGGCTGCCTGCTGCGACCAGGG
HIT-DBLAB
12
ball
lost
pla y
wi n
Document 1 Document 2 Document 3
3 0 0
0 7 1
5 0 0
0 2 0
2 1 1
6 0 2
0 0 2
2 3 0
0 0 3
2 0 0
2013/9/16
HIT-DBLAB
2
Graph Data
• Examples: Generic graph and HTML Links
Chemical Data
• Benzene Molecule: C6H6
2 5 2 5 1
2013/9/16
HIT-DBLAB
13
2013/9/16
HIT-DBLAB
14
Ordered Data
• Sequences of transactions
Items/Events
Ordered Data
• Genomic sequence data
Projection of x Load 10.23 12.65 Projection of y load 5.27 6.25 Distance Load Thickness
Divorced 95K Married 60K
Divorced 220K Single Married Single 85K 75K 90K
<a href="papers/papers.html#bbbb"> Data Mining </a> <li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a> <li> <a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a> <li> <a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items. – For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.
– ID has no limit but age has a maximum and minimum value
HIT-DBLAB
• • • •
变量的值是在特定区间上有意义,超出该区间将没有意义 线性区间、数值连续的变量. 只可以用数值型变量表示,可以反映变量值之间的大小差异 Exp.: 身高、体重.
10
Taxable Income Cheat 125K 100K 70K 120K No No No No Yes No No Yes No Yes
2.1 Data?
Yes No No Yes No No Yes No No No
Single Married Single Married
Divorced 95K Married 60K
– Has only a finite or countably infinite set of values – Examples: zip codes, counts, or the set of words in a collection of documents – Often represented as integer variables. – Note: binary attributes are a special case of discrete attributes
accuracy精确性completeness完整性consistency一致性timeliness适时性believability可信性valueadded附加价值interpretability可解释性accessibility可访问性2013916hitdblab24majortasksindatapreprocessing?数据清理填补空值识别去除离群点平滑噪声数据解决数据不一致以及解决因数据集成带来的冗余52013916hitdblab25majortasksindatapreprocessing?数据集成将数据由多个源合并成一致的数据存储如数据仓库2013916hitdblab26majortasksindatapreprocessing?数据变换如规范化用于改进涉及距离度量的挖掘算法的精度和有效性aggregation2013916hitdblab27majortasksindatapreprocessing?数据归约在不影响数据分析结果的前提下减少数据量?通过聚集删除冗余特性或聚类等方法来压缩数据2013916hitdblab28majortasksindatapreprocessing?数据离散化partofdatareductionbutwithparticularimportanceespeciallyfornumericaldata2013916hitdblab2923描述性数据汇总2013916hitdblab30描述性数据汇总?动机
Types of data sets
• Record
– Data Matrix – Document Data – Transaction Data
• Graph
– World Wide Web – Molecular Structures
• Continuous Attribute
– Has real numbers as attribute values – Examples: temperature, height, or weight. – Practically, real values can only be measured and represented using a finite number of digits. – Continuous attributes are typically represented as floatingpoint variables.
– Examples: eye color of a person, temperature, etc. – Attribute is also known as variable, field, characteristic, or feature Objects
Attributes
Tid Refund Marital Status 1 2 3 4 5 6 7 8 9 10
Single Married Single Married
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute • Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
Tid Refund Marital Status 1 2 3 4 5 6 7 8 9 10
10
Data Matrix
Taxable Income Cheat 125K 100K 70K 120K No No No No Yes No No Yes No Yes
Yes No No Yes No No Yes No No No
– Ratio:比例变量
• 按照一定间隔、比例计量数据的变量类型 • 非线性区间标度中取正的度量值 • Exp.: AeBt 或 Ae-Bt
5 2013/9/16
2013/9/16
HIT-DBLAB
6
1
Discrete and Continuous Attributes
• Discrete Attribute
HIT-DBLAB
3
2013/9/16
HIT-DBLAB
4
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute • Distinction between attributes and attribute values
– Interval :区间变量
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers • But properties of attribute values can be different
15.22 16.22
2.7 2.2
1.2 1.1
10
2013/9/16
HIT-DBLAB
9
2013/9/16
HIT-DBLAB
Document Data
• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector, – the value of each component is the number of times the corresponding term occurs in the document.
TID Items
timeout
season
coach
game
score
team
1 2 3 4 5
11 2013/9/16
Bread, Coke, Milk Beer, Bread Beer, Coke, Diaper, Milk Beer, Bread, Diaper, Milk Coke, Diaper, Milk
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
Types of Attributes
• There are different types of attributes
2013/9/16
• Ordered
– Spatial Data – Temporal Data – Sequential Data
7 2013/9/16
HIT-DBLAB
HIT-DBLAB
8
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes
An element of the sequence
2013/9/1616
HIT-DBLAB
16
Ordered Data
• Spatio-Temporal Data
海量数据计算研究中心 • • • • • • • •
2013/9/16
Outline
数据? 为什么预处理? 描述性数据汇总 数据清理 数据集成和转换 数据归约 离散化和概念分层 总结
HIT-DBLAB
2
第二章 数据预处理
What is Data?
• Collection of data objects and their attributes • An attribute is a property or characteristic of an object