03-DataPreprocessing-PartI (Data Preprocessing)
Interval: new = a*old + b
Ratio: new = a*old
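The two transformation rules above can be checked numerically. The following is a minimal sketch, assuming Celsius to Fahrenheit as the interval example and meters to feet as the ratio example; the function names and sample values are illustrative and not part of the slides.

```python
def celsius_to_fahrenheit(c):
    # Interval transformation: new = a*old + b with a = 9/5 and b = 32
    return 9 / 5 * c + 32


def meters_to_feet(m):
    # Ratio transformation: new = a*old with a = 3.28084 (approximate factor)
    return 3.28084 * m


c1, c2, c3 = 10.0, 20.0, 40.0
f1, f2, f3 = (celsius_to_fahrenheit(c) for c in (c1, c2, c3))

# Ratios of differences are preserved by an interval transformation.
diff_ratio_c = (c3 - c2) / (c2 - c1)
diff_ratio_f = (f3 - f2) / (f2 - f1)

# Ratios of the values themselves are not preserved (the offset b breaks them).
value_ratio_c = c2 / c1
value_ratio_f = f2 / f1

# Ratios of values are preserved by a ratio transformation.
m1, m2 = 2.0, 6.0
ratio_m = m2 / m1
ratio_ft = meters_to_feet(m2) / meters_to_feet(m1)

print(diff_ratio_c, diff_ratio_f)
print(value_ratio_c, value_ratio_f)
print(ratio_m, ratio_ft)
```

This is why "20 °C is twice as warm as 10 °C" is not a meaningful statement, while "6 m is three times as long as 2 m" is.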
Discrete and Continuous Attributes
•Discrete Attributes:
• Finite or countably infinite set of values
• Categorical (zip codes, employee IDs) or numeric (counts)
• Often represented as integers
• Special case: binary attributes (yes/no, true/false, 0/1)
Nominal: mode, entropy, contingency
Ordinal: median, percentiles, rank correlation, run tests, sign tests
Interval: mean, standard deviation, Pearson's correlation, t and F tests
Ratio: geometric mean, harmonic mean, percent variation
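Several of the statistics listed above are available in Python's standard `statistics` module (`geometric_mean` requires Python 3.8 or later). The sketch below is illustrative only: the sample values, the grade ranking, and the `entropy` helper are assumptions made for this example.

```python
import math
import statistics
from collections import Counter

eye_color = ["brown", "blue", "brown", "green", "brown"]  # nominal
grades = ["A", "B", "B", "C", "F"]                        # ordinal
temps_c = [18.0, 21.0, 24.0]                              # interval
masses_kg = [2.0, 8.0]                                    # ratio


def entropy(values):
    # Shannon entropy (in bits) of the empirical value distribution
    counts = Counter(values)
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


# Nominal: only counting is meaningful.
nominal_mode = statistics.mode(eye_color)
nominal_entropy = entropy(eye_color)

# Ordinal: map to ranks, take the median rank, map back to a grade.
rank = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}
grade_of = {r: g for g, r in rank.items()}
ordinal_median = grade_of[statistics.median_low([rank[g] for g in grades])]

# Interval: differences are meaningful, so mean and standard deviation apply.
interval_mean = statistics.mean(temps_c)
interval_stdev = statistics.stdev(temps_c)

# Ratio: ratios are meaningful, so geometric and harmonic means apply.
ratio_gmean = statistics.geometric_mean(masses_kg)
ratio_hmean = statistics.harmonic_mean(masses_kg)

print(nominal_mode, ordinal_median, interval_mean, interval_stdev)
```

Each statistic needs the operations of its own level, so everything valid for a lower level (for example the mode) remains valid for the higher ones.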
Gender: 0 denotes male, 1 denotes female
•Asymmetric: if the states are not equally important
Medical Test: 0 denotes negative, 1 denotes positive
Attribute Properties
◦ Database data
◦ Data warehouse data
◦ Data streams
◦ Sequence data
◦ Graph
◦ Spatial data
◦ Text data
Database data
Data stored in a database management system (DBMS)
• Temperatures in K, lengths, age, mass, …
Binary Attributes
•A nominal attribute with only two categories: 0 and 1 (or True/False)
•Symmetric: if both states of the variable are equally valuable
Numeric or quantitative types: Interval, Ratio
Ratio examples: temperatures in Kelvin, monetary quantities, counts, age, mass
Transformations
Type: Nominal (categorical or qualitative)
Transformation: any one-to-one mapping
Comment: if all employee numbers are reassigned, it will not make a difference
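Nominal: provides enough information to distinguish values by name. Operations: =, ≠
Ordinal: provides enough information to sort values. Operations: <, >
Interval: differences between values are meaningful. Operations: +, -
Ratio: differences and ratios are meaningful. Operations: *, /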
•Preprocessing: modify the data to better fit data mining tools:
• Change length into short, medium, long
• Reduce number of attributes
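The first preprocessing step above (turning a numeric length into short, medium, or long) is a discretization. A minimal sketch follows; the function name and the two thresholds are hypothetical and would be chosen from the data in practice.

```python
def discretize_length(length_cm, short_max=10.0, long_min=50.0):
    """Map a numeric length to an ordinal label.

    short_max and long_min are illustrative cut points, not values
    taken from the slides.
    """
    if length_cm < short_max:
        return "short"
    if length_cm < long_min:
        return "medium"
    return "long"


lengths = [3.2, 12.0, 75.5, 49.9]
labels = [discretize_length(x) for x in lengths]
print(labels)
```

The result is an ordinal attribute: the labels can still be sorted, but the exact differences between the original lengths are no longer available.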
Data
•Collection of objects or records
•Each object described by a number of attributes
•Attribute: property of the object, whose value may vary from object to object or from time to time
Columns: Patient ID, Gender, DOB, Systolic, Diastolic, Heart Rate, Smoker
Smoker values for the three records shown: Yes, No, Yes
Product (id, description, weight, unit)
Order(id, order_number, customer_id, product_id, quantity, price)
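To make the relational model concrete, the Customer, Product, and Order relations can be created in an in-memory SQLite database with Python's built-in `sqlite3` module. This is only a sketch: the column types, the sample rows, and the table name `orders` (used because `ORDER` is a reserved word in SQL) are assumptions, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT,
        address TEXT, city TEXT, state TEXT, phone TEXT
    );
    CREATE TABLE product (
        id INTEGER PRIMARY KEY, description TEXT, weight REAL, unit TEXT
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, order_number TEXT,
        customer_id INTEGER REFERENCES customer(id),
        product_id INTEGER REFERENCES product(id),
        quantity INTEGER, price REAL
    );
    -- Made-up sample rows, for illustration only
    INSERT INTO customer (id, firstname, lastname) VALUES (1, 'Ann', 'Lee');
    INSERT INTO product (id, description, weight, unit)
        VALUES (10, 'steel bolt', 0.1, 'kg'), (11, 'copper wire', 1.5, 'kg');
    INSERT INTO orders (id, order_number, customer_id, product_id, quantity, price)
        VALUES (100, 'A-1', 1, 10, 2, 4.5), (101, 'A-2', 1, 11, 5, 2.5);
    """
)

# Join the three relations through the foreign keys stored in orders.
rows = conn.execute(
    """
    SELECT c.lastname, p.description, o.quantity * o.price AS total
    FROM orders AS o
    JOIN customer AS c ON c.id = o.customer_id
    JOIN product AS p ON p.id = o.product_id
    ORDER BY o.id
    """
).fetchall()
print(rows)
conn.close()
```

The `customer_id` and `product_id` columns are what make the data "interrelated": each order row points at one customer and one product.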
Data Warehouses
◦ Data collected from multiple data sources
◦ Stored under a unified schema
◦ Usually residing at a single site
◦ Provides historical information
◦ Used for reporting
Data Preprocessing
Data – Things to consider
•Type of data: determines which tools can be used to analyze the data
•Quality of data:
• Tolerate some level of imperfection
• Improving the quality of the data improves the quality of the results
•Continuous Attributes:
• Real numbers
• Examples: temperatures, height, weight, …
• In practice, can only be measured with limited precision
Asymmetric Attributes
Graph Data
Data structure represented by nodes (entities) and edges (relationships)
Example:
◦ Protein subsequences
◦ Web pages and links
[Figure: example graph with nodes a, b, c, d, and e]
Attribute Types
Nominal: zip codes, employee ID numbers, eye color, gender
Ordinal: hardness of minerals, {good, better, best}, street numbers
Interval: calendar dates, temperatures in Celsius and Fahrenheit
Ordinal: any order-preserving function. Example: {0.5, 1, 10} => {1, 2, 3}
Interval example: Celsius to/from Fahrenheit
Ratio example: length can be measured in meters or feet
Numeric Or Quantitative
•Distinctness: = and ≠
•Order: <, ≤, ≥, and >
•Addition: + and -
•Multiplication: * and /
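The four properties above build on one another: a nominal attribute has only distinctness, an ordinal attribute adds order, an interval attribute adds addition, and a ratio attribute has all four. A minimal sketch of that hierarchy follows; the names `PROPERTIES`, `TYPE_LEVEL`, `supported_properties`, and `is_meaningful` are illustrative.

```python
# Properties in cumulative order: each attribute type keeps every
# property of the types before it.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]
TYPE_LEVEL = {"nominal": 1, "ordinal": 2, "interval": 3, "ratio": 4}


def supported_properties(attribute_type):
    # Return the properties that are meaningful for this attribute type.
    return PROPERTIES[: TYPE_LEVEL[attribute_type]]


def is_meaningful(attribute_type, prop):
    # True if the given property applies to the given attribute type.
    return prop in supported_properties(attribute_type)


for t in TYPE_LEVEL:
    print(t, supported_properties(t))
```

For example, `is_meaningful("ordinal", "addition")` is false: grades can be ranked, but adding two grades has no defined meaning.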
Table columns: Type, Description, Examples, Operations
Nominal (categorical or qualitative)
Examples:
•Web pages visited by a user (object):
• {<Homepage>, <Electronics>, <Cameras and Camcorders>, <Digital Cameras>, …, <Shopping Cart>, <Order Confirmation>}, {…}
•Transactions made by a customer over a period of time:
• {t1, t18, t500, t721}, {t11, t38, t43, t621, t3005}
Collection of interrelated data and a software system to manage and access the data
Relational model:
Customer (id, firstname, lastname, address, city, state, phone)
Data Streams
A sequence of digital signals used for transmitting different kinds of content
◦ Sensor data: collecting GPS/environment data and sending a reading every tenth of a second
◦ Image data: satellite data, surveillance cameras
◦ Web traffic: a node on the Internet receives streams of IP packets
Document-term matrix (number of times each term occurs in each document):
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    …     …        …
Document 2  0     7      0     2     1      0     0    …     …        …
Document 3  0     1      0     0     1      2     2    …     …        …
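A document-term matrix like the one above can be produced by counting how often each vocabulary term occurs in each document. The sketch below uses `collections.Counter`; the three toy documents and the `term_counts` helper are made up for illustration and do not reproduce the counts from the slide.

```python
from collections import Counter

vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

# Toy documents (assumed for this example only)
documents = {
    "Document 1": "team play play game score",
    "Document 2": "coach coach ball score",
    "Document 3": "coach win game timeout",
}


def term_counts(text, vocabulary):
    # Count whitespace-separated tokens, then read off one count per term.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


matrix = {name: term_counts(text, vocabulary) for name, text in documents.items()}

for name, row in matrix.items():
    print(name, row)
```

Each row is one data object (a document) and each column is one attribute (a term), so text data ends up in the same record-by-attribute form as the other data sets.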
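…
Patient 1029345: Male, DOB 1/24/1957, systolic 151, diastolic 92, heart rate 62
Patient 1029346: Male, DOB 5/3/1983, systolic 124, diastolic 80, heart rate 66
Patient 1029347: Female, DOB 9/20/1991, systolic 110, diastolic 74, heart rate 54
…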
What kind of Data?
Any data as long as it is meaningful for the target application
•Only presence of attribute is considered important
•Can be binary, discrete or continuous
•Examples:
• Words in a document
• Courses taken by students
• Items purchased by customers
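Because only presence matters for an asymmetric attribute, such data is often stored as the set of items that are present rather than as a long vector that is mostly zeros. A minimal sketch, using made-up shopping baskets and a hypothetical `to_binary_vector` helper:

```python
# Each customer is described only by the items that are present.
baskets = {
    "customer_1": {"bread", "milk"},
    "customer_2": {"milk", "diapers", "beer"},
}

# The full attribute list is the union of everything anyone bought.
all_items = sorted(set().union(*baskets.values()))


def to_binary_vector(items, all_items):
    # 1 marks presence; 0 (absence) carries little information here.
    return [1 if item in items else 0 for item in all_items]


vectors = {name: to_binary_vector(items, all_items) for name, items in baskets.items()}

# Only shared presences are treated as evidence that two customers are alike.
shared = baskets["customer_1"] & baskets["customer_2"]

print(all_items)
print(vectors)
print(shared)
```

With thousands of possible items, two customers would share almost all of their zeros, which is why the absences are ignored.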
[Figure: data stream model. Input streams (1, 3, 2, 5, 7, 0, 1; a, r, b, c, a, l, u; 1, 0, 1, 1, 1, 0, 0, 1) enter a Stream Processor, which produces an Output stream and is connected to an Archive]
Sequence Data
Data obtained by:
◦ connecting all records associated with a certain object together
◦ and sorting them in increasing order of their timestamps
◦ for all objects
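The construction described above (group records by object, then sort each group by timestamp) can be sketched in a few lines. The record layout `(object, timestamp, event)`, the sample records, and the function name `build_sequences` are assumptions made for this example.

```python
from collections import defaultdict

# Unordered records: (object, timestamp, event)
records = [
    ("user_a", 3, "Cameras"),
    ("user_b", 1, "Homepage"),
    ("user_a", 1, "Homepage"),
    ("user_a", 2, "Electronics"),
    ("user_b", 2, "Shopping Cart"),
]


def build_sequences(records):
    # Step 1: connect all records that belong to the same object.
    by_object = defaultdict(list)
    for obj, timestamp, event in records:
        by_object[obj].append((timestamp, event))
    # Step 2: sort each object's records by increasing timestamp.
    return {
        obj: [event for _, event in sorted(events, key=lambda pair: pair[0])]
        for obj, events in by_object.items()
    }


sequences = build_sequences(records)
print(sequences)
```

Each resulting list is one sequence, for example the pages a single user visited in order.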
•Nominal: Differentiates between values based on names
• Gender, eye color, patient ID
•Ordinal: Allows a rank order for sorting data but does not describe the degree of difference
• {low, medium, high}, grades {A, B, C, D, F}
•Interval: Describes the degree of difference between values
• Dates, temperatures in C and F
•Ratio: Both the degree of difference and the ratio between values are meaningful