03-DataPreprocessing-PartII (Data Preprocessing)
Feature Selection
•Exhaustive enumeration of feature subsets is not practically feasible
•Approaches:
 • Filter: features are selected before the data mining algorithm is run
 • Embedded: features are selected during the execution of the data mining algorithm
 • Wrapper: every subset proposed by the subset selection strategy is evaluated in the context of the learning algorithm
The Curse of Dimensionality
•Data analysis becomes harder as the dimensionality of data increases
•As dimensionality increases, data becomes increasingly sparse
 • Classification: not enough data to create a reliable model
 • Clustering: distances between points become less meaningful (illustrated below)
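A small R illustration (not from the original slides): for uniformly random points, the relative gap between the nearest and farthest pairwise distance shrinks as the number of dimensions grows, which is why distances lose meaning.

set.seed(1)
for (d in c(2, 10, 100, 1000)) {
  X <- matrix(runif(100 * d), ncol = d)  #100 random points in [0,1]^d
  dd <- dist(X)                          #all pairwise Euclidean distances
  cat(d, "dims: (max - min)/min =", (max(dd) - min(dd)) / min(dd), "\n")
}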
Feature Selection
•Heuristic: remove predictors so that all pairwise correlations are below a chosen threshold, using the procedure below:
1. Calculate the correlation matrix of the predictors.
2. Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B).
3. Determine the average correlation between A and the other variables.
4. Do the same for predictor B.
5. If A has a larger average correlation, remove it; otherwise, remove predictor B.
6. Repeat Steps 2–5 until no absolute correlations are above the threshold.
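This is essentially the procedure implemented by findCorrelation in the caret package. A minimal sketch, where the predictor matrix X and the threshold 0.75 are illustrative assumptions, not from the slides:

library(caret)
corM <- cor(X)                                   #Step 1: correlation matrix of predictors
highCor <- findCorrelation(corM, cutoff = 0.75)  #indices of predictors to remove
if (length(highCor) > 0) X <- X[, -highCor]      #drop the flagged predictors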
Techniques
•Feature selection:
• Select a subset of the attributes in the original data set
•Feature composition/creation:
• Creating new features based on existing ones
Principal Component Analysis
•The first dimension is chosen to capture as much of the variability of the data as possible
•The second dimension is chosen orthogonal to the first so that it captures as much of the remaining variability as possible, and so on
•Each pair of new attributes has covariance 0
•The attributes are ordered in decreasing order of the variance in the data that they capture
•Each attribute:
 • accounts for the maximum variance not accounted for by the preceding attributes
 • is uncorrelated with the preceding attributes
Feature Creation
•PCA
•Factor Analysis
•Feature transformation (a short sketch follows):
 • Categorical to binary
 • Binning of continuous data
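A minimal base-R sketch of both transformations, using a small hypothetical data frame:

df <- data.frame(color = factor(c("red", "green", "red")),
                 age = c(3, 17, 42))
model.matrix(~ color - 1, df)           #categorical -> one binary indicator column per level
cut(df$age, breaks = c(0, 10, 30, 60))  #bin continuous values into intervals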
Principal Component Analysis
•Identifies strong patterns in the data
•Expresses the data in a way that highlights differences and similarities
•Reduces the number of dimensions without much loss of information
Feature Selection
•Redundant/Correlated features:
• contain duplicate information: price, tax amount
•Irrelevant features:
 • No useful information: phone number, employee id
 • Zero/near-zero variance features: not distinctive enough (see the sketch below)
 • Text mining: stop words (a, an, of, the, your, …)
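Zero/near-zero variance features can be screened out with the caret package; a minimal sketch, assuming df is a data frame of predictors:

library(caret)
nzv <- nearZeroVar(df)                 #column indices of zero/near-zero variance features
if (length(nzv) > 0) df <- df[, -nzv]  #drop them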
Feature Selection
•Need:
 • A subset generation strategy
 • A measure for evaluation and a stopping criterion
•General approach (a runnable sketch follows this list):
 • Generate a subset of features
 • Evaluate it:
  • using an evaluation function, for filter approaches
  • using the data mining algorithm itself, for wrapper approaches
 • If the subset is good, stop
 • Otherwise, update the subset and repeat
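A self-contained sketch of this loop in wrapper style. The toy data, the greedy forward generation strategy, and AIC as the evaluation measure/stopping criterion are illustrative assumptions, not from the slides:

set.seed(1)
df <- data.frame(y = rnorm(100))
df$x1 <- df$y + rnorm(100, sd = 0.5)   #informative predictor
df$x2 <- rnorm(100)                    #pure noise
df$x3 <- df$y + rnorm(100, sd = 2)     #weakly informative predictor

#evaluation: AIC of a linear model built on the candidate subset
score <- function(vars) AIC(lm(reformulate(vars, response = "y"), data = df))

selected   <- c()
candidates <- c("x1", "x2", "x3")
best <- AIC(lm(y ~ 1, data = df))      #baseline: intercept-only model
repeat {
  scores <- sapply(candidates, function(f) score(c(selected, f)))
  if (length(scores) == 0 || min(scores) >= best) break  #stop: no improvement
  best <- min(scores)
  pick <- names(which.min(scores))
  selected   <- c(selected, pick)      #update the subset and repeat
  candidates <- setdiff(candidates, pick)
}
selected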
The Curse of Dimensionality
•Data sets with a large number of features
• Document matrix
       w1  w2  w3  w4  …
Doc1    0   1   0   0   …
Doc2    0   1   1   0   …
…       …   …   …   …
Data Preprocessing
•Aggregation: combine multiple objects into one
 • Reduces the number of objects: less memory, less processing time
 • Quantitative attributes are combined using sum or average
 • Qualitative attributes are omitted or listed as a set
 • Reduces variability
•Sampling: select a subset of the objects to be analyzed (aggregation and sampling are sketched after this list)
•Discretization: transform values into categorical form
•Dimensionality reduction
•Variable transformation
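A minimal base-R sketch of the first two operations, on a hypothetical transactions table with one row per purchase:

tx <- data.frame(customer = c("a", "a", "b", "b"),
                 amount = c(10, 20, 5, 7))
aggregate(amount ~ customer, data = tx, FUN = sum)  #aggregation: one combined row per customer
idx <- sample(nrow(tx), size = 2)                   #sampling: keep 2 of the 4 objects
tx[idx, ]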
Data Preprocessing (Part II)
Things to consider
• Data types & data exploration
• Data quality issues
• Transformations
Application Related Issues
•Timeliness:
 • Some data age quickly after they are collected
 • Example: purchasing behavior of online shoppers
 • All patterns/models based on expired data are out of date
 • Quick response times are needed: fraudulent transactions
 • Example: a time series of daily closing stock prices (NDX, ARO, …) from 1/1/1990 through 6/30/2014
•Relevance:
 • The data must contain information relevant to the application
•Knowledge about the data:
 • Correlated attributes
 • Special values (Example: 9999 for unavailable)
 • Attribute types (Example: Categorical {1, 2, 3})
PCA – Step 1
•Subtract the mean from each of the data dimensions
•This produces a new data set DM with mean 0
 • D0: original data matrix; d̄: vector of column means; DM: new, mean-centered data matrix

#Subtract the mean
DM <- scale(D0, scale = FALSE)
#Verify: column means should now be (numerically) zero
colMeans(DM)
PCA – Step 2
•Calculate the covariance matrix for DM: cov(DM)
#Covariance matrix of the mean-centered data
covM <- cov(DM)
PCA – Step 3
•Calculate the unit eigenvectors and eigenvalues of the covariance matrix
•That gives us the lines that characterize the data

#Eigen-decomposition of the covariance matrix
e <- eigen(covM)
D <- e$values   #eigenvalues, returned in decreasing order
V <- e$vectors  #unit eigenvectors, one per column
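As a cross-check (not in the original slides), the same decomposition is available through base R's prcomp, assuming D0 from Step 1; center = TRUE reproduces the mean subtraction:

pr <- prcomp(D0, center = TRUE)
pr$rotation  #the eigenvectors (loadings), matching V up to sign
pr$sdev^2    #the variance along each component, matching the eigenvalues D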
PCA – Step 4
•Choose components (a short R sketch follows):
 • The eigenvector with the largest eigenvalue is the principal component of the data set
 • Order the eigenvectors by eigenvalue, largest to smallest; this gives the components in order of significance
 • Select the first p eigenvectors; the final data will have p dimensions
 • Construct the feature vector Af = (eigv1 eigv2 … eigvp)
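A minimal sketch of this step, assuming V (eigenvectors) and DM (mean-centered data) from the earlier steps; p = 2 is an illustrative choice. Note that eigen() already returns the eigenvectors ordered by decreasing eigenvalue.

p  <- 2          #keep the first two components
Af <- V[, 1:p]   #feature vector: first p eigenvectors as columns
DF <- DM %*% Af  #final data: project the mean-centered data onto the components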
Why Feature Reduction?
•Less data, so the data mining algorithm learns faster
•Higher accuracy, so the model can generalize better
•Simpler results, so they are easier to understand
•Fewer features can also improve the data collection process
•Highly correlated features may affect the accuracy/stability of some algorithms:
 • Naïve Bayes, linear regression