Machine Learning and Data Mining, Lecture 4

Feature Selection for Document Classification
Class-independent measures:
— Document frequency (stop words)
— Term strength
— tf-idf
Relative to a class or classes of documents:
— Information gain
— Mutual information
— χ2 statistic
— Training or cross-validated accuracy of single-feature classifiers
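As a hedged illustration of one class-dependent score from the list above, the sketch below ranks terms of a tiny toy corpus by the χ2 statistic using scikit-learn; the corpus, labels, and vectorizer settings are illustrative assumptions rather than anything from the slides.

```python
# Minimal sketch: rank terms for document classification with the chi^2 statistic.
# The toy corpus and labels are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = [
    "cheap loans apply now",               # class 1
    "limited offer cheap pills",           # class 1
    "meeting agenda for monday",           # class 0
    "please review the attached report",   # class 0
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # document-term count matrix
scores, p_values = chi2(X, labels)          # chi^2 score of each term vs. the class

# Show the terms ordered from most to least class-informative.
terms = vectorizer.get_feature_names_out()
for term, score in sorted(zip(terms, scores), key=lambda t: -t[1]):
    print(f"{term:10s} chi2 = {score:.2f}")
```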
Demonstrates the importance of graphing data before analyzing it, and the effect of outliers on the statistical properties of a dataset.
The following statistics are the same for all four datasets:
— Mean and variance of X
— Mean and variance of Y
— Correlation between X and Y
— Linear regression line y = 3 + 0.5x
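A minimal sketch of how these shared statistics can be checked; loading the quartet through seaborn's bundled "anscombe" dataset is an assumption, and any copy of the data would do.

```python
# Minimal sketch: check that the four Anscombe datasets share the same summary
# statistics.  Loading the data through seaborn is an assumption.
import numpy as np
import seaborn as sns

df = sns.load_dataset("anscombe")                # columns: dataset, x, y
for name, group in df.groupby("dataset"):
    x, y = group["x"].to_numpy(), group["y"].to_numpy()
    slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line y = intercept + slope*x
    print(f"{name}: mean(x)={x.mean():.2f} var(x)={x.var(ddof=1):.2f} "
          f"mean(y)={y.mean():.2f} var(y)={y.var(ddof=1):.2f} "
          f"corr={np.corrcoef(x, y)[0, 1]:.2f} "
          f"fit: y = {intercept:.2f} + {slope:.2f}x")
```

Plotting the four panels, as the figures on the following slides do, is what reveals how different the datasets really are.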
Anscombe’s Quartet
[Figure: scatter plots of the first two Anscombe datasets, y1 vs x1 and y2 vs x2]
Feature Selection
Machine learning and data mining problems use many features:
— continuous features: ranking score
— categorical features: entity type
— ordinal features: relevance labels
If predictive accuracy is the goal, it is often best to keep all predictors and use some regularization
— assuming the features are not too noisy
Feature selection: select the "most relevant" features for predicting the target variable, to get sparse models
— interpretability, speed, possibly better predictive accuracy
Example: Mutual information
Can model nonlinear, non-Gaussian dependencies:
I(X_j, Y) = \int \int p(x_j, y) \log \frac{p(x_j, y)}{p(x_j)\, p(y)} \, dx_j \, dy
Relative Importance of Features
Influence of an individual feature X_j on the variation of F(x):
J_j = \left[ E_x \left( \frac{\partial F(x)}{\partial x_j} \right)^2 \mathrm{Var}_x(x_j) \right]^{1/2}
Algorithm 1: Forward stepwise selection (Orthogonal least squares)
repeat
    select the unused feature most correlated with the current residual
    move it from unused to used and remove its contribution from the residual
    foreach feature in unused do
        orthogonalize it against the selected feature and renormalize
until stopping criterion is met
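The pseudocode above is garbled in this copy; the following is a hedged numpy reconstruction of the orthogonal-least-squares style loop it appears to describe: pick the unused column most correlated with the current residual, remove its contribution from the residual, and orthogonalize the remaining unused columns against it. The function name, the fixed-k stopping rule, and the synthetic data are assumptions.

```python
# Hedged sketch of forward stepwise selection via orthogonalization (OLS-style).
# The stopping rule (fixed k) and variable names are assumptions.
import numpy as np

def forward_stepwise(X, y, k):
    """Greedily select k columns of X; returns their original indices in order."""
    X = X.astype(float).copy()
    r = y.astype(float).copy()                 # current residual
    unused = list(range(X.shape[1]))
    selected = []
    for _ in range(k):                         # "until stopping criterion is met"
        # Feature that maximally reduces the residual error (max |correlation with r|).
        scores = [abs(X[:, j] @ r) / (np.linalg.norm(X[:, j]) + 1e-12) for j in unused]
        j = unused[int(np.argmax(scores))]
        q = X[:, j] / (np.linalg.norm(X[:, j]) + 1e-12)
        r = r - (q @ r) * q                    # remove the explained part of the residual
        selected.append(j)
        unused.remove(j)
        for i in unused:                       # orthogonalize remaining unused features
            X[:, i] -= (q @ X[:, i]) * q
    return selected

# Example usage with synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.1 * rng.normal(size=200)
print(forward_stepwise(X, y, k=2))             # expected to pick columns 2 and 7
```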
Machine Learning and Web Search
Part 4: Feature Selection and Dimension Reduction
Hongyuan Zha
College of Computing, Georgia Institute of Technology
Recovers the correlation coefficient if p(X_j, Y) is Gaussian.
In the discrete case,
I(X_j, Y) = \sum_{x_j} \sum_{y} p(x_j, y) \log \frac{p(x_j, y)}{p(x_j)\, p(y)}
Estimation from data (x^i, y^i), i = 1, \ldots, N:
\hat{p}(x_j = a, y = b) = \frac{1}{N} \sum_{i=1}^{N} 1(x_j^i = a, y^i = b)
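A minimal numpy sketch of the plug-in estimator above for a discrete feature: the empirical joint frequencies feed directly into the discrete mutual-information sum. The toy data are an assumption.

```python
# Minimal sketch: plug-in estimate of I(X_j, Y) for a discrete feature.
# Counts give the empirical joint p(x_j = a, y = b); the MI sum then follows
# the discrete formula on this slide.  The toy data below are illustrative.
import numpy as np

def mutual_information(xj, y):
    xj, y = np.asarray(xj), np.asarray(y)
    mi = 0.0
    for a in np.unique(xj):
        for b in np.unique(y):
            p_ab = np.mean((xj == a) & (y == b))   # (1/N) sum_i 1(x_j^i = a, y^i = b)
            p_a, p_b = np.mean(xj == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

xj = [0, 0, 1, 1, 1, 0, 1, 0]
y  = [0, 0, 1, 1, 0, 0, 1, 1]
print(f"I(Xj, Y) ~ {mutual_information(xj, y):.3f} nats")
```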
Forward Selection for Linear Regression
At each step, add the feature that maximally reduces the residual error.
Once features 1, . . . , k have been chosen, project onto the subspace orthogonal to them.
Types of Feature Selection
Number of variables:
— Univariate method: considers one feature at a time
— Multivariate method: considers subsets of variables (features) together
Types of methods:
— Filter method: ranks features or feature subsets independently of the classifier/regression function
— Wrapper method: uses a classifier/regression function to assess features or feature subsets
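As a hedged illustration of the filter/wrapper distinction, the sketch below scores features in both ways on a synthetic problem: a univariate filter score (the ANOVA F statistic, computed without any classifier) and a wrapper score (cross-validated accuracy of a logistic-regression model on one candidate subset). The dataset, the candidate subset, and the choice of classifier are assumptions.

```python
# Hedged sketch: filter scoring vs. wrapper scoring of features.
# Dataset, candidate subset, and classifier are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Filter: rank each feature on its own, with no classifier involved.
f_scores, _ = f_classif(X, y)
print("filter ranking:", np.argsort(-f_scores))

# Wrapper: evaluate a candidate subset by the accuracy of an actual classifier.
subset = [0, 3, 5]                              # an arbitrary candidate subset
acc = cross_val_score(LogisticRegression(max_iter=1000), X[:, subset], y, cv=5).mean()
print(f"wrapper score for features {subset}: CV accuracy = {acc:.3f}")
```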
Trees are piece-wise linear, so an approximation is needed:
J_j^2(T) = \sum_{t=1}^{L-1} I_t^2 \, 1(v_t = j)
where T is a regression tree with L terminal nodes and the summation is over its non-terminal nodes:
— v_t is the splitting variable of node t
— I_t^2 is the improvement in squared error from the split at node t
For a sum of trees T_1, \ldots, T_M from gradient boosting,
J_j^2 = \frac{1}{M} \sum_{m=1}^{M} J_j^2(T_m)
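A hedged sketch of tree-ensemble importance in practice: scikit-learn's GradientBoostingRegressor exposes feature_importances_, an impurity-improvement measure averaged over the trees that is closely related (up to normalization) to the J_j^2 defined above. The synthetic data are illustrative.

```python
# Hedged sketch: relative feature importance from a gradient-boosted tree ensemble.
# feature_importances_ is an impurity-improvement measure averaged over the trees,
# closely related (up to normalization) to the J_j^2 on this slide.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 4.0 * X[:, 0] + np.sin(3.0 * X[:, 2]) + 0.1 * rng.normal(size=500)

model = GradientBoostingRegressor(n_estimators=200, max_depth=2, random_state=0).fit(X, y)
for j, imp in enumerate(model.feature_importances_):
    print(f"feature {j}: importance = {imp:.3f}")   # features 0 and 2 should dominate
```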
Choosing Set of Features
Forward selection: choose the single highest-scoring feature X_j
Rescore all features, conditioned on X_j being selected
— Score(X_i) = I(X_i, Y | X_j)
[Figure: scatter plots of Anscombe datasets 3 and 4, y3 vs x3 and y4 vs x4]
Anscombe’s Quartet
Outline
1. Filter Methods
2. Model Selection Methods
3. Linear Dimension Reduction: PCA and SVD
4. Fisher Linear Discriminant
Repeat, calculating new conditioned scores on each iteration
Backward selection: start with all features, and score each feature conditioned on the assumption that all others are included. Then:
— Remove the feature with the lowest (conditioned) score
— Rescore all features, conditioned on the new, reduced feature set
— Repeat
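A hedged numpy sketch of the rescoring step for discrete features: after X_j has been selected, each remaining feature is scored by a plug-in estimate of I(X_i, Y | X_j). The helper name conditional_mi and the toy data are assumptions.

```python
# Hedged sketch: rescoring features by conditional mutual information
# I(X_i ; Y | X_j) after X_j has been selected (discrete features, plug-in counts).
# The toy data and helper names are illustrative assumptions.
import numpy as np

def conditional_mi(xi, y, xj):
    xi, y, xj = map(np.asarray, (xi, y, xj))
    cmi = 0.0
    for c in np.unique(xj):                     # condition on each value of the selected feature
        mask = xj == c
        p_c = mask.mean()
        xi_c, y_c = xi[mask], y[mask]
        for a in np.unique(xi_c):
            for b in np.unique(y_c):
                p_ab = np.mean((xi_c == a) & (y_c == b))
                p_a, p_b = np.mean(xi_c == a), np.mean(y_c == b)
                if p_ab > 0:
                    cmi += p_c * p_ab * np.log(p_ab / (p_a * p_b))
    return cmi

# X has 3 binary features; suppose feature 0 was chosen first.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 3))
y = X[:, 0] ^ X[:, 1]                           # y depends on features 0 and 1 jointly
scores = [conditional_mi(X[:, i], y, X[:, 0]) for i in (1, 2)]
print("I(X1;Y|X0) =", round(scores[0], 3), " I(X2;Y|X0) =", round(scores[1], 3))
```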
Lasso and L1 Regularization
Linear regression via the Lasso (Tibshirani, 1996)
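As a hedged sketch of why the L1 penalty acts as an embedded feature selector, the code below compares Lasso and ridge fits on synthetic data: the L1 penalty drives many coefficients exactly to zero. The penalty strengths and the dataset are illustrative assumptions.

```python
# Hedged sketch: L1 (Lasso) vs. L2 (ridge) regularization on synthetic data.
# The L1 penalty drives many coefficients exactly to zero, giving a sparse model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("nonzero Lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
print("nonzero ridge coefficients:", int(np.sum(ridge.coef_ != 0)))
```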
Filter Methods: Scoring Individual Features
X = [X_1, \ldots, X_n] input features, Y target variable, (X, Y) \sim P(x, y)
Compute the "relevance" of X_j (the j-th feature) to Y marginally
— computationally efficient
X_j and Y independent: P(X_j, Y) = P(X_j) P(Y)
Example: the correlation coefficient, which measures the extent to which X_j and Y are linearly related,
\rho(X_j, Y) = \frac{\mathrm{Cov}(X_j, Y)}{\sqrt{\mathrm{Var}(X_j)\, \mathrm{Var}(Y)}}
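A minimal numpy sketch of this marginal score: compute ρ(X_j, Y) column by column and rank features by |ρ|. The synthetic data are an assumption.

```python
# Minimal sketch: marginal (filter) scoring of features by |correlation with Y|.
# The synthetic data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 1] - 1.0 * X[:, 4] + 0.5 * rng.normal(size=300)

# rho(X_j, Y) = Cov(X_j, Y) / sqrt(Var(X_j) Var(Y)), computed column by column
rho = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
ranking = np.argsort(-np.abs(rho))
print("correlation per feature:", np.round(rho, 2))
print("features ranked by |rho|:", ranking)
```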