R Principal Component Analysis Case Study (with Code and Data)
Question1:
Q1.1:
> print(eigen_values)
[1] 2.4802416 0.9897652 0.3565632 0.1734301
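The four eigenvalues above sum to 4, which is what the correlation matrix of four variables would give. A minimal sketch of such a computation, assuming R's built-in USArrests data as a stand-in (the source does not name its data set, so this is an assumption; the names eigen_values and eigen_vectors mirror the console output):

```r
# Assumption: the built-in USArrests data stand in for the (unnamed) data set.
X <- USArrests

# Eigen-decomposition of the correlation matrix, i.e. PCA on standardized variables.
decomp <- eigen(cor(X))
eigen_values  <- decomp$values   # variances of the principal components
eigen_vectors <- decomp$vectors  # loadings, one column per component

print(eigen_values)
print(eigen_vectors)
```

The sign of each eigenvector column is arbitrary, so the signs may differ between platforms without changing the components.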
Q1.2:
> print(eigen_vectors)
           [,1]       [,2]       [,3]        [,4]
[1,] -0.5358995  0.4181809 -0.3412327  0.64922780
[2,] -0.5831836  0.1879856 -0.2681484 -0.74340748
[3,] -0.2781909 -0.8728062 -0.3780158  0.13387773
[4,] -0.5434321 -0.1673186  0.8177779  0.08902432
Q1.3:
> print('variance for each eigen_values')
[1] "variance for each eigen_values"
> print(scores)
      Comp.1       Comp.2       Comp.3       Comp.4 
0.9655342206 0.027******* 0.0057995349 0.0008489079 
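The values under Comp.1 to Comp.4 read as each component's share of the total variance. A minimal sketch of that computation, assuming the built-in USArrests data and princomp (both are assumptions; the source shows only the printed result, and the name scores mirrors the console output):

```r
# Assumption: USArrests stands in for the data; princomp() works on the
# covariance matrix by default (cor = FALSE).
pca <- princomp(USArrests)

# Share of total variance carried by each component: sdev^2 / sum(sdev^2).
scores <- pca$sdev^2 / sum(pca$sdev^2)
print(scores)
```

Because the covariance matrix depends on the units of measurement, a variable with a large raw variance can dominate the first component; passing cor = TRUE to princomp standardizes the variables first.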
Question2:
Q2.1:
See the code.
Q2.2:
The result of ordinary linear regression:
> OLS
Call:
lm(formula = Apps ~ ., data = collegeTrainData)
Coefficients:
(Intercept)      Private       Accept       Enroll    Top10perc    Top25perc  F.Undergrad
 -8.753e+02   -6.409e+02    1.345e+00   -2.841e-01    4.792e+01   -1.465e+01    1.980e-02
P.Undergrad     Outstate   Room.Board        Books     Personal          PhD     Terminal
 -1.612e-03   -4.370e-02    2.831e-01    2.356e-01    8.284e-02    1.552e-01   -9.877e+00
  S.F.Ratio  perc.alumni       Expend    Grad.Rate
  1.547e+01   -6.582e+00    6.118e-02    4.944e+00
The resulting MSE and R-squared are:
> print(mse)
[1] 1454941
> print(rsqured)
[1] 0.9162122
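The source does not show how mse and rsqured were computed. A minimal sketch on a held-out test set, using the built-in mtcars data as a stand-in (the data, the split, the seed, and the names train, test, and rsquared are all assumptions):

```r
set.seed(1)                                   # arbitrary seed, for reproducibility
idx   <- sample(nrow(mtcars), 22)             # hypothetical train/test split
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

OLS  <- lm(mpg ~ wt + hp, data = train)       # ordinary least squares fit
pred <- predict(OLS, newdata = test)

# Mean squared prediction error on the test set.
mse <- mean((test$mpg - pred)^2)

# R-squared: 1 - residual sum of squares / total sum of squares.
rsquared <- 1 - sum((test$mpg - pred)^2) / sum((test$mpg - mean(test$mpg))^2)

print(mse)
print(rsquared)
```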
Q2.3:
Using the lambda grid seq(0, 1, 0.05) in R, i.e. from 0 to 1 in steps of 0.05,
the cross-validated ridge regression gives:
> print(mse)
[1] 1464329
> print(ridgeRsquared)
[1] 0.9156716
This is slightly worse than ordinary linear regression (MSE 1464329 vs. 1454941; R-squared 0.9157 vs. 0.9162).
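The source does not show how the ridge fit was cross-validated. The sketch below is one way to do it in base R, using the closed-form ridge solution on standardized predictors and k-fold cross validation over the same lambda grid. The helper names ridge_fit, ridge_predict, and cv_mse, the mtcars stand-in data, and the choice of 5 folds are all assumptions, and a package such as glmnet scales lambda differently.

```r
# Closed-form ridge: beta = (X'X + lambda * I)^(-1) X'y on centered/scaled data.
ridge_fit <- function(X, y, lambda) {
  mu <- colMeans(X); s <- apply(X, 2, sd)
  Xs <- scale(X, center = mu, scale = s)
  beta <- solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, y - mean(y)))
  list(beta = drop(beta), mu = mu, s = s, ybar = mean(y))
}

# Predict on new rows, reusing the training means and standard deviations.
ridge_predict <- function(fit, Xnew) {
  Xs <- scale(Xnew, center = fit$mu, scale = fit$s)
  drop(fit$ybar + Xs %*% fit$beta)
}

# k-fold cross-validated MSE for one lambda.
cv_mse <- function(X, y, lambda, folds) {
  errs <- sapply(sort(unique(folds)), function(k) {
    fit <- ridge_fit(X[folds != k, , drop = FALSE], y[folds != k], lambda)
    mean((y[folds == k] - ridge_predict(fit, X[folds == k, , drop = FALSE]))^2)
  })
  mean(errs)
}

set.seed(1)
X <- as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")])  # stand-in predictors
y <- mtcars$mpg
folds   <- sample(rep(1:5, length.out = nrow(X)))        # 5 folds (assumption)
lambdas <- seq(0, 1, 0.05)                               # grid from the text

cv <- sapply(lambdas, function(l) cv_mse(X, y, l, folds))
best_lambda <- lambdas[which.min(cv)]
print(best_lambda)
```

With lambda = 0 this reduces to ordinary least squares, so the grid includes the OLS fit as a special case.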
Q2.4:
Using the same lambda grid seq(0, 1, 0.05) in R, i.e. from 0 to 1 in steps of 0.05,
the cross-validated lasso regression gives:
> mse
[1] 1471047
> LassoRsquared
[1] 0.9152847
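Lasso can set coefficients exactly to zero because its coordinate update applies soft-thresholding. A minimal base-R sketch of cyclic coordinate descent on standardized predictors follows. The function names soft_threshold and lasso_cd, the mtcars stand-in data, and the objective scaling (1/(2n) times the residual sum of squares plus lambda times the L1 norm) are assumptions, not the author's code.

```r
# Soft-thresholding: shrinks z toward 0 and returns exactly 0 when |z| <= g.
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Cyclic coordinate descent for (1/(2n)) * ||y - X b||^2 + lambda * ||b||_1.
# X is standardized so each column has mean 0 and mean square 1; y is centered.
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n  <- nrow(X)
  Xs <- scale(X) * sqrt(n / (n - 1))    # columns now have mean(x^2) == 1
  yc <- y - mean(y)
  b  <- setNames(numeric(ncol(Xs)), colnames(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_along(b)) {
      r_j  <- yc - Xs[, -j, drop = FALSE] %*% b[-j]   # partial residual without x_j
      b[j] <- soft_threshold(mean(Xs[, j] * r_j), lambda)
    }
  }
  b
}

X <- as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")])  # stand-in predictors
y <- mtcars$mpg
print(lasso_cd(X, y, lambda = 0.5))   # moderate penalty: coefficients shrink toward 0
print(lasso_cd(X, y, lambda = 10))    # penalty above every |mean(x_j * y)|: all exactly 0
```

The coefficients are on the standardized scale, so they are not directly comparable with the raw-scale coefficients printed by lm.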
I made the following table to compare the coefficients of the three models:
It can be seen that lasso sets the coefficient of "PhD" to 0, so it uses one fewer predictor than the other two models. From this I infer that the adjusted R-squared of the lasso regression is the best among the three models.
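Adjusted R-squared penalizes the number of predictors, so dropping PhD helps only if the loss in R-squared is small relative to that penalty. A small sketch of the formula, as a way to check the inference; the function name adjusted_r2 and the sample size n = 400 are hypothetical, since the source does not state the number of observations:

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of observations and p the number of predictors.
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 400                                   # hypothetical sample size
print(adjusted_r2(0.9162122, n, p = 17))   # OLS: all 17 predictors
print(adjusted_r2(0.9152847, n, p = 16))   # lasso: PhD dropped
```

Comparing the two printed values for the actual n is a direct check; the outcome depends on n, which the source does not give.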
Question3:
Q3.1:
> h_1 = sd(F12)*(4/3/length(F12))^(1/5)
> h_1
[1] 0.3101212
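The expression for h_1 is the normal-reference (rule-of-thumb) bandwidth for a Gaussian kernel, h = sd * (4 / (3n))^(1/5), which is roughly 1.06 * sd * n^(-1/5). A minimal sketch wrapping it in a function; the name rule_of_thumb_bw and the simulated sample are assumptions:

```r
# Normal-reference bandwidth for a Gaussian kernel: sd(x) * (4 / (3 * n))^(1/5).
rule_of_thumb_bw <- function(x) sd(x) * (4 / (3 * length(x)))^(1/5)

set.seed(1)
x   <- rnorm(500)             # simulated stand-in sample
h_1 <- rule_of_thumb_bw(x)
print(h_1)
```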
Q3.2:
> min(F12)
[1] -2.995732
> max(F12)
[1] 7.930889
The minimum of log_F12 is about -2.996 and the maximum is about 7.93. I therefore evaluate the density on a grid from -3 to 8 in steps of 0.05. The plot of the estimated density follows.
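The density estimate on such a grid can be computed directly from the definition f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h). A minimal sketch with a Gaussian kernel; the function name kde_gauss and the simulated sample are assumptions, since the source shows neither its code nor which kernel it used:

```r
# Gaussian kernel density estimate evaluated at each point of `grid`.
kde_gauss <- function(grid, x, h) {
  sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)
}

set.seed(1)
F12  <- rnorm(300, mean = 2.5, sd = 2)          # simulated stand-in for the data
h_1  <- sd(F12) * (4 / 3 / length(F12))^(1/5)   # bandwidth as in Q3.1
grid <- seq(-3, 8, by = 0.05)                   # evaluation grid from the text
dens <- kde_gauss(grid, F12, h_1)

print(summary(dens))
```

Calling plot(grid, dens, type = "l") afterwards draws the estimated density curve.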
Q3.3:
I choose four other bandwidths:
h_2 <- 0.1
h_3 <- 0.2
h_4 <- 0.5
h_5 <- 0.7
This gives the following plot:
The middle panel is the plot from Q3.2 (question b).
The numerical summary of the estimated density for the five bandwidths follows.
We can see that a larger bandwidth yields a smoother, flatter density estimate.
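The effect of the bandwidth can also be checked numerically. The sketch below evaluates a Gaussian kernel estimate for the five bandwidths and summarizes each curve; the function name kde_gauss and the simulated sample are assumptions, not the author's code or data:

```r
# Gaussian kernel density estimate evaluated at each point of `grid`.
kde_gauss <- function(grid, x, h) {
  sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)
}

set.seed(1)
x    <- rnorm(300, mean = 2.5, sd = 2)         # simulated stand-in sample
grid <- seq(-3, 8, by = 0.05)
h_1  <- sd(x) * (4 / 3 / length(x))^(1/5)      # rule-of-thumb bandwidth
bandwidths <- sort(c(h_1 = h_1, h_2 = 0.1, h_3 = 0.2, h_4 = 0.5, h_5 = 0.7))

# One column of density values per bandwidth.
dens <- sapply(bandwidths, function(h) kde_gauss(grid, x, h))

# Numerical summary per bandwidth; the peak tends to fall as h grows.
print(apply(dens, 2, summary))
```

A small bandwidth follows individual observations and produces a spiky curve with a higher maximum, while a large bandwidth averages over more of the sample and flattens the curve.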