R linear regression case report

Initial R commands. Install the car package with `install.packages("car")` and load it with `library(car)`. Load the data set from the package with `data(Salaries, package = "car")` (in recent versions the data sets ship in the companion carData package). View the data on screen with `View(Salaries)`, open the description of the data with `help(Salaries)`, and list the exact variable names with `names(Salaries)`. Decide which variables are quantitative and which are qualitative.

Scatter plots. Some variables are not quantitative but categorical (rank, for example, is ordinal); we draw the scatter plots anyway. Based on the plots, we will fit a simple regression model between the variables yrs.since.phd and yrs.service, but first let us review the correlation between them and determine the correlation coefficient under the assumption of normality. Given the high correlation between these two variables, interpret the result.

Results. The choice of yrs.since.phd as the dependent variable is correct; explain why. Write the model in the form y = intercept + slope * x. Interpret the intercept and the slope.

Hypothesis tests. According to the test result, with p-value = 2e-16, is the hypothesis rejected at the 5% significance level? Slope equal to zero: according to the test result, with p-value = 2e-16, is the hypothesis about the slope rejected at the 5% significance level? Consider whether the variable yrs.service is significant in the model.

Model fit and tests. Given that R-squared is 0.827, do you think the model has a good linear fit? Interpret the adjusted R-squared. The test of whether the data fit the model is given by the F-statistic: 1894 on 1 and 395 DF, p-value: < 2.2e-16. Do you consider that the model fits?

## Using the model for estimation

Find the estimated value of yrs.since.phd for the following values of the variable x:

Model plots

Model validation. Here we want to determine the residuals (errors) between the model and the observations and check whether the residuals satisfy:

1. Mean = 0
2. Constant variance
3. Normal distribution, N(0, constant)

The residuals are generated next (for convenience we only generate the first 20, [1:20]). Fitted values, that is, the value of the variable y given by the model (again only the first 20, [1:20]). Normality test: we use the QQ plot, which shows whether the data follow the normal distribution (they do if the points lie close to the line). Do you consider that the points more than two standard deviations away are normally distributed?
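The steps listed above (correlation, simple regression, residual checks, QQ plot, prediction) can be sketched as follows. This is only an illustration under stated assumptions: the data frame is a simulated stand-in that borrows the column names yrs.service and yrs.since.phd, so none of the numbers it prints are the Salaries results quoted in the text.

```R
# Simulated stand-in for the two Salaries columns (hypothetical values, NOT the real data).
set.seed(1)
yrs.service   <- round(runif(200, 0, 40))
yrs.since.phd <- 4 + 0.95 * yrs.service + rnorm(200, sd = 5)
dat <- data.frame(yrs.service, yrs.since.phd)

# Pearson correlation, then the simple model y = intercept + slope * x.
r   <- cor(dat$yrs.service, dat$yrs.since.phd)
fit <- lm(yrs.since.phd ~ yrs.service, data = dat)
print(summary(fit)$coefficients)  # t tests of "intercept = 0" and "slope = 0"

# Residual checks: mean close to zero, first 20 residuals and fitted values.
res      <- residuals(fit)
mean_res <- mean(res)
print(head(res, 20))
print(head(fitted(fit), 20))

# QQ plot: points close to the line suggest approximately normal residuals.
qqnorm(res)
qqline(res)

# Estimates of yrs.since.phd for a few illustrative x values.
pred <- predict(fit, newdata = data.frame(yrs.service = c(5, 10, 20)))
print(pred)
```

With the real data, `dat` would be replaced by `Salaries`; the residual mean of an OLS fit with an intercept is zero up to rounding, so the first check is mainly a sanity check.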
Multiple correlation

Create a data frame of numeric variables:

Data.num$Status = as.numeric(Data.num$Status)
Data.num$Length = as.numeric(Data.num$Length)
Data.num$Migr = as.numeric(Data.num$Migr)
Data.num$Insect = as.numeric(Data.num$Insect)
Data.num$Diet = as.numeric(Data.num$Diet)
Data.num$Broods = as.numeric(Data.num$Broods)
Data.num$Wood = as.numeric(Data.num$Wood)
Data.num$Upland = as.numeric(Data.num$Upland)
Data.num$Water = as.numeric(Data.num$Water)
Data.num$Release = as.numeric(Data.num$Release)
Data.num$Indiv = as.numeric(Data.num$Indiv)

### Check the new data frame

headtail(Data.num)

 1 1 1520 9600.0 1.21 1 12 2 6.0  1 0 0 1  6 29
 2 1 1250 5000.0 0.56 1  0 1 6.0  1 0 0 1 10 85
 3 1  870 3360.0 0.07 1  0 1 4.0  1 0 0 1  3  8
77 0  170   31.0 0.55 3 12 2 4.0 NA 1 0 0  1  2
78 0  210   36.9 2.00 2  8 2 3.7  1 0 0 1  1  2
79 0  225  106.5 1.20 2 12 2 4.8  2 0 0 0  1  2

Check the correlation between the variables

### Note that I used Spearman correlations here

Multiple logistic regression example

In this example, the data contain missing values.
2. Clean the data

2.a Put in the data columns

pimalm <- lm(class ~ npreg + glucose + bp + triceps + insulin + bmi + diabetes + age, data = pima)

Remove the variables (insulin, age) with large p-values (p-value > 0.005). After these variables are dropped, the R-squared value remains about the same, which suggests that the dropped variables have little effect on the model. Residual analysis shows an almost straight line, with the residuals distributed around zero. Due to this pattern, this model is not as robust.

qqnorm(resid(pimalm), col = "blue")
qqline(resid(pimalm), col = "red")

The second dataset has much simpler variables. Although intuitively both variables affect the output, the size of each variable's effect is interesting. This dataset was examined to get a better sense of how multivariate regression performs.

allbacks.lm <- lm(weight ~ volume + area, data = allbacks)
summary(allbacks.lm)
qqnorm(resid(allbacks.lm), col = "blue")
qqline(resid(allbacks.lm), col = "red")
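Make sure the argument na.strings is set to c("") so that every missing value is encoded as NA.

training.data.raw <- read.csv('train.csv', header = TRUE, na.strings = c(""))

Now we need to check for missing values and look at how many unique values each variable has, using the sapply() function, which applies the function passed as an argument to every column of the data frame.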
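As a minimal sketch of the check described above, the snippet below reads a tiny inline CSV instead of train.csv. The column names Age, Embarked and Survived are hypothetical placeholders chosen for illustration, not necessarily those of the original file.

```R
# Tiny inline CSV standing in for train.csv (hypothetical columns and values).
csv_text <- "Age,Embarked,Survived\n22,S,0\n,C,1\n35,,1\n,S,0"

# na.strings = c("") turns empty fields into NA, as in the text.
df <- read.csv(text = csv_text, header = TRUE, na.strings = c(""))

# sapply() applies the anonymous function to every column of the data frame.
n_missing <- sapply(df, function(x) sum(is.na(x)))      # missing values per column
n_unique  <- sapply(df, function(x) length(unique(x)))  # distinct values (NA counts as one)

print(n_missing)
print(n_unique)
```

Note that `unique()` treats NA as a value of its own; wrap the column in `na.omit()` if missing values should not be counted.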
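R data analysis regression case study: are immigration policy preferences based on accurate stereotypes?

Data renaming, recoding, restructuring

| Group <chr> | Count <dbl> | Percent <dbl> |
|---|---|---|
| 6 | 476 | 56.00 |
| 5 | 179 | 21.06 |
| 2 | 60 | 7.06 |
| 3 | 54 | 6.35 |
| 4 | 46 | 5.41 |
| 1 | 27 | 3.18 |
| 0 | 8 | 0.94 |

Reanalysis of Kirkegaard & Bjerrekær 2016: determine the overall accuracy for the subset of 32 countries used in this study.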
#reduce the sample
#accuracy
GG_scatter(dk_fiscal, "mean_estimate", "dk_benefits_use",
GG_scatter(dk_fiscal_sub, "mean_estimate", "dk_benefits_use", case_names="Names")
GG_scatter(dk_fiscal, "mean_estimate", "dk_fiscal", case_names="Names")
#compare Muslim bias measures
#can we make a bias measure that works without ratio scale

Score stereotype accuracy

#add metric to main data
d$stereotype_accuracy=indi_accuracy$pearson_r
GG_save("figures/aggr_retest_stereotypes.png")
GG_save("figures/aggregate_accuracy.png")
GG_save("figures/aggregate_accuracy_no_SYR.png")

Muslim bias in aggregate data

GG_save("figures/aggregate_muslim_bias.png")

Immigrant preferences and stereotypes

GG_save("figures/aggregate_muslim_bias_old_data.png")
GG_save("figures/aggr_fiscal_net_opposition_no_SYR.png")
GG_save("figures/aggr_stereotype_net_opposition.png")
GG_save("figures/aggr_stereotype_net_opposition_no_SYR.png")

| lhs <chr> | op <chr> | rhs <chr> | est <dbl> | se <dbl> | z <dbl> | pvalue <dbl> |
|---|---|---|---|---|---|---|
| net_opposition | ~ | mean_estimate_fiscal | -4.4e-01 | 0.02303 | -19.17 | 0.0e+00 |
| net_opposition | ~ | Muslim_frac | 4.3e-02 | 0.05473 | 0.79 | 4.3e-01 |
| net_opposition | ~~ | net_opposition | 6.9e-03 | 0.00175 | 3.94 | 8.3e-05 |
| dk_fiscal | ~~ | dk_fiscal | 6.2e+03 | 0.00000 | NA | NA |
| Muslim_frac | ~~ | Muslim_frac | 1.7e-01 | 0.0000 | NA | NA |

Individual level models

GG_scatter(example_muslim_bias, "Muslim", "resid", case_names="name")+
#exclude Syria
#distribution
describe(d$Muslim_bias_r)%>%print()
GG_save("figures/muslim_bias_dist.png")
## `stat_bin()` using `bins = 30`. Pick better value with `
GG_scatter(mediation_example, "Muslim", "resid", case_names="name", repel_names=T)+
scale_x_continuous("Muslim % in home country", labels=scal
#stereotypes and preferences
mediation_model=plyr::ldply(seq_along_rows(d), function(r
GG_denhist(mediation_model, "Muslim_resid_OLS", vline=median)
## `stat_bin()` using `bins = 30`. Pick better value with `
#add to main data
d$Muslim_preference=mediation_model$Muslim_resid_OLS

Predictors of individual primary outcomes

#party models
rms::ols(stereotype_accuracy~party_vote, data=d)
GG_group_means(d, "Muslim_bias_r", "party_vote")+
theme(axis.text.x=element_text(angle=-30, hjust=0))
GG_group_means(d, "Muslim_preference", "party_vote")+
#party agreement cors
wtd.cors(d_parties)
It is also useful to be able to describe the relationship between two numerical variables, such as runs and at_bats above.
# Click two points to make a line.

After running this command, you'll be prompted to click two points on the plot to define a line. Once you've done that, the line you specified will be shown in black and the residuals in blue. Note that there are 30 residuals, one for each of the 30 observations. Recall that the residuals are the difference between the observed values and the values predicted by the line:

$e_i = y_i - \hat{y}_i$

The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.

## Click two points to make a line.

Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares.

Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

Answer: The smallest sum of squares is 123721.9. It explains the dispersion from the mean.

The linear model

It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function in R to fit the linear model (a.k.a. regression line).

The first argument in the function lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of at_bats. The second argument specifies that R should look in the mlb11 data frame to find the runs and at_bats variables.

The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function.

Let's consider this output piece by piece. First, the formula used to describe the model is shown at the top. After the formula you find the five-number summary of the residuals. The "Coefficients" table shown next is key; its first column displays the linear model's y-intercept and the coefficient of at_bats. With this table, we can write down the least squares regression line for the linear model:

$\hat{y} = -2789.2429 + 0.6305 \times \text{at\_bats}$

One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, $R^2$. The $R^2$ value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 37.3% of the variability in runs is explained by at-bats.

… output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

Answer: homeruns has a positive relationship with runs: each additional home run is associated with about 1.835 more runs.

Prediction and prediction errors

Let's create a scatterplot with the least squares line laid on top.

The function abline plots a line based on its slope and intercept. Here, we used a shortcut by providing the model m1, which contains both parameter estimates. This line can be used to predict $y$ at any value of $x$. When predictions are made for values of $x$ that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They're also used to compute the residuals.

… many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

Model diagnostics

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. Recall that any code following a # is intended to be a comment that helps understand the code but is ignored by R.

6. Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

Answer: The residuals show no apparent pattern and are centered on a mean of 0, which indicates that the relationship between runs and at-bats is linear.

Nearly normal residuals: To check this condition, we can look at a histogram or a normal probability plot of the residuals.

7. Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

Answer: Yes, the residuals are nearly normal.

Constant variability:

1. Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

Answer: Yes, the scatterplot shows that they have a linear relationship.

2. How does this relationship compare to the relationship between runs and at_bats? Use the $R^2$ values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

3. Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we've discussed (for the sake of conciseness, only include output for the best variable, not all five).

Answer: new_obs is the best predictor of runs, since it has the smallest standard error, meaning that the points lie on or very close to the line.

4. Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team's success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we've analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

Answer: new_slug (87.85%), new_onbase (77.85%) and new_obs (68.84%) predict runs better than the old variables.

5. Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.
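The idea that the least squares line minimizes the sum of squared residuals can be checked numerically. The sketch below is an assumption-laden illustration: the mlb11 data are not available here, so it simulates 30 teams with hypothetical at_bats and runs values, and the hand-picked line (-2500, 0.58) plays the role of a line clicked in plot_ss.

```R
# Simulated stand-in for mlb11 (hypothetical values, NOT the real data).
set.seed(42)
at_bats <- runif(30, 5400, 5700)
runs    <- -2789 + 0.63 * at_bats + rnorm(30, sd = 60)

# Sum of squared residuals e_i = y_i - yhat_i for any candidate line.
sse <- function(intercept, slope) {
  sum((runs - (intercept + slope * at_bats))^2)
}

m1        <- lm(runs ~ at_bats)   # least squares fit
sse_ls    <- sum(residuals(m1)^2) # SSE of the fitted line
sse_guess <- sse(-2500, 0.58)     # SSE of an arbitrary hand-picked line

# R^2 is the share of variability in runs explained by at_bats.
r2 <- summary(m1)$r.squared

# Prediction inside the observed range of at_bats.
pred <- predict(m1, newdata = data.frame(at_bats = 5578))

print(c(least_squares = sse_ls, hand_picked = sse_guess))
print(r2)
print(pred)
```

Any other intercept and slope passed to `sse()` gives a value at least as large as `sse_ls`, which is what repeated trial and error with plot_ss approaches from above.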
Chapter 2

DJM

30 January 2018

What is this chapter about?

Problems with regression, and in particular, linear regression.

A quick overview:

1. The truth is almost never linear.
2. Collinearity can cause difficulties for numerics and interpretation.
3. The estimator depends strongly on the marginal distribution of X.
4. Leaving out important variables is bad.
5. Noisy measurements of variables can be bad, but it may not matter.

```R
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.1)
df <- data.frame(x = x, y = y)
```

Here we generate a data frame with 100 data points, where x ranges from 0 to 10 and y is sin(x) plus normally distributed noise with a standard deviation of 0.1.
```R
fit <- lm(y ~ I(x^2) + x, data = df)
```

Here we fit a quadratic polynomial of the form y = a + bx + cx^2, where a, b and c are the fitted coefficients.
```R
plot(df$x, df$y)
lines(df$x, predict(fit), col = "red")
```

Here we first draw a scatter plot of the original data and then use the lines function to draw the fitted curve.
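To connect this example with the point that the truth is almost never linear, the sketch below compares the quadratic fit with a more flexible polynomial on the same kind of simulated data. The degree 7 and the use of `poly()` are illustrative choices, not something taken from the original text.

```R
# Same simulated data as in the example above, with a fixed seed for reproducibility.
set.seed(123)
x  <- seq(0, 10, length.out = 100)
y  <- sin(x) + rnorm(100, sd = 0.1)
df <- data.frame(x = x, y = y)

# Quadratic fit versus a degree-7 polynomial (orthogonal polynomials via poly()).
fit2 <- lm(y ~ poly(x, 2), data = df)
fit7 <- lm(y ~ poly(x, 7), data = df)

# Root mean squared error of the residuals on the training data.
rmse <- function(f) sqrt(mean(residuals(f)^2))
rmse2 <- rmse(fit2)
rmse7 <- rmse(fit7)
print(c(quadratic = rmse2, degree7 = rmse7))

# A clear wave pattern in the quadratic residuals signals a misspecified mean function.
plot(df$x, residuals(fit2), xlab = "x", ylab = "residual (quadratic fit)")
abline(h = 0, lty = 2)
```

Because the two models are nested, the training RMSE of the degree-7 fit cannot exceed that of the quadratic; whether the extra flexibility helps on new data is a separate question.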
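Motor Trend is essentially interested in the answers to these specific questions:

* Is an automatic or a manual transmission better for MPG?
* Quantify the MPG difference between automatic and manual transmissions.

The analysis shows that, when we use our best-fit model to identify which variables explain most of the variation in MPG, a manual transmission lets us drive 2.97 more miles per gallon.

Because the simplistic model shows that transmission alone explains only 35% of the variation in MPG (Appendix A.2.

### 2. Model testing (linear regression and multivariable regression)

From the ANOVA we can see that a model that accepts transmission as the only variable related to fuel consumption would be misleading.

An F value of 62.11 tells us that, if the null hypothesis were true, the probability of an F ratio this large would be below the 0.1% significance level, so we can conclude that model 2 is clearly a better predictor of fuel consumption than the model that considers transmission alone.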
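The model comparison described above can be sketched with the built-in mtcars data. The text does not say which predictors "model 2" contains, so the choice of `am + wt + qsec` below is purely an assumption for illustration, and its F statistic and coefficients need not match the figures quoted in the report.

```R
# Built-in Motor Trend data; am = 0 (automatic) or 1 (manual).
data(mtcars)

# Simplistic model: transmission only.
fit1 <- lm(mpg ~ am, data = mtcars)

# Hypothetical "model 2": transmission plus weight and quarter-mile time (assumed predictors).
fit2 <- lm(mpg ~ am + wt + qsec, data = mtcars)

# Share of MPG variation explained by transmission alone.
r2_simple <- summary(fit1)$r.squared
print(r2_simple)

# Nested-model F test: does adding wt and qsec significantly improve the fit?
cmp <- anova(fit1, fit2)
print(cmp)

# Estimated manual-versus-automatic difference after adjusting for wt and qsec.
print(coef(fit2)["am"])
```

A small p-value in the second row of the ANOVA table means the larger model fits significantly better, which is the reasoning the text applies to its own F ratio.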
R segmented regression data analysis case report

# Read the data
data = read.csv("artificial-cover.csv")
# View part of the data
head(data)
##   tree.cover shurb.grass.cover
## 1       13.2              16.8
## 2       17.2              21.8
## 3       45.4              48.8
## 4       53.6              58.7
## 5       58.5              55.5
## 6       63.3              47.2
# First load the splines package
library(splines)
# Fit with lm; the key part is bs(x, knots = c(...)), which splits the predictor into segments
fit = lm(tree.cover ~ bs(shurb.grass.cover, knots = c(25, 40, 60)), data = data)
# Predict; the prediction data are split at the same knots
pred = predict(fit, newdata = list(shurb.grass.cover = data$shurb.grass.cover), se.fit = TRUE)
# Then plot
plot(fit)
# We can build a relatively complex LOWESS model (with a smaller span) and compare it with a simple model, e.g.:
x <- data$shurb.grass.cover
y <- data$tree.cover
plot(x, y, type = "l", col = 2)
fit3 = loess(y ~ x, span = 0.2)
fit4 = loess(y ~ 1 + x + I(x > 30) + I((x - 30) * (x > 30)), span = 1, degree = 1)
par(mar = c(4, 4, 0, 0), family = "serif", mgp = c(2, 1, 0))
plot(x, y, pch = 20, col = "darkgray")
lines(x, fitted(fit3), lwd = 2, col = 2)
lines(x, fitted(fit4), lwd = 2, lty = 2)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
qplot(x, y) + geom_smooth() # overall trend
## `geom_smooth()` using method = 'loess'
qplot(x, y, group = x > 30) + geom_smooth() + theme(panel.background = element_rect(fill = 'white', colour = 'black')) # grouped before and after 30
## `geom_smooth()` using method = 'loess'

# Other data
# Read the data
data = read.csv("其他数据.csv")
# View part of the data
data = data[, 1:4]
head(data)
##   year      Soil vegetation      SEM
## 1 1999 -3.483724  -2.528836 2.681003
## 2 1999 -3.452582  -2.418049 2.348640
## 3 1999 -3.350827  -2.590552 2.696037
## 4 1999 -3.740395  -2.933848 3.627112
## 5 1999 -3.465906  -2.694211 2.333755
## 6 1999 -3.381802  -2.788154 2.656276

##### Dependent variable: Soil
# First load the splines package
library(splines)
# Fit with lm; the key part is bs(x, knots = c(...)), which splits the predictor into segments
fit = lm(Soil ~ bs(vegetation, knots = c(-2, 0, 1)), data = data)
# Predict; the prediction data are split at the same knots
pred = predict(fit, newdata = list(vegetation = data$vegetation), se.fit = TRUE)
# Then plot
plot(fit)
# We can build a relatively complex LOWESS model (with a smaller span) and compare it with a simple model, e.g.:
x <- data$vegetation
y <- data$Soil
plot(x, y, type = "l", col = 2)
fit3 = loess(y ~ x, span = 0.2)
fit4 = loess(y ~ 1 + x + I(x > 0) + I((x - 0) * (x > 0)), span = 1, degree = 1)
par(mar = c(4, 4, 0, 0), family = "serif", mgp = c(2, 1, 0))
plot(x, y, pch = 20, col = "darkgray")
lines(x, fitted(fit3), lwd = 2, col = 2)
lines(x, fitted(fit4), lwd = 2, lty = 2)
library(ggplot2)
qplot(x, y) + geom_smooth() + theme(panel.background = element_rect(fill = 'white', colour = 'black')) # overall trend
## `geom_smooth()` using method = 'loess'
qplot(x, y, group = x > 0) + geom_smooth() + theme(panel.background = element_rect(fill = 'white', colour = 'black')) # grouped before and after 0
## `geom_smooth()` using method = 'loess'

##### Dependent variable: SEM
# First load the splines package
library(splines)
# Fit with lm; the key part is bs(x, knots = c(...)), which splits the predictor into segments
fit = lm(SEM ~ bs(vegetation, knots = c(-2, 0, 1)), data = data)
# Predict; the prediction data are split at the same knots
pred = predict(fit, newdata = list(vegetation = data$vegetation), se.fit = TRUE)
# Then plot
plot(fit)
# We can build a relatively complex LOWESS model (with a smaller span) and compare it with a simple model, e.g.:
x <- data$vegetation
y <- data$SEM
plot(x, y, type = "l", col = 2)
fit3 = loess(y ~ x, span = 0.2)
fit4 = loess(y ~ 1 + x + I(x > 0) + I((x - 0) * (x > 0)), span = 1, degree = 1)
par(mar = c(4, 4, 0, 0), family = "serif", mgp = c(2, 1, 0))
plot(x, y, pch = 20, col = "darkgray")
lines(x, fitted(fit3), lwd = 2, col = 2)
lines(x, fitted(fit4), lwd = 2, lty = 2)
library(ggplot2)
qplot(x, y) + geom_smooth() + theme(panel.background = element_rect(fill = 'white', colour = 'black')) # overall trend
## `geom_smooth()` using method = 'loess'
qplot(x, y, group = x > 0) + geom_smooth() + theme(panel.background = element_rect(fill = 'white', colour = 'black')) # grouped before and after 0
## `geom_smooth()` using method = 'loess'
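LOESS regression can be run on numeric vectors with loess() to smooth them and to predict Y locally (that is, within the range of the training X values). The size of the neighborhood is controlled by the span argument, which ranges between 0 and 1.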
In this case, the best span is 0.05433, which gives the smallest SSE, 3.85e-28.
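A simple way to look for a span is to refit the LOESS model over a grid of candidate spans and compare the sum of squared errors. The sketch below does this on simulated data; the data, the grid and the seed are assumptions made for illustration and are unrelated to the span of 0.05433 reported above.

```R
# Simulated data (hypothetical): a sine curve with noise.
set.seed(7)
x <- seq(0, 10, length.out = 120)
y <- sin(x) + rnorm(120, sd = 0.2)

# Candidate neighborhood sizes for the span argument.
spans <- seq(0.2, 1, by = 0.1)

# Training SSE of the LOESS fit for each candidate span.
sse <- sapply(spans, function(s) {
  fit <- loess(y ~ x, span = s)
  sum(residuals(fit)^2)
})

best_span <- spans[which.min(sse)]
print(data.frame(span = spans, sse = sse))
print(best_span)
```

Training SSE tends to keep falling as the span shrinks, so an SSE close to zero, as in the text, usually indicates that the curve is interpolating the data; choosing the span by cross-validation or on held-out data is a safer design.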
1. Data acquisition and processing
2. OLS regression analysis
3. Time series analysis
4. Conclusions and reflections
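Reposted: several examples of fitting logistic regression models in R

Case 1: The example in this article comes from John Maindonald's book Data Analysis and Graphics Using R. The data set used is anesthetic, which comes from a set of medical data: the variable conc is the dose of the anesthetic and move indicates whether the surgical patient moved. We use nomove as the dependent variable, because the question of interest is whether increasing conc increases the probability of nomove.

First load the data set and look at the first rows. To examine the relationship between the two variables, we can use the cdplot function to draw a conditional density plot.

install.packages('DAAG')
library(lattice)
library(DAAG)
head(anesthetic)
  move conc   logconc nomove
1    0  1.0 0.0000000      1
2    1  1.2 0.1823216      0
3    0  1.4 0.3364722      1
4    1  1.4 0.3364722      0
5    1  1.2 0.1823216      0
6    0  2.5 0.9162907      1
cdplot(factor(nomove) ~ conc, data = anesthetic, main = 'Conditional density plot', ylab = 'Patient movement', xlab = 'Anesthetic dose')

The plot shows that, as the anesthetic dose increases, surgical patients tend to stay still.

anes1 = glm(nomove ~ conc, family = binomial(link = 'logit'), data = anesthetic)
summary(anes1)

The result shows:

Call:
glm(formula = nomove ~ conc, family = binomial(link = 'logit'), data = anesthetic)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.76666  -0.74407   0.03413   0.68666   2.06900

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -6.469      2.418  -2.675  0.00748 **
conc           5.567      2.044   2.724  0.00645 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 41.455 on 29 degrees of freedom
Residual deviance: 27.754 on 28 degrees of freedom
AIC: 31.754
Number of Fisher Scoring iterations: 5

Next we draw the ROC curve of the model.

anes1 = glm(nomove ~ conc, family = binomial(link = 'logit'), data = anesthetic)

Compute the model predictions:

pre = predict(anes1, type = 'response')

Put the predicted probabilities pre and the observed outcomes in one data frame:

data = data.frame(prob = pre, obs = anesthetic$nomove)

Sort by predicted probability, from low to high:

data = data[order(data$prob), ]
n = nrow(data)
tpr = fpr = rep(0, n)

Compute TPR and FPR for each cut-off value threshold, then plot them:

for (i in 1:n) {
  threshold = data$prob[i]
  tp = sum(data$prob > threshold & data$obs == 1)
  fp = sum(data$prob > threshold & data$obs == 0)
  tn = sum(data$prob <= threshold & data$obs == 0)
  fn = sum(data$prob <= threshold & data$obs == 1)
  tpr[i] = tp / (tp + fn) # true positive rate
  fpr[i] = fp / (tn + fp) # false positive rate
}
plot(fpr, tpr, type = 'l')
abline(a = 0, b = 1)

R also has dedicated packages for drawing ROC curves, such as the widely used ROCR package. Besides drawing the curve, it can compute the area under the ROC curve (AUC) to evaluate the overall performance of a classifier; the value lies between 0 and 1, and larger is better.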
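The AUC mentioned above can also be computed in base R without extra packages, using its equivalence with the Mann-Whitney rank statistic. This is a sketch on simulated data: the variable names conc and nomove are borrowed from the example, but the values are generated here and are not the anesthetic data.

```R
# Simulated stand-in for the anesthetic data (hypothetical values).
set.seed(3)
conc   <- runif(60, 0.8, 2.5)
nomove <- rbinom(60, 1, plogis(-6.5 + 5.5 * conc))

# Logistic regression and in-sample predicted probabilities.
fit  <- glm(nomove ~ conc, family = binomial(link = "logit"))
prob <- predict(fit, type = "response")

# AUC = probability that a random positive case gets a higher predicted
# probability than a random negative case (Mann-Whitney U / (n1 * n0)).
n1  <- sum(nomove == 1)
n0  <- sum(nomove == 0)
auc <- (sum(rank(prob)[nomove == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
print(auc)
```

The rank formula handles ties through average ranks, so it agrees with the trapezoidal area under the empirical ROC curve.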
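In this example our model is "mpg = β0 + β1 * hp + β2 * wt", where "β0" is the intercept and "β1" and "β2" are the coefficients.

From the output we can draw the following conclusions: 1. holding weight fixed, each one-unit increase in horsepower increases miles per gallon by 0.062 units on average (the 95% confidence interval for β1 is [0.022, 0.102]); 2. holding horsepower fixed, each one-unit increase in weight decreases miles per gallon by 0.053 units on average (the 95% confidence interval for β2 is [-0.077, -0.030]).

The closer R-squared is to 1, the more of the variation in the data the model explains.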
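A minimal sketch of how such estimates, confidence intervals and R-squared are obtained is shown below. It assumes the built-in mtcars data, which the text does not name, so the numbers it prints should not be expected to match the figures quoted above.

```R
# Assumed data set: built-in mtcars (the original text does not say which data it used).
fit <- lm(mpg ~ hp + wt, data = mtcars)

est <- coef(fit)                   # beta0 (intercept), beta1 (hp), beta2 (wt)
ci  <- confint(fit, level = 0.95)  # 95% confidence intervals for each coefficient
r2  <- summary(fit)$r.squared      # proportion of variation in mpg explained

print(cbind(estimate = est, ci))
print(r2)
```

Each coefficient is read as the average change in mpg per one-unit change in that predictor with the other predictor held fixed, and an interval that excludes zero corresponds to a coefficient that is significant at the 5% level.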
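The simplest regression is linear regression. Borrowing from Andrew Ng's lecture notes, in Figure 1.a X is the data point (the size of the tumor) and Y is the observed value (whether the tumor is malignant).

Once a linear regression model $h_\theta(x)$ has been built, it can be used to predict from the tumor size whether a tumor is malignant: $h_\theta(x) \ge 0.5$ is classified as malignant and $h_\theta(x) < 0.5$ as benign.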
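$Z_i = \ln\left(\frac{P_i}{1 - P_i}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$

Data description

Use R to run a logistic regression, build the model, write an analysis report and draw conclusions. The data had some minor problems, which have now been corrected and the data re-sent. The task was changed to use "car purchase intention (1 = buy, 0 = not buy)" as the dependent variable and the other items as independent variables. The aim is to find out which variables have a significant effect on users' car purchasing behavior.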
In this process, we will:

1. Import the data
2. Check for class bias
3. Create training and test samples
4. Build the logit model and predict on the test data
5. Run model diagnostics

Descriptive analysis of the data

View part of the data

head(inputData)

Rows 1 and 2 are NA in every column; rows 3 to 6 are shown for each group of columns.

是否有汽车购买意愿.1买0不买. 区域 城市 人均地区生产总值.元.
3 0 中部 长沙 107890
4 0 中部 长沙 107890
5 0 中部 长沙 107890
6 0 中部 长沙 107890

职工平均工资.元. 全市总人口.万人. 全市面积.平方公里.
3 56383.16 662.8 11816
4 56383.16 662.8 11816
5 56383.16 662.8 11816
6 56383.16 662.8 11816

全市人口密度.人.平方公里. 市区总人口.万人. 市区面积.平方公里.
3 560.94 299.3 1910
4 560.94 299.3 1910
5 560.94 299.3 1910
6 560.94 299.3 1910

市区人口密度.人.平方公里. 城市道路面积.万平方米. 公共汽.电.车车辆数.辆.
3 1566.75 2996 4157
4 1566.75 2996 4157
5 1566.75 2996 4157
6 1566.75 2996 4157

公交客运总量.万人次. 出租汽车数.辆. 每万人拥有公共汽车.辆.
3 73943 6915 13.89
4 73943 6915 13.89
5 73943 6915 13.89
6 73943 6915 13.89

人均城市道路面积.平方米. 私人汽车保有量.辆. 地铁条数 地铁长度
3 10.01 1200000 0 0
4 10.01 1200000 0 0
5 10.01 1200000 0 0
6 10.01 1200000 0 0

日平均温度.F.的平均值 日最高温度.F.的最大值 日最高温度.F.的平均值
3 64.42 104 71.5
4 64.42 104 71.5
5 64.42 104 71.5
6 64.42 104 71.5

日最低温度.F.的平均值 日最低温度.F.的最小值 日最高温低于0度天数
3 57.3 26 0
4 57.3 26 0
5 57.3 26 0
6 57.3 26 0

日最低温低于0度天数 日最高温高于30度天数 下雨天数 住房数 性别.1男2女.
3 22 95 173 2 1
4 22 95 173 2 2
5 22 95 173 3 1
6 22 95 173 1 1

年龄 职业类型 学生.1代表是.后同. 蓝领 白领.粉领 其他职业或无职业
3 40 4 0 0 1 0
4 30 4 0 0 1 0
5 26 4 0 0 1 0
6 30 2 0 0 1 0

电动自行车数量 汽车数量 摩托车数量 有驾照司机数 成人数 儿童数 在家
3 1 1 0 1 2 1 5
4 2 1 1 2 2 1 5
5 1 1 1 1 2 1 5
6 3 0 1 0 3 0 5

上学工作家庭收入行程出行时间 X 购买时间 购买时间.1 购买时间.2
3 4 5 10.0 0.63 NA 2009 2011
4 2 11 20.0 0.25 NA 2009 2009 2008.000
5 5 11 2.0 0.12 NA 2011 NA
6 2 3 2.7 0.17 NA 2009 2011

购买时间.3 购买时间.4 购买时间.5 购买时间.6
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA

View the dimensions of the data

[1] 948 56

Descriptive statistics of the data:

是否有汽车购买意愿.1买0不买. 区域 城市
Min.   :0.0000   东部   :414   安庆   : 37
1st Qu.:0.0000   南部   :122   青岛   : 27
Median :0.0000   北部   :121   镇江   : 27
Mean   :0.2144   中部   : 81   柳州   : 26
3rd Qu.:0.0000   西北   : 74   唐山   : 26
Max.   :1.0000   西南   : 68   赤峰   : 24
NA's   :20       (Other): 68   (Other):781

人均地区生产总值.元. 职工平均工资.元. 全市总人口.万人. 全市面积.平方公里.
Min.   : 17096   Min.   :32183   Min.   :  53.6   Min.   :  761
1st Qu.: 36340   1st Qu.:41305   1st Qu.: 345.9   1st Qu.: 7615
Median : 54034   Median :48270   Median : 613.3   Median :12065
Mean   : 63605   Mean   :49529   Mean   : 635.4   Mean   :15970
3rd Qu.: 84699   3rd Qu.:54211   3rd Qu.: 759.7   3rd Qu.:16757
Max.   :155690   Max.   :93997   Max.   :3358.4   Max.   :90021
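Steps 2 to 4 of the workflow listed above can be sketched as follows. Everything in this block is an assumption for illustration: the data frame is simulated, and the column names buy, income and age are hypothetical stand-ins for the survey variables, which are not reproduced here.

```R
# Simulated stand-in for inputData (hypothetical columns and values).
set.seed(11)
n <- 500
inputData <- data.frame(
  income = rlnorm(n, meanlog = 2, sdlog = 0.6),
  age    = sample(20:60, n, replace = TRUE)
)
inputData$buy <- rbinom(n, 1, plogis(-3 + 0.15 * inputData$income))

# Step 2: check class bias (share of 0s and 1s in the response).
print(prop.table(table(inputData$buy)))

# Step 3: stratified 70/30 split so both classes appear in training and test sets.
ones      <- which(inputData$buy == 1)
zeros     <- which(inputData$buy == 0)
train_idx <- c(sample(ones,  floor(0.7 * length(ones))),
               sample(zeros, floor(0.7 * length(zeros))))
train <- inputData[train_idx, ]
test  <- inputData[-train_idx, ]

# Step 4: fit the logit model on the training set and predict on the test set.
logit_fit <- glm(buy ~ income + age, data = train, family = binomial(link = "logit"))
pred      <- predict(logit_fit, newdata = test, type = "response")

# Simple accuracy at a 0.5 cut-off (one of several possible diagnostics).
accuracy <- mean((pred > 0.5) == test$buy)
print(accuracy)
```

With a response that is about 21% ones, as in the summary above, accuracy at a fixed 0.5 cut-off can look good even for a weak model, so the class shares from step 2 are worth keeping in view when reading it.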
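If we use linear regression to model a binary variable (as Y), the resulting model may not restrict the predicted Y values to the range 0 to 1.

Therefore we model the log odds of the event, $\ln\left(\frac{P}{1 - P}\right)$, where $P$ is the probability of the event.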
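The contrast between the two approaches can be seen on a small simulated example. The data below are generated for illustration only; the point is that linear-model predictions can fall outside [0, 1], while the logistic model works on the log-odds scale and always returns probabilities.

```R
# Simulated binary outcome (hypothetical data).
set.seed(5)
x <- seq(-4, 4, length.out = 200)
y <- rbinom(200, 1, plogis(1.5 * x))

# Linear regression on a 0/1 response versus logistic regression.
lin   <- lm(y ~ x)
logit <- glm(y ~ x, family = binomial(link = "logit"))

p_lin   <- predict(lin)                       # fitted values, not bounded
p_logit <- predict(logit, type = "response")  # fitted probabilities P
log_odds <- predict(logit, type = "link")     # fitted ln(P / (1 - P))

print(range(p_lin))
print(range(p_logit))

# The link-scale prediction is exactly the log odds of the fitted probability.
same_scale <- isTRUE(all.equal(log_odds, log(p_logit / (1 - p_logit)),
                               check.attributes = FALSE))
print(same_scale)
```

Inverting the log odds with `plogis()` is what maps any real-valued linear predictor back into the interval (0, 1).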
#load data
kb16=read_rds("data/Kirkegaard and Bjerrekær 2016 data.rds")
kb16_aggr=read_rds("data/Kirkegaard and Bjerrekær 2016 data aggr.rds")%>%
d_controls%<>%score_items(c("Samme antal indvandrere","Færre indvandrere","Negativt","Positivt",T))
## No exact match: Det tidligere Jugoslavien
## Best fuzzy match found: Det tidligere Jugoslavien -> Jugoslavien with distance 14.00
#sort variables by name function