Regression — Chapter 6
Wooldridge, Chapter 6
Adjusted R-squared:

$\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)} = 1 - (1 - R^2)\cdot\frac{n-1}{n-k-1}$

Degrees-of-freedom penalty factor: $\frac{n-1}{n-k-1}$
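As a quick numerical check of this formula (a minimal sketch; the numbers are illustrative, not from the text):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Adding a regressor always raises R^2, but the degrees-of-freedom
# penalty can still make adjusted R^2 fall:
print(adjusted_r2(0.40, n=50, k=3))   # with 3 regressors
print(adjusted_r2(0.41, n=50, k=4))   # higher R^2, one more regressor, lower adjusted R^2
```

This illustrates why adjusted R̄² can decrease even though R² never does.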
Adding a variable (or a group of variables) x to the equation never decreases R²; however, $\bar{R}^2$ increases only if the absolute t statistic of the added x exceeds 1 (F > 1 for a group).
III. Models with interaction terms

Grain yield = β₀ + β₁(fertilizer) + β₂(rainfall) + β₃(fertilizer × rainfall) + error term,

i.e. $y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i + u_i$, or

$E(y \mid x_i, z_i) = \beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i$
(B)

Replacing the x in equation (B) with $(y_t - y_{t-1})/y_{t-1}$, equation (A) can be approximated as

$\Delta \ln y_t \approx \frac{y_t - y_{t-1}}{y_{t-1}}$

That is, the change in logs approximates the variable's rate of change (relative change). The approximation is accurate when the change is small, but not when it is large. Exact rate of change: since $\ln y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + u$, if $x_1$ changes by one absolute unit with the other x held fixed, the exact proportional change in y is $\exp(\beta_1) - 1$.
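The gap between the approximate and exact rates of change can be seen numerically (a minimal sketch; the coefficient values are illustrative):

```python
import math

def exact_pct_change(beta1: float, dx: float = 1.0) -> float:
    # In a log-level model ln y = b0 + b1*x + ..., a dx-unit change in x
    # changes y by exactly 100*(exp(b1*dx) - 1) percent; 100*b1*dx is the
    # small-change approximation.
    return 100.0 * (math.exp(beta1 * dx) - 1.0)

for b1 in (0.05, 0.30):
    # approximate vs exact percentage change
    print(b1, 100 * b1, exact_pct_change(b1))
```

For β₁ = 0.05 the two agree closely (5% vs about 5.13%), while for β₁ = 0.30 the approximation is noticeably off (30% vs about 34.99%), matching the text's warning.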
The ordinary least squares coefficient $\hat{\beta}_j$ (j = 1, 2, …, k) measures the average change in the dependent variable when $x_j$ changes by one absolute unit, holding the other variables fixed; the standardized regression coefficient $\hat{b}_j$ measures, with the other variables held fixed, the change in the dependent variable in standard-deviation units when $x_j$ changes by one standard deviation.
Applied Regression Analysis — Chapter 6 Exercise Answers
Chapter 6: Multicollinearity and Its Treatment — Answers to Review Questions

6.1 Give an economic example that gives rise to multicollinearity.
Answer: Suppose someone builds a regional grain-output regression with grain output as the dependent variable Y, fertilizer use as X1, irrigated area as X2, and agricultural investment as X3. Because agricultural investment X3 is strongly correlated with fertilizer use X1 and irrigated area X2, the fitted equation will perform poorly. Similarly, when fitting an industry production function from firm-level data, capital, labor, funds, and energy supply are all related to firm size and are often highly correlated: they are all large for large firms and all small for small firms.
6.2 How does multicollinearity affect the estimation of regression parameters?
Answer: (1) under exact collinearity the parameter estimates do not exist; (2) under near collinearity the OLS estimates are inefficient; (3) the parameter estimates may lose their economic interpretation; (4) significance tests of the variables become meaningless; (5) the model's predictive power fails.
6.3 Can a regression equation with severe multicollinearity be used for economic forecasting?
Answer: The inflated variances of the parameter estimates widen interval forecasts, which can render prediction meaningless. However, as long as the correlation pattern among the regressors remains unchanged in the forecast period, a model containing severely collinear variables can still give good forecasts; otherwise, the forecasts will be seriously affected.
6.4 Is the occurrence of multicollinearity related to the sample size n and the number of regressors p?
Answer: Yes. Increasing the sample size cannot remove multicollinearity from the model, but it can mitigate its consequences. Multicollinearity arises more easily when p is large, so the regressors should be chosen sparingly.
6.5 Pose an economic problem of your own and build a multiple linear regression model. How should the variables and the design matrix X be chosen so as to avoid multicollinearity?
Answer: See the third lab exercise (a multiple regression model of airport throughput). With secondary (observational) data, multicollinearity is hard to avoid, so stepwise regression and principal-component regression are generally used to remove it. If you design your own experiment (for example, an orthogonal design) and collect the data yourself, choose the columns of the design matrix X (i.e., X1, X2, …, Xp) to be uncorrelated.
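The degree of collinearity can be diagnosed with the variance inflation factor. With two regressors, VIF = 1/(1 − r²) for both, where r is their correlation. A minimal sketch (the fertilizer/investment data below are fabricated for illustration only):

```python
def pearson_r(x, y):
    # Sample Pearson correlation coefficient.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two_regressors(x1, x2):
    # With exactly two regressors, VIF = 1 / (1 - r^2) for both of them.
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

fert = [10, 12, 15, 18, 20, 24]          # hypothetical fertilizer use
funds = [101, 119, 152, 178, 204, 239]   # hypothetical investment, nearly proportional
print(vif_two_regressors(fert, funds))   # far above the common threshold of 10
```

A VIF well above 10 is the usual warning sign that the 6.1-style collinearity described above is present.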
6.6 For the public-finance data of Chapter 5, Exercise 9, analyze the multicollinearity and drop variables on that basis.
Chapter 6 Multiple Regression Analysis:
6.1 Effects of data scaling on OLS statistics; 6.2 More on functional form; 6.3 More on goodness-of-fit and selection of regressors; 6.4 Prediction and residual analysis
6.1 Effects of data scaling on OLS statistics — 6.1.1 An example
We examine how changing the units of measurement of the dependent and independent variables affects the coefficients, standard errors, t statistics, F statistics, and confidence intervals.
For the model y = β₀ + β₁x + β₂x² + u, we cannot interpret β₁ as the change in y with respect to x. Instead, note that

$\frac{d\hat{y}}{dx} = \hat{\beta}_1 + 2\hat{\beta}_2 x$

The sign of $\hat{\beta}_1$ indicates whether x has a positive or negative effect on y (usually positive); the sign of $\hat{\beta}_2$ indicates whether the effect of x on y is increasing or diminishing (a diminishing marginal effect, usually negative).
Example: a wage model
Exercise 6.3
6.3 More on goodness-of-fit and selection of regressors
R² is a nondecreasing function of the number of explanatory variables in the model.
(1) bwght: the child's birth weight in ounces; cigs: the number of cigarettes smoked by the mother during pregnancy; faminc: annual family income in thousands of dollars.
(2) Measure birth weight in pounds rather than ounces: let bwghtlbs = bwght/16.
(3) Return the dependent variable to its original units and change the units of cigs: define packs as the number of packs smoked per day, so packs = cigs/20.
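Rescaling the dependent variable rescales every coefficient by the same factor (while t statistics and R² are unaffected). A minimal sketch with fabricated data (the cigs/bwght numbers are illustrative, not the textbook sample):

```python
def ols_simple(x, y):
    # Closed-form simple OLS: returns (intercept, slope).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1

cigs = [0, 5, 10, 20]
bwght = [120, 116, 113, 105]              # hypothetical birth weights in ounces
b0, b1 = ols_simple(cigs, bwght)
b0_lbs, b1_lbs = ols_simple(cigs, [w / 16 for w in bwght])   # bwghtlbs = bwght/16
# Dividing y by 16 divides the intercept and every slope by 16.
print(b0, b1, b0_lbs, b1_lbs)
```

This is exactly the bwghtlbs = bwght/16 transformation in (2): nothing substantive about the fit changes, only the scale of the reported coefficients.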
If both the level term and the quadratic term are negative, then an increase in x always has a negative effect on y. In either case there is no turning point (extremum).
Exercise 6.5
6.2.3 Models with interaction terms
Example with an interaction term:

price = β₀ + β₁sqrft + β₂bdrms + β₃sqrft·bdrms + β₄bthrms + u

The partial effect of bdrms on price is ∂price/∂bdrms = β₂ + β₃sqrft; β₂ by itself is the effect of bdrms on price only when sqrft = 0, which is not of direct interest.
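The partial effect β₂ + β₃·sqrft can be evaluated at interesting house sizes (a minimal sketch; the coefficient values are hypothetical, chosen only to illustrate the sign pattern):

```python
def price_partial_bdrms(beta2: float, beta3: float, sqrft: float) -> float:
    # In price = b0 + b1*sqrft + b2*bdrms + b3*sqrft*bdrms + b4*bthrms + u,
    # the partial effect of one more bedroom is b2 + b3*sqrft:
    # it depends on the size of the house.
    return beta2 + beta3 * sqrft

# Hypothetical coefficients: b2 = -20, b3 = 0.03.
for s in (1500, 2500):
    print(s, price_partial_bdrms(-20.0, 0.03, s))
```

With these illustrative numbers, an extra bedroom is worth more in a larger house, which is the usual motivation for including the interaction term.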
Regression Analysis (lecture slides)
1. Model fit: the fit reflects how well the model explains the data. The larger the adjusted coefficient of determination (adjusted R²), the stronger the model's explanatory power.
In Output 1, the fit — the model's explanatory power — is only moderate: the adjusted R² is 0.326.
2. Analysis of variance: the ANOVA table reflects the overall significance of the model; the model's test statistic (the F test) is generally judged against the 0.05 significance level.
Curve regression applies only when the model has a single independent variable and can be transformed into a linear form, and only 11 fixed curve functions are available. Real problems are more complex, and curve regression then cannot produce an accurate analysis; this is where nonlinear regression is needed. It is a more powerful method for nonlinear problems: the user can specify a function of arbitrary form, describing the relationship between the variables more accurately.
• Regression analysis seeks the relationship among associated (correlated) variables: a mathematical procedure that quantitatively describes the correlation between variables through a mathematical expression.
• Main content:
1. From a set of sample data, determine the quantitative relationship among the variables;
2. Subject the fitted relationship to statistical tests of its credibility;
3. Among the many variables affecting a given variable, judge which effects are significant and which are not;
4. Use the fitted relationship for prediction and control.
In Output 3, the constant is 3.601 with t = 24.205 and significance 0.000; the coefficient on inflation is 0.157 with t = 2.315 and significance 0.049. Both results are therefore significant.
Conclusion:
Simple linear regression equation: y = a + bx
The final model is: R (unemployment rate) = 3.601 + 0.157 × I (inflation rate).
This means that for each one-point increase in the inflation rate, the unemployment rate rises by 0.157 points.
Compare the p value (Sig.) with 0.05: if it is below 0.05, the result is significant.
Peking University Summer Course, Linear Regression Analysis — Lecture Notes PKU6
Class 6: Auxiliary regression and partial regression plots.

More independent variables?

I. Consequences of Including Irrelevant Independent Variables

What are the consequences of including irrelevant independent variables? In other words, should we always include as many independent variables as possible? The answer is no. You should always have good reasons for including your independent variables. Do not include irrelevant independent variables. There are four reasons:

A. Missing theoretically interesting findings
B. Violating the parsimony rule (Occam's razor)
C. Wasting degrees of freedom
D. Making estimates imprecise (e.g., through collinearity)

Conclusion: inclusion of irrelevant variables reduces the precision of estimation.

II. Consequences of Omitting Relevant Independent Variables

Say the true model is the following:

y_i = β₀ + β₁x_{1i} + β₂x_{2i} + β₃x_{3i} + ε_i.

But for some reason we only collect or consider data on y, x₁ and x₂; therefore, we omit x₃ in our regression model. The short story is that we are likely to have a bias due to the omission of a relevant variable, even though our primary interest is to estimate the effect of x₁ or x₂ on y.

An example. For a group of Chinese youths between ages 20-30:

y = earnings; x₁ = education; x₂ = party member status; x₃ = age.

If we ignore age, the effects of education and party member status are likely to be biased,
(1) because party members are likely to be older than non-party members, and older people earn more than the young;
(2) because in this age interval older people are likely to have more education, and older people on average earn more than young people.

But why? We will give a formal presentation of this problem.

III. Empirical Example of Incremental R-Squares

Xie and Wu's (2008, China Quarterly) study of earnings inequality in three Chinese cities: Shanghai, Wuhan, and Xi'an in 1999.
See the following tables:

Table 1: Percent Variance Explained in Logged Earnings

Variables                        DF   R²         ΔR²(1)     ΔR²(2)
City                             2    17.47***   18.11***   19.12***
Education Level                  5    7.82***    5.49***    4.46***
Experience + Experience²         2    0.23       0.17       0.05
Gender                           1    4.78***    4.84***    3.05***
Cadre Status                     1    3.08***    2.27***    0.63***
Sector                           3    3.54***    2.18***    1.80***
Danwei Profitability (linear)    1    12.52***   9.30***
Danwei Profitability (dummies)   4               12.89***

N = 1771
Note: DF refers to degrees of freedom. ΔR²(1) refers to the incremental R² after the inclusion of Danwei's financial situation (linear). ΔR²(2) refers to the incremental R² after the inclusion of all the other variables.
*** p < 0.001, ** p < 0.01, * p < 0.05, based on F-tests.
Source: 1999 Three-City Survey.

Table 2: Estimated Regression Coefficients on Logged Earnings

                                        Observed Effects      Adjusted Effects
Variables                               β          SE(β)      β          SE(β)
City (Shanghai = excluded)
  Wuhan                                 -0.465***  0.033      -0.539***  0.028
  Xi'an                                 -0.628***  0.034      -0.658***  0.028
  Constant                              9.402***   0.024
Education Level (no schooling = excluded)
  Primary                               0.536*     0.216      0.414*     0.170
  Junior high                           0.737***   0.202      0.447**    0.161
  Senior high                           0.770***   0.201      0.592***   0.161
  Junior college                        1.049***   0.203      0.778***   0.162
  College                               1.253***   0.207      0.923***   0.166
  Constant                              8.120***   0.210
Experience + Experience²
  Experience (×1000)                    -11.235    6.029      2.421      4.775
  Experience² (×1000)                   0.288*     0.144      -0.017     0.114
  Constant                              9.113***   0.059
Gender (male = excluded)
  Female                                -0.276***  0.029      -0.225***  0.023
  Constant                              9.144***   0.019
Cadre Status (non-cadre = excluded)
  Cadre                                 0.375***   0.050      0.185***   0.042
  Constant                              8.992***   0.015
Sector (government + public = excluded)
  State owned                           -0.133***  0.037      -0.043     0.030
  Collectively owned                    -0.397***  0.057      -0.224***  0.045
  Privately owned                       0.027      0.047      0.114**    0.037
  Constant                              9.129***   0.031
Danwei Profitability (linear)           0.256***   0.016      0.227***   0.013
  Constant                              8.270***   0.050
Danwei Profitability (dummies) (very poor = excluded)
  Relatively poor                                             0.100      0.062
  Average                                                     0.405***   0.054
  Fairly good                                                 0.702***   0.059
  Very good                                                   0.918***   0.108
  Constant                                                    8.624***   0.050
Constant                                8.237***   0.171
R² (N = 1771)                           43.92%

Note: Observed effects on logged earnings are derived from bivariate models. Adjusted effects are derived from a multivariate model including all variables.
*** p < 0.001, ** p < 0.01, * p < 0.05, based on two-sided t-tests.

IV. Auxiliary Regressions

True regression:
(1) y_i = β₀ + β₁x_{1i} + ⋯ + β_{p-2}x_{(p-2)i} + β_{p-1}x_{(p-1)i} + ε_i

Without loss of generality, say that x_{p-1} is omitted:
(2) y_i = α₀ + α₁x_{1i} + ⋯ + α_{p-2}x_{(p-2)i} + δ_i

We can have an auxiliary regression to express the missing variable x_{p-1}:
(3) x_{(p-1)i} = τ₀ + τ₁x_{1i} + ⋯ + τ_{p-2}x_{(p-2)i} + μ_i

Substitute (3) into (1):
(4) y_i = (β₀ + β_{p-1}τ₀) + (β₁ + β_{p-1}τ₁)x_{1i} + ⋯ + (β_{p-2} + β_{p-1}τ_{p-2})x_{(p-2)i} + (ε_i + β_{p-1}μ_i)

Now you can see where the biases are: α_k = β_k + β_{p-1}τ_k, where β_{p-1}τ_k is called the bias. The bias is the product of the effect of the omitted variable on the dependent variable (β_{p-1}) and the effect of the independent variable of interest on the omitted variable (τ_k). Thus, there are two conditions for the existence of omitted variable bias:
(1) Relevance condition: the omitted variable is relevant, that is, β_{p-1} ≠ 0;
(2) Correlation condition: the omitted variable is related to the variable of interest.

[Blackboard: path diagram]

How does an experimental design help eliminate omitted-variable bias? (Through eliminating the correlation condition.) Short of experiments, we need to include all relevant variables in order to avoid omitted variable bias.

V. Criteria for Variable Inclusion and Exclusion

A. Theoretical reasoning for inclusion.
B. F-tests to exclude irrelevant variables: a review. Do not exclude variables on the basis of F-tests alone; both theoretical reasoning and F-tests should be used.

VI.
Partial Regression Estimation

True regression:
y_i = β₀ + β₁x_{1i} + ⋯ + β_{p-1}x_{(p-1)i} + ε_i, or in matrix form, y = Xβ + ε.

This model can always be written as:
(1) y = X₁β₁ + X₂β₂ + ε

where X = [X₁, X₂] and β = [β₁′, β₂′]′; X₁ and X₂ are matrices of dimensions n×p₁ and n×p₂ (p₁ + p₂ = p), and β₁ and β₂ are parameter vectors of dimensions p₁ and p₂.

We first want to prove that regression equation (1) is equivalent to the following procedure:
(1) Regress y on X₁, obtain residuals y*;
(2) Regress X₂ on X₁, obtain residuals X₂*;
(3) Then regress y* on X₂*, and obtain the correct least squares estimates of β₂ (= b₂), the same as those from the one-step method.

This is an alternative way to estimate β₂. Note that we do not obtain estimates of β₁ in this way.

Proof: let us call H₁ = X₁(X₁′X₁)⁻¹X₁′ (the hat matrix based on X₁), and pre-multiply (1) by it:
(2) H₁y = X₁(X₁′X₁)⁻¹X₁′X₁β₁ + H₁X₂β₂ + H₁ε
(the last term is zero by assumption), thus
(3) H₁y = X₁β₁ + H₁X₂β₂

Now (1) − (3):
(I − H₁)y = 0 + (I − H₁)X₂β₂ + ε
(4) y* = X₂*β₂ + ε.

Therefore, the estimation of β₂ from (4) is identical to the estimation of β₂ from (1). This is always true: three-step estimation of partial regression coefficients. Note that the same residual term ε appears in (4) as in (1).

Interpretation:
(1) purge y of linear combinations of X₁ ⇒ y*;
(2) purge X₂ of linear combinations of X₁ ⇒ X₂*;
(3) regress y* on X₂*, noting that both y* and X₂* are purged of the confounding effects of X₁.

Note: X₂ in general can be a matrix, i.e., contain more than one independent variable. Regress one column of X₂ at a time, until all columns are regressed on X₁.

Note that the degrees of freedom for MSE from the last step are not correct. This is the only thing that needs to be adjusted manually: DF for MSE = n − p (instead of n − p₂). This is an important result. It was on my preliminary exam. Unfortunately, it is often neglected (even in textbooks).

VII.
Partial Regression Plots (Added-Variable Plots)

We now focus on one (and only one) particular independent variable and wish to see its effect while controlling for the other independent variables.

A. Based on "partial regression estimation" with 3 steps. We divide X into two parts: X₋ₖ + xₖ.
1. Regress y on X₋ₖ, obtain residuals called y*;
2. Regress xₖ on X₋ₖ, obtain residuals called xₖ*;
3. Regressing y* on xₖ* gives the true partial regression coefficient of xₖ.

Definition: a plot of y* against xₖ* is called a partial regression plot or an added-variable plot.

Question: what differs between the three-stage partial regression method and a one-step multiple regression?
Point estimates: the same. Residuals: the same. MSE: different, because it is necessary to adjust the degrees of freedom (n − p instead of n − 1). Therefore, you cannot use the computer output directly for any statistical inference if you do a 3-step partial regression estimation.

B. Sum of Squares from the Partial Regression Plots

VIII. Illustration with an example:

Model  Description           SSE    DF SSE
1      y on 1, x1, x2        SSE1   n−3
2      y on 1, x1 ⇒ y*       SSE2   n−2
3      x2 on 1, x1 ⇒ x2*     SSE3   n−2
4      y* on x2*             SSE4   n−3

Note: SSE4 = SSE1; DF SSE4 = DF SSE1.

For example, if we only know Models 2 and 4, can we test the hypothesis that x2 has no effect after x1 is controlled? Yes. We can nest Models 2 and 4:
F(1, n−3) = (SSE2 − SSE4) / (SSE4/(n−3)).

C. Examples
y = income; x1 = sex; x2 = ability; n = 100.

Model  Description           SSE    DF SSE   R²       SST
1      y on 1, x1, x2        30     [97]     .70      SST = 100
2      y on 1, x1 ⇒ y*       60     [98]     [.40]
3      x2 on 1, x1 ⇒ x2*     4000   [98]     0.60     SST = 10000
4a     y* on x2*             30     [99]     [0.50]   (reported)
4b     y* on x2*             30     [97]     [0.70]   (important, meaningful)

Test the hypothesis that x2 has no partial effect after controlling for x1, even if we do not run Model 1. Nesting Models 2 and 4b:
F = (60 − 30)/(30/97) = 30/0.31 = 96.77, significant.
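The equivalence between the three-step partial regression and one-step multiple regression can be verified numerically (a minimal sketch; the data are fabricated for illustration):

```python
def ols(X, y):
    # Solve the normal equations (X'X) b = X'y by Gaussian elimination.
    # X is a list of rows; returns the coefficient vector b.
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    c = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    for i in range(p):                      # forward elimination with pivoting
        piv = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[piv] = A[piv], A[i]
        c[i], c[piv] = c[piv], c[i]
        for k in range(i + 1, p):
            f = A[k][i] / A[i][i]
            for j in range(i, p):
                A[k][j] -= f * A[i][j]
            c[k] -= f * c[i]
    b = [0.0] * p
    for i in range(p - 1, -1, -1):          # back substitution
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, p))) / A[i][i]
    return b

def resid(X, y):
    b = ols(X, y)
    return [yi - sum(bi * xi for bi, xi in zip(b, r)) for r, yi in zip(X, y)]

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 7.8, 11.0, 11.9]

one_step = ols([[1.0, a, b] for a, b in zip(x1, x2)], y)  # b2 = one_step[2]
y_star = resid([[1.0, a] for a in x1], y)                 # purge y of x1
x2_star = resid([[1.0, a] for a in x1], x2)               # purge x2 of x1
b2_partial = ols([[v] for v in x2_star], y_star)[0]       # residuals have mean 0
print(one_step[2], b2_partial)                            # identical up to rounding
```

The coefficient on x2* from step 3 matches the one-step multiple regression coefficient, exactly as the proof in section VI claims (only the MSE degrees of freedom would need manual adjustment).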
Regression Analysis — Study Slides
To find the optimal parameter combination, a grid search can be used to search the parameter space exhaustively (or randomly), selecting the best parameters by comparing predictive performance across the different combinations.
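The grid-search idea can be sketched as follows (a minimal sketch; the scoring function here is a toy stand-in for cross-validated predictive performance):

```python
import itertools

def grid_search(fit_score, grid):
    """Exhaustively score every parameter combination and keep the best.

    fit_score: maps a parameter dict to a validation score (higher = better).
    grid:      dict of parameter name -> list of candidate values.
    """
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = fit_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for cross-validated fit quality:
score = lambda p: -(p["a"] - 2) ** 2 - (p["b"] - 0.5) ** 2
print(grid_search(score, {"a": [0, 1, 2, 3], "b": [0.0, 0.5, 1.0]}))
```

In practice the scoring function would refit the nonlinear regression at each parameter combination and return a validation metric.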
Hypothesis Testing and Evaluation of Nonlinear Regression Models
Hypothesis testing: like linear regression models, nonlinear regression models also require hypothesis tests of whether the model satisfies certain statistical assumptions, such as independence of the error terms and homoskedasticity.
3. Maximum likelihood: estimates the parameters by maximizing the likelihood function; parameter estimation and model selection can be carried out simultaneously.
Hypothesis Testing and Evaluation of Multiple Regression Models
- Linear hypothesis test: tests whether the model's linear relationship holds, usually with an F test or t test.
- Heteroskedasticity test: tests the regression residuals for heteroskedasticity; common methods include graphical inspection, the White test, and the Goldfeld-Quandt test.
- Multicollinearity test: tests for multicollinearity among the regressors; common tools include the VIF and condition indices.
- Model evaluation metrics: include R², adjusted R², AIC, and BIC, used to assess the model's goodness of fit and predictive ability.
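AIC and BIC can be computed directly from the residual sum of squares under a Gaussian likelihood (a minimal sketch; this is one common convention, and software packages differ by additive constants):

```python
import math

def aic_bic(rss: float, n: int, k: int):
    # Gaussian-likelihood information criteria (up to additive constants):
    #   AIC = n * ln(RSS/n) + 2k
    #   BIC = n * ln(RSS/n) + k * ln(n)
    ll_term = n * math.log(rss / n)
    return ll_term + 2 * k, ll_term + k * math.log(n)

# Lower is better; BIC penalizes extra parameters more heavily than AIC
# whenever ln(n) > 2, i.e. n > e^2 (about 7.4).
print(aic_bic(rss=120.0, n=50, k=3))
```

Comparing the two criteria across candidate models is a standard way to trade fit against model complexity.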
Practical Applications of Regression Analysis
Case 1: stock price prediction
Summary: build a regression model from historical data to predict future stock price movements.
Details: using historical stock market data such as opening price, closing price, and trading volume, build a model by regression analysis to predict the future movement of stock prices.
2. Nonlinear regression models: describe a nonlinear relationship between the dependent and independent variables, accommodated through transformations or other methods.
3. Mixed-effects regression models: include both fixed effects and random effects; suitable for panel data or repeated-measures data.
Parameter Estimation for Multiple Regression Models
1. Least squares: estimates the parameters by minimizing the residual sum of squares; the most commonly used estimation method.
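For the simple (one-regressor) case, the least-squares minimizer has a closed form, and the fitted residuals sum to zero (a minimal sketch; the data are fabricated for illustration):

```python
def ols_fit(x, y):
    # Closed-form least squares for y = b0 + b1*x: the (b0, b1) pair
    # that minimizes the residual sum of squares.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.1, 8.0, 9.9]
b0, b1 = ols_fit(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(b0, b1, sum(resid))   # residuals sum to (numerically) zero
```

The zero-sum residual property is a direct consequence of including the intercept in the minimization.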
2. Weighted least squares: suited to heteroskedastic data; assigns different weights to different observations to adjust the estimates.
Chapter 6 Regression Analysis
[Table 6.0, continued: remaining rows of the survey data (columns x2, x3, x4, and gender).]
Statistics
Edited by Fei Yu and Shi Lei, Higher Education Press
2020/7/27
Statistics, Chapter 3: Parameter Estimation
Chapter 6 Regression Analysis
6.1 Correlation analysis; 6.2 Simple linear regression; 6.3 Multiple linear regression; 6.4 Dummy-variable regression; 6.5 Logistic regression; 6.6 Extensions of regression analysis; 6.7 Nonlinear regression that can be linearized
Table 6.0: Data from a sample survey of 36 individuals. [The table is laid out in two blocks of columns, each listing y, x1, x2, x3, x4, and gender; the numeric block that follows gives the y and x1 values (in yuan) for the sampled employees.]
Example 6.0 (data file: example 6.0). A company manager wants to study the annual salaries of the company's employees. Based on a preliminary analysis, he believes an employee's current annual salary y (yuan) is related to the starting salary x1 (yuan), time with the company x2 (months), prior work experience x3 (months), and years of education x4. He randomly sampled 36 employees and collected the data above.
Logistic Regression Exercise — Answers (3 pages)
Chapter 6: Logistic Regression Exercises (hands-on part: partial reference answers)

1. The data for the questions below come from "ch6-logistic_exercise"; they contain respondents' demographic characteristics, labor-market characteristics, and migration status. The variables are defined as follows:

Variable     Definition
age          age, continuous
degree       education: 1 = no schooling; 2 = primary; 3 = junior high; 4 = senior high; 5 = junior college; 6 = university; 7 = graduate
girl         gender: 1 = female; 0 = male
hanzu        ethnicity: 1 = Han; 0 = ethnic minority
hetong       labor contract: 1 = fixed contract; 2 = non-fixed contract; 3 = no contract
income       monthly income
ldhour       weekly working hours
married      marital status: 1 = married; 0 = other (never married, divorced, remarried, widowed, etc.)
migtype4     migration status: 1 = local resident; 2 = urban-to-urban migrant; 3 = rural-to-urban migrant
pid          ID
ss_jobloss   unemployment insurance: 1 = yes; 0 = no
ss_yanglao   pension insurance: 1 = yes; 0 = no

The research question is whether migrants differ significantly from local residents in social security, labor protection, and housing conditions. Migrants are divided into urban-to-urban migrants (those with urban hukou who have been away from their place of registration for more than six months) and rural-to-urban migrants (those with rural hukou who have been away for more than six months). The sample therefore contains three groups: local residents, urban-to-urban migrants, and rural-to-urban migrants, with their corresponding characteristics.

Notes: (1) some data processing is required before the research question can be answered correctly; (2) treat missing values of hetong as a separate category; (3) collapse degree into four categories: <= primary, junior high, senior high, > senior high.

. use "D:\course\integration of theory andmethod\8_ordered\chapter8-logistic_exercise.dta", clear
* Recode the three social security variables
. gen ss_jobl=ss_jobloss==1
. gen ss_ylao=ss_yanglao==1
. gen ss_yili=ss_yiliao==1
* Recode education
. recode degree (1/2=1) (3=2) (4=3) (5/7=4)
* Treat missing labor-contract values as a category
. recode hetong (.=4)

Based on these data, complete the following exercises and report the results as odds ratios. First, use a binary logistic model to examine migrants' access to social security.
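The odds ratio that the exercise asks for has a simple closed form in the two-group, binary-outcome case, and its log equals the slope a one-dummy logistic regression would report (a minimal sketch; the counts below are fabricated, not from the exercise data):

```python
import math

def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio from a 2x2 table:
               outcome=1   outcome=0
    group 1        a           b
    group 0        c           d
    """
    return (a / b) / (c / d)

# Hypothetical counts: covered / not covered by unemployment insurance,
# migrants (group 1) vs. local residents (group 0).
orr = odds_ratio(30, 70, 60, 40)
print(orr, math.log(orr))   # log odds ratio = the logistic slope coefficient
```

An odds ratio below 1 (as in this fabricated example) would indicate lower insurance coverage odds for the migrant group.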
B6: Summary of Applying or Creating Regression Models
A regression model is a statistical method for establishing relationships between variables. This document summarizes the application and creation of the B6 regression model.
Applications: the B6 regression model is commonly used to analyze relationships among two or more continuous variables. It can be used to predict and explain numerical relationships between variables and to forecast future observations. In practice it can be applied to problems in many fields: in finance, for example, to predict stock price changes from basic financial indicators; in medicine, to analyze the relationship between drug dosage and treatment effect.
Steps for creating a regression model:
1. Data collection: first, collect the required data, through field surveys, laboratory experiments, or existing datasets.
2. Data preparation: before building the model, clean and prepare the data, including handling missing values, outliers, and data transformations.
3. Variable selection: next, choose the independent and dependent variables. The independent variables are those used to predict the dependent variable, which is the variable to be predicted or explained.
4. Model fitting: choose an appropriate regression method and model type; common methods include linear regression, polynomial regression, and logistic regression.
5. Model evaluation: once the model is fitted, assess its performance using evaluation metrics such as R-squared and root mean squared error.
6. Prediction and interpretation: finally, use the fitted model for prediction and interpretation; given values of the independent variables, predict the dependent variable and interpret the result.
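The six steps can be sketched end to end on simulated data (a minimal sketch; the data-generating process and numbers are fabricated for illustration):

```python
import random

random.seed(0)

# Steps 1-2: collect and prepare data (simulated here).
x = [i / 10 for i in range(50)]
y = [1.0 + 2.0 * xi + random.gauss(0, 0.5) for xi in x]

# Steps 3-4: choose variables and fit a simple linear model by least squares.
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

# Step 5: evaluate with R-squared.
fitted = [b0 + b1 * xi for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - my) ** 2 for yi in y)
r2 = 1 - sse / sst

# Step 6: predict for a new observation.
print(round(b0, 2), round(b1, 2), round(r2, 3), b0 + b1 * 6.0)
```

The recovered slope should be close to the true value of 2 used to simulate the data, with R² well above the 0.326 seen in the moderate-fit example earlier.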
Caveats. When creating and applying regression models in practice, note the following:
- Data quality: ensure the data are accurate and complete, to avoid distorting the regression results.
- Model assumptions: regression models carry important assumptions, such as linearity, independent errors, and normally distributed errors; when building and interpreting a model, make sure these assumptions are satisfied.
- Model choice: choose a model type and method suited to the nature of the data and the demands of the problem; different regression models suit different kinds of variables and questions.
This concludes the summary of applying and creating the B6 regression model.
Methadone
[Figure: studentized residuals plotted against age for the methadone data; ages range from about 20 to 55 and the residuals lie between -2 and 2.]
• Comment: There is no obvious pattern and so a linear model in age seems fine.
6.4 Leverage plots
• It can be shown that Σᵢ₌₁ⁿ hᵢᵢ = p + 1, so the average leverage is (p + 1)/n.
• Hence we use a cut-off of 2(p + 1)/n to pick out high-leverage points.
• For the methadone example:
6.6 Influential points
• For the multivariate case, Cook's distance is defined in an analogous way, i.e.

D_i = (1/(p + 1)) · r_i*² · h_ii/(1 − h_ii),

where the studentized residual is

r_i* = e_i / √(s²(1 − h_ii)).   (6.2)
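Leverages, studentized residuals, and Cook's distances can be computed by hand in the simple-regression case, to which the multivariate formulas reduce (a minimal sketch; the data are fabricated, with one deliberately high-leverage point):

```python
def leverage_cooks(x, y):
    # Simple-regression case of the leverage, studentized residual (6.2),
    # and Cook's distance formulas:
    #   h_i = 1/n + (x_i - xbar)^2 / Sxx
    #   r_i = e_i / sqrt(s^2 * (1 - h_i))
    #   D_i = r_i^2 / (p + 1) * h_i / (1 - h_i)
    n, p = len(x), 1
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(ei ** 2 for ei in e) / (n - p - 1)
    h = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
    r = [ei / (s2 * (1 - hi)) ** 0.5 for ei, hi in zip(e, h)]
    d = [ri ** 2 / (p + 1) * hi / (1 - hi) for ri, hi in zip(r, h)]
    return h, r, d

x = [1, 2, 3, 4, 10]          # the last point sits far from the others
y = [2.0, 4.1, 5.9, 8.2, 25.0]
h, r, d = leverage_cooks(x, y)
print(max(h), h.index(max(h)))   # the x = 10 point has the largest leverage
```

Note that the leverages sum to p + 1 = 2, as stated in section 6.4.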
6.3 Residual plots
• We use residual plots in exactly the same way as before, except that we plot the studentized residuals against the fitted values and against each explanatory variable.
• We can pick up non-constant variance from the plot against the fitted values, and also any possible outliers, remembering that 95% of observations should lie in the range [-2, 2].
• Any curvature when plotted against an explanatory variable would suggest including a quadratic term in that variable.
• For the bivariate methadone example, we use:
• >plot(fitted.values(fitbi),rstandard(fitbi),xlab="Fitted values",ylab="Studentised residuals",b=1.4)
[Figure: studentized residuals plotted against the fitted values (roughly 5.5 to 7.0), with the residuals lying between -2 and 2.]
• Comment: There is no obvious pattern in the plot, suggesting the constant variance assumption is OK. All the studentized residuals are in the range [-2, 2] and thus there are no obvious outliers.
• Both models should be reported in any report.
6.7 Further example
• Let us consider the ash data in females again and add in weight at 12 as a further explanatory variable.

> fitashbi=lm(ht32~ht12+wt12, data=ash30f)
> summary(fitashbi)
Coefficients:
              Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)   0.7014039   0.1165922   6.016    2.49e-08 ***
ht12          0.6994202   0.0927759   7.539    1.56e-11 ***
wt12         -0.0026917   0.0008347  -3.225    0.00167 **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.05123 on 108 degrees of freedom
Multiple R-squared: 0.3808, Adjusted R-squared: 0.3693
F-statistic: 33.21 on 2 and 108 DF, p-value: 5.75e-12

• Comment: We see that weight at 12 has a negative coefficient, implying that a high weight at 12 reduces height at 32.
6.2 Studentized residuals
• As with simple linear regression, the residuals have different variances depending on the values of the explanatory variables, i.e.

Var(e_i) = (1 − h_ii) σ².

• As mentioned in chapter 5, we can estimate σ² by:
Chapter 6
Multivariate model validation
• We validate the multivariate model using essentially the same tools and techniques that we used for the simple linear regression model.
• 6.1 Residuals and Leverage
• The residual vector is e = y − ŷ. This has variance given by Var(e) = (I − H) σ².
• This time we look for large h_ii values (the leverages).
• Usually values of h_ii > 2(p + 1)/n are regarded as indicating points of high leverage.
6.5 Normal plots
• We can check the Normality of the studentised residuals by using the same commands in R as before.
• >qqnorm(rstandard(fitbi), ylab="studentized residuals",b=1.4)
[Figure: studentized residuals plotted against Methadone dose (roughly 2 to 10), with the residuals lying between -2 and 2.]
• Comment: There is no obvious curvature in the plot, suggesting a linear model with Methadone, but there is a suggestion that the observations at higher Methadone values are rather more variable.
s² = SSE/(n − p − 1) = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − p − 1)   (6.1)
• Thus as before we standardise and use the studentized residuals r_i*, as defined in (6.2).
• >plot(hatvalues(fitbi), ylab="leverages", b=1.4)
• >abline(h=6/17)
[Figure: index plot of the leverage values, ranging from about 0.10 to 0.45, with the cut-off line at 6/17 drawn by abline.]
Let us see the effect of removing point 2.
>QTC1=QTC[-2]
>Methadone1=Methadone[-2]
>age1=age[-2]
>fitbi1=lm(QTC1~Methadone1+age1)
>summary(fitbi1)