Peking University Summer Course: Regression Analysis (I)
Class 4: Inference in multiple regression
I. The Logic of Statistical Inference
The logic of statistical inference is simple: we would like to make inferences about a population from what we observe from the sample that has been drawn randomly from the population.
The sample's characteristics are called "point estimates."
It is almost certain that the sample's characteristics are somewhat different from the population's characteristics. But because the sample was drawn randomly from the population, the sample's characteristics cannot be "very different" from the population's characteristics. What do I mean by "very different"? To answer this question, we need a distance measure (or dispersion measure), called the standard deviation of the statistic (its standard error).
To summarize, statistical inferences consist of two steps:
(1) Point estimates (sample statistics)
(2) Standard deviations of the point estimates (dispersion of sample statistics in a hypothetical sampling distribution).
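The two steps can be illustrated with a small simulation. This is only a sketch under assumed numbers: the population (mean 50, SD 10), the sample size, and the number of repeated samples are all hypothetical, and sampling is done with replacement for simplicity, which matters little when the population is much larger than the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population (illustrative numbers, not from the lecture).
population = rng.normal(loc=50.0, scale=10.0, size=100_000)

n = 100            # fixed sample size
n_samples = 5_000  # number of hypothetical repeated samples

# Step 1: the point estimate (sample mean) from each repeated sample.
samples = rng.choice(population, size=(n_samples, n))
sample_means = samples.mean(axis=1)

# Step 2: the dispersion of the point estimate in the sampling distribution.
sd_of_mean = sample_means.std(ddof=1)
theoretical = population.std() / np.sqrt(n)

print(f"population mean         : {population.mean():.3f}")
print(f"average of sample means : {sample_means.mean():.3f}")
print(f"SD of sample means      : {sd_of_mean:.3f}")
print(f"sigma / sqrt(n)         : {theoretical:.3f}")
```

In practice we have only one sample, so the dispersion in step 2 must be estimated from that sample rather than observed across repeated draws.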
II. Review
For a sample of fixed size $n$, $y_i$ ($i = 1, \dots, n$) is the dependent variable; $X = (1, x_1, \dots, x_{p-1})$ contains the independent variables.
We can write the model in the following way:
$y = X\beta + \varepsilon$
Under certain assumptions, the LS estimator
$b = (X'X)^{-1} X'y$
has certain desirable properties:
A1 => unbiasedness and consistency
A1 + A2 => BLUE, with $V(b) = \sigma^2 (X'X)^{-1}$
A1 + A2 + A3 => BUE, with $V(b) = \sigma^2 (X'X)^{-1}$ (even for small samples)
III. The Central Limit Theorem
Statement: The distribution of the mean of iid random variables (each with mean $\mu$ and variance $\sigma^2$) approaches a normal distribution as the number of random variables increases. This property belongs to the statistic, that is, to the sample mean in the sampling distribution of all sample means, and it holds even if the random variables themselves are not normally distributed. You can never check this normality empirically because you have only one sample, and hence only one sample statistic.
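Although the normality of the sampling distribution cannot be checked with a single real sample, it can be seen in a simulation. The sketch below assumes an Exponential(1) distribution (chosen only because it is clearly non-normal; its mean and standard deviation are both 1) and reports the share of standardized sample means that fall within ±1.96, which should move toward 0.95 as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma = 1.0, 1.0   # mean and SD of an Exponential(1) variable
n_reps = 10_000        # number of hypothetical samples per sample size

def coverage(n):
    """Share of standardized sample means falling inside [-1.96, 1.96]."""
    draws = rng.exponential(scale=1.0, size=(n_reps, n))
    z = (draws.mean(axis=1) - mu) / (sigma / np.sqrt(n))
    return np.mean(np.abs(z) <= 1.96)

for n in (2, 10, 50, 500):
    print(f"n = {n:4d}: share within +/-1.96 = {coverage(n):.3f}")
```

The function name `coverage` and the chosen sample sizes are illustrative; any skewed distribution with finite variance would serve the same purpose.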
In regression analysis, we do not need to assume that $\varepsilon$ is normally distributed if we have a large sample, because the distributions of all estimated parameters approach normal distributions.
Why: all LS estimates are linear functions of $\varepsilon$ (proved last time). Recall a theorem: a linear transformation of a normally distributed variable is also normally distributed.
IV. Inferences about Regression Coefficients
A. Presentation of Regression Results
Common practice: give one star beside the parameter estimate for a significance level of 0.05, two stars for 0.01, and three stars for 0.001. For example:
Dependent Variable: Earnings
Independent Variable:
Father's education 0.900*
Mother's education 0.501***
Shoe size -2.16
What is the problem with this practice?
First, we want to have a quantitative evaluation of the significance level. We should not blindly rely on statistical tests. For example,
Father's education 0.900* (0.450)
Mother's education 0.501*** (0.001)
Shoe size -2.16 (1.10)
In this case, is father's education much more significant than shoe size? Not really. The ratios of coefficient to standard error are very similar (0.900/0.450 = 2.0 versus -2.16/1.10 ≈ -1.96). By contrast, mother's education is far more significant than the other two.
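The comparison can be checked by dividing each coefficient by its standard error. A minimal sketch, using only the numbers from the table above (the dictionary name `results` is just for illustration):

```python
# Coefficients and standard errors copied from the table above.
results = {
    "Father's education": (0.900, 0.450),
    "Mother's education": (0.501, 0.001),
    "Shoe size": (-2.16, 1.10),
}

# Ratio of each estimate to its standard error (test against zero).
ratios = {name: coef / se for name, (coef, se) in results.items()}

for name, ratio in ratios.items():
    print(f"{name:<20s} coef/SE = {ratio:8.2f}")
```

The star notation puts father's education (one star) and shoe size (no star) in different categories even though their ratios differ only in the second decimal place.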
A second practice is to report the t or z values:
Coeff. t
Father's education 0.900 2.0
Mother's education 0.501 501
Shoe size -2.16 -1.96
This solution is much better. However, very often our hypothesis is not about a deviation from zero but from some other hypothetical value. For example, we may be interested in the hypothesis that a one-year increase in father's education increases son's education by one year. The hypothesized value here is 1 instead of 0, and a t value computed against zero does not answer that question directly.
The preferred way of presentation is:
Coeff. (S.E.)
Father's education 0.900 (0.450)
Mother's education 0.501 (0.001)
Shoe size -2.16 (1.10)
B. Difference between Statistical Significance and the Size of an Effect
Statistical significance always refers to a stated hypothesis. You will see a lot of misuses in the literature, sometimes by well-known sociologists. They would say that this variable is highly significant and that one is not significant. This is not correct. I am not responsible for their mistakes, but I want to warn you not to commit the same mistakes. In our example, you could say that the coefficient of mother's education is highly significantly different from zero. But it is not significantly different from 0.5. Had your hypothesis been that the parameter for mother's education is 0.5, the result would be consistent with the hypothesis. That is, a statement of statistical significance should always be made with reference to a hypothesis.
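As a worked check for mother's education, using the reported estimate 0.501 and standard error 0.001 and assuming the usual large-sample normal approximation with a two-sided cutoff of 1.96:

```latex
z = \frac{b_j - \beta_j^0}{SE(b_j)}, \qquad
z_{\,\beta^0 = 0} = \frac{0.501 - 0}{0.001} = 501, \qquad
z_{\,\beta^0 = 0.5} = \frac{0.501 - 0.5}{0.001} = 1
```

The first value is far outside ±1.96, so the hypothesis of a zero coefficient is rejected; the second is well inside, so the hypothesis that the coefficient equals 0.5 is not rejected.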
Another common mistake is to equate statistical significance with the size of an effect. A coefficient can be statistically significantly different from zero even though its estimated value is small. In our example, the contribution of father's education to the dependent variable (0.900) is larger than that of mother's education (0.501), even though the coefficient of mother's education is far more significantly different from zero.
Important: you should look at both coefficients and their standard errors.
C. Confidence Intervals for Single Parameters
D. Hypothesis Testing for Single Parameters
Compute $z = (b_j - \beta_j^0) / SE(b_j)$, where $\beta_j^0$ is the hypothesized value. If $z$ is outside the range from -1.96 to 1.96, the hypothesis is rejected at the 0.05 level (two-sided). Otherwise, we fail to reject the hypothesis.
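The whole procedure can be sketched end to end on simulated data. Everything below is an assumption made for illustration: the data-generating values in `beta_true`, the error SD, the variable names, and the helper `z_test` are hypothetical, and the interval $b_j \pm 1.96 \cdot SE(b_j)$ is the usual large-sample 95% confidence interval rather than a formula taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data (illustrative only): an intercept plus two regressors.
n = 2_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
beta_true = np.array([1.0, 0.9, 0.0])      # hypothetical "true" parameters
y = X @ beta_true + rng.normal(scale=2.0, size=n)

# LS estimator b = (X'X)^{-1} X'y and its estimated variance s^2 (X'X)^{-1}.
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid = y - X @ b
s2 = resid @ resid / (n - X.shape[1])      # estimate of sigma^2
se = np.sqrt(np.diag(s2 * XtX_inv))

def z_test(j, beta0, crit=1.96):
    """Return (z, reject) for H0: beta_j = beta0, two-sided, large-sample."""
    z = (b[j] - beta0) / se[j]
    return z, abs(z) > crit

# Large-sample 95% confidence intervals: b_j +/- 1.96 * SE(b_j).
ci = np.column_stack([b - 1.96 * se, b + 1.96 * se])

for j, name in enumerate(["intercept", "x1", "x2"]):
    z0, rej0 = z_test(j, 0.0)
    print(f"{name:<9s} b = {b[j]:6.3f}  SE = {se[j]:.3f}  "
          f"z(H0: 0) = {z0:7.2f}  reject = {rej0}  "
          f"95% CI = [{ci[j, 0]:.3f}, {ci[j, 1]:.3f}]")

# The same coefficient can be tested against a non-zero hypothesized value.
z1, rej1 = z_test(1, 1.0)
print(f"x1 against H0: beta = 1 -> z = {z1:.2f}, reject = {rej1}")
```

A hypothesized value is rejected by the two-sided test at the 0.05 level exactly when it lies outside the 95% interval, which is why reporting coefficients with standard errors lets readers test any value they care about.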