Classical True Score Measurement Theory (CTS)
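Measurement theories and methods that take true score theory as their core assumption are collectively called classical test theory (CTT), also known as true score theory (CTS).

The true score (True Score) is the actual value of the person being measured on the trait in question (e.g., ability, knowledge, personality).

The value (reading) obtained directly from a measurement instrument, such as a test scale or a measuring device, is called the observed value or observed score (Observed Score).

Because measurement error exists, the observed value does not equal the true value of the trait: the observed score contains both a true score and an error score (Error Score).

To estimate the true score, the measurement error must therefore be separated from the observed score.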

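True Score Theory
Three assumptions and two corollaries
True score theory, assumption (1):
The true score is invariant
In essence, this assumption means that the trait the true score refers to must be stable to some degree: at least within the scope of the question under discussion, or within a given period of time, the individual's trait is a constant.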

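True score theory, assumption (2):
The true error is completely random
[Axiom 1]: Measurement error is a normally distributed random variable with a mean of zero.

Across repeated measurements, the errors are sometimes positive and sometimes negative.

If the measurement error is positive, the observed score is higher than the actual score (the true score); if it is negative, the observed score is lower than the actual score. In other words, the observed score fluctuates around the true score.

However, provided the measurement is repeated enough times, the positive and negative deviations cancel out and the mean measurement error converges to zero.

Expressed mathematically: E(E) = 0, where the outer E denotes the expected value and the inner E the error score.

[Axiom 2]: The error score is independent of the trait being measured, that is, of the true score.

Moreover, measurement errors are independent of one another and of variables other than the trait being measured.

Put differently, the correlations among them are zero [Note: if such interactions are admitted, they can only be explained and computed with generalizability theory (GT)].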

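True score theory, assumption (3):
The observed score is the sum of the true score and the error score
X = T + E
[Meaning]: The relationship between the observed score and the true score is linear and of no other form.

The two differ by exactly the error score.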

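True score theory, corollary (1)
The true score equals the mean (expected value) of the observed scores: T = E(X)
(Gulliksen, 1950)
[Meaning]: If a person's psychological trait could be measured repeatedly with parallel tests a sufficient number of times, the mean of the observed scores would approach the true score.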
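Corollary (1) can be illustrated with a small simulation. The sketch below assumes X = T + E with a normally distributed error of mean zero, as in axiom 1. The true score of 70, the error standard deviation of 5, and the function name `simulate_observed_scores` are hypothetical values chosen for illustration only.

```python
import random
import statistics

def simulate_observed_scores(true_score, error_sd, n_measurements, seed=0):
    """Draw observed scores X = T + E, with E ~ Normal(0, error_sd)."""
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_measurements)]

true_score = 70.0  # hypothetical true score T
error_sd = 5.0     # hypothetical standard deviation of the error score E

# The mean of the observed scores drifts toward T as the number of
# (idealized, parallel) measurements grows.
for n in (10, 1000, 100000):
    scores = simulate_observed_scores(true_score, error_sd, n)
    print(n, round(statistics.fmean(scores), 3))
```

With few measurements the mean can sit noticeably away from T; with many it settles close to it, which is what the corollary asserts.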

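True score theory, corollary (2)
In a set of measured scores, the variance of the observed scores equals the sum of the variance of the true scores and the variance of the error scores.

$S_X^2 = S_T^2 + S_E^2$
[Note]: The error variance here is the variance of random error. The variance of systematic error is contained in the true-score variance, which can be understood as:
True-score variance = variance relevant to the purpose of measurement + systematic variance irrelevant to the purpose of measurement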
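The additivity in corollary (2) follows from assumption (3) together with axiom 2. The short derivation below is a sketch that uses the population operators Var and Cov, which are notation introduced here rather than taken from the source:

```latex
\begin{aligned}
\operatorname{Var}(X) &= \operatorname{Var}(T + E) \\
                      &= \operatorname{Var}(T) + \operatorname{Var}(E) + 2\operatorname{Cov}(T, E) \\
                      &= \operatorname{Var}(T) + \operatorname{Var}(E)
                         \qquad \text{since } \operatorname{Cov}(T, E) = 0 .
\end{aligned}
```

The cross term vanishes only because the error score is assumed to be uncorrelated with the true score; without that assumption the decomposition does not hold.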
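Classical measurement theory builds its theoretical edifice on the foundation of these true score assumptions; its basic concepts include reliability, validity, item analysis, norms, and standardization.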

Measurement Error
Measurement error (or error variance) is a term that describes the variance in scores on a test that is not directly related to the purpose of the test.
The performances of students on any test will tend to vary from each other, but their performances can vary for a variety of reasons.
These variables fall into two general sources of variance:
(a) those creating variance related to the purpose of the test (called meaningful variance), and
(b) those generating variance due to other extraneous sources (called measurement error, or error variance).
To minimize the undesirable variance in students’ scores that is unrelated to the purpose of the test, test developers must use the following tables as carefully as possible.
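To ensure valid sampling, one generally first selects a valid ability sample a from the target ability A and then identifies the behaviors b that represent this ability sample a; these behaviors should then be a valid sample of the full set of target behaviors B.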

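• Assumed proposition (1): B → A
• Assumed proposition (2): a → A
• Assumed proposition (3): b → a
• Derived proposition (4): b → B
  - Once the assumed relations (1), (2), and (3) are established, the relation between b and B follows.

• Derived proposition (5): b → A
  - The target ability is inferred from the sampled behaviors that are tested.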

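• Does the test end here?
• Language measurement quantifies the attributes of language behavior;
• so the measurement of the language behavior sample b must ultimately be expressed as a score or grade, that is, as the measurement result feedback F.

• Assumed proposition (6): F is a correct indicator of b, i.e., F → b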

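• With assumed proposition (6), F → b, in place (the measurement of the language behavior sample b is ultimately expressed as a score or grade), we obtain:
• Derived proposition (7): F → A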
In general, test reliability is defined as the extent to which the results can be considered consistent or stable.
Personal attributes that are not related to language ability include:
•individual characteristics such as
- cognitive style and
- knowledge of particular content areas
•group characteristics such as
- sex
- race
- ethnic background
Random factors are largely unpredictable and temporary, such as
1) mental alertness or emotional state, and
2) uncontrolled differences in test method facets, e.g., changes in the test environment from one day to the next.
The degree to which a test is consistent, or stable, can be estimated by calculating a reliability coefficient.
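The source does not say which reliability coefficient is meant. One standard estimate in classical test theory is the correlation between scores on two parallel forms, which under X = T + E approximates Var(T) / Var(X). The sketch below simulates that case; the sample size, score distributions, and function names are hypothetical choices made for illustration.

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate_parallel_forms(n_examinees, true_sd, error_sd, seed=0):
    """Simulate two parallel forms sharing true scores but not errors."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_examinees)]
    form_a = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    form_b = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return form_a, form_b

form_a, form_b = simulate_parallel_forms(5000, true_sd=10.0, error_sd=5.0)
estimate = pearson_r(form_a, form_b)

# Ratio of true-score variance to observed-score variance for these settings.
theoretical = 10.0 ** 2 / (10.0 ** 2 + 5.0 ** 2)
print(round(estimate, 3), theoretical)
```

The estimated coefficient should land near the theoretical ratio; a larger error standard deviation pushes both toward zero, which matches the reading of reliability as the share of score variance not due to measurement error.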
Two fundamental questions:
For reliability, the question to answer is:
How much variance in test scores is due to measurement error?
For validity, the question to answer is:
What specific abilities account for the reliable variance in test scores?
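The point is that a test can be reliable without being valid. In other words, a test can consistently measure something other than what it was designed to measure (reliability is a property of the test scores themselves, whereas validity concerns the accuracy with which the scores are interpreted and used; the two are closely related but different in nature).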
Hence test reliability and validity, though related, are different test characteristics.
In fact, reliability can be viewed as a precondition for validity, that is, a test cannot be valid unless it is first reliable.
Validity is especially important when it is involved in the decisions that teachers regularly make about their students.
Teachers certainly want to base their admissions, placement, achievement, and diagnostic decisions on tests that are actually testing what they claim to measure.
Adopting, developing, and adapting tests for such decisions is difficult enough without having to also worry about whether the tests are measuring the wrong student characteristics, abilities, proficiencies, etc.
[Basic questions]
1) What attribute is being measured;
2) To what degree the attribute intended to be measured is actually measured.

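(1) Validity refers to test results. That is, test validity is the degree to which the test results are valid; it is not a property of the test itself.

(2) Validity is relative to the specific purpose of the test. It is not universal. Therefore, when evaluating the validity of a test, one must consider its particular use and state what it is valid for measuring.

(3) Validity is only a matter of degree. It is not a distinction between "present" and "absent"; it is described as "high", "moderate", or "low".

Validity research does not examine the test content itself, nor the "validity" of the test scores themselves (test scores themselves have no validity, only reliability -- LP); it examines the validity of the way test scores are interpreted and used.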

Content Relevance
involves the specification of ability domain (Bachman, 1990:42-4, about operationally defining constructs);
requires the specification of the test method facets (ibid:119)(e.g., what it is that the test measures, the attributes of the stimuli that will be presented to the test-takers, the nature of the responses that the test taker is expected to make…);
Content Coverage
ideally, we would have a well-defined domain that specifies the entire set, or population, of possible test tasks;
then, we could follow a standard procedure for random sampling (or stratified random sampling, in the case of heterogeneous domains) to ensure that the tasks required by the test are representative of that domain.
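The stratified random sampling mentioned above can be sketched as follows. The domain, the stratum names, the task identifiers, and the 20% sampling fraction are all hypothetical; the sketch only assumes that the domain has already been fully enumerated and partitioned into strata.

```python
import random

def stratified_sample(domain, per_stratum, seed=0):
    """Randomly draw per_stratum[s] tasks, without replacement, from each stratum s."""
    rng = random.Random(seed)
    sample = {}
    for stratum, tasks in domain.items():
        k = per_stratum[stratum]
        if k > len(tasks):
            raise ValueError(f"stratum {stratum!r} has only {len(tasks)} tasks")
        sample[stratum] = rng.sample(tasks, k)
    return sample

# Hypothetical domain: task identifiers grouped by skill area.
domain = {
    "reading": [f"R{i}" for i in range(1, 41)],    # 40 tasks
    "listening": [f"L{i}" for i in range(1, 31)],  # 30 tasks
    "writing": [f"W{i}" for i in range(1, 11)],    # 10 tasks
}

# Proportional allocation: 20% of each stratum.
per_stratum = {"reading": 8, "listening": 6, "writing": 2}

test_form = stratified_sample(domain, per_stratum)
for stratum, tasks in test_form.items():
    print(stratum, tasks)
```

Sampling within each stratum keeps every part of a heterogeneous domain represented in proportion, which simple random sampling over the pooled tasks does not guarantee for small test forms.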
Authenticity
“…define authenticity as the degree of correspondence of the characteristics of a given language test task to the features of a TLU task…”
(Bachman & Palmer, 1996:23)
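That is, authenticity is the degree of correspondence between the characteristics of a given test task and those of a target language use (TLU) task.
For example, when developing a reading test, we should choose passages whose characteristics (content, discourse structure, subject matter, etc.) match those of the materials that must be read in real reading settings.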

“We define interactiveness as the extent and type of involvement of the test taker’s individual characteristics in accomplishing a test task.”
The individual characteristics:
1) language ability (language knowledge and strategic competence, or metacognitive strategies),
2) topical knowledge, and
3) affective schemata.
For example, a test task that requires test takers to relate the topical content of the input to their own topical knowledge is more interactive.

“Impact can be defined broadly in terms of the various ways in which test use affects the society, an education system, and the individuals ...” (Bachman & Palmer, 1996: 39)
Impact operates at two levels:
a) a micro level – in terms of the individuals who are affected by the particular test use;
b) a macro level – in terms of the educational system or society.
Practicality is included as a major concern in test design because, however valid and reliable a test may be, if it is not practical to administer in a specific context, it will not be taken up in that context.
Practicality covers a range of issues, such as
•The cost of development and maintenance
•Test length
•Ease of marking
•Time required to administer
•Ease of administration
•Equipment availability, etc.
Practicality can be defined “as the relationship between the resources that will be required in the design, development, and use of the test and the resources that will be available for these activities.”
$$\text{Practicality} = \frac{\text{available resources}}{\text{required resources}}$$
If Practicality >= 1, the test development and use is practical.
If Practicality < 1, the test development and use is not practical.
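As a worked illustration of the ratio and its threshold, the sketch below computes practicality for a single resource measure. Expressing resources in person-hours, the figures used, and the function names are assumptions made for the example; the source does not prescribe a unit.

```python
def practicality(available, required):
    """Return available / required for one resource measure."""
    if required <= 0:
        raise ValueError("required resources must be positive")
    return available / required

def is_practical(available, required):
    """A test is practical when available resources cover required ones."""
    return practicality(available, required) >= 1

# Hypothetical figures in person-hours.
print(practicality(120, 150), is_practical(120, 150))
print(practicality(200, 150), is_practical(200, 150))
```

In the first case the available 120 person-hours fall short of the 150 required, so the ratio is below 1 and the test is not practical; in the second the requirement is covered.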
Resources:
- Human resources: test writers, scorers, administrators, and clerical support
- Material resources: space, equipment, materials
- Time