Language Testing Study Materials 3


Organized Knowledge Points on Language Testing Theory in Applied Linguistics

Within applied linguistics, language testing theory is an important branch: it is of key significance for assessing language learners' ability, guiding teaching practice, and advancing language education.

The following organizes some of the important knowledge points of language testing theory.

I. Definition and Purposes of Language Testing

A language test is a means of measuring and assessing a language learner's ability. Its main purposes include:

1. Providing a basis for educational decisions, for example determining whether students are promoted, held back, or allowed to graduate.

2. Evaluating teaching effectiveness, helping teachers understand how well their methods work and how students are progressing.

3. Giving students feedback, so that they know their current level and weaknesses and can adjust their learning strategies accordingly.

II. Types of Language Tests

1. Proficiency tests (Proficiency Test) measure a candidate's overall command of a language, without regard to prior learning experience or any particular course content. Well-known examples include IELTS and TOEFL.

2. Achievement tests (Achievement Test) focus on the language knowledge and skills mastered in a particular course or stage of study and are closely tied to the teaching content, for example school final examinations and unit quizzes.

3. Diagnostic tests (Diagnostic Test) are mainly used to identify the specific problems and weak points in a candidate's learning, so that subsequent teaching and study can be targeted accordingly.

4. Aptitude tests (Aptitude Test) predict a candidate's potential for learning a language, rather than assessing present proficiency.

III. Quality Criteria for Language Tests

1. Validity (Validity) is the degree to which a test accurately measures the language ability or knowledge it is intended to measure. Validity can be divided into content validity, construct validity, predictive validity, and so on.

Content validity: whether the test content covers the language skills and knowledge points to be assessed.

Construct validity: whether the test results are consistent with the theoretical structure of language ability.

Predictive validity: whether test scores can effectively predict candidates' future performance in language learning or real language use.

2. Reliability (Reliability) reflects the stability and consistency of test results. It includes test-retest reliability, parallel-forms reliability, split-half reliability, and so on.

Test-retest reliability: the correlation between the results of the same test administered to the same group of candidates at two different times.

Parallel-forms reliability: the correlation between the results obtained when the same group of candidates takes two papers that are similar in content but not identical.
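Both test-retest and parallel-forms reliability are usually quantified as a Pearson correlation between the two sets of scores. As a reference, this is the standard statistical formula (not quoted from the notes above), with x_i and y_i the two scores of candidate i:

```latex
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
        {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}
```

A coefficient near 1 indicates the two administrations rank and space candidates almost identically; a coefficient near 0 indicates the results are unrelated.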

Test of Language Development, Third Edition (TOLD-3)

The Test of Language Development, Third Edition (TOLD-3) is an instrument for measuring children's language development, designed to assess children aged 3 to 21. Its design covers a broad range of language skills, including speech development, auditory processing, and oral expression.

The assessment generally uses the fill-in response format most commonly employed in psychological testing, so that the process is easy for examinees to understand and engaging to take part in.

1. Overview: TOLD-3 was developed in 1992 by Ronald D. Mathers, Pearl O. Mathers, and Ralph M. Reiff to measure the developmental level of young children's language skills. It can be used to diagnose developmental language failure and to detect abnormalities in language development. TOLD-3 can also be extended with an analysis of the home language environment, providing a comprehensive survey of language development.

2. Assessment method: TOLD-3 usually consists of two parts. The first is a survey of teacher and parent reports, grounded in the home and social environment and cultural background; the second is direct testing, made up of parent-child activities and measures of language and grammar. Beyond the report analysis, TOLD-3 draws on assessment and survey techniques of proven validity, such as mental-lexicon tests, specially designed laboratory games, and simulated verbal-behavior tasks.

3. Results: TOLD-3 yields results that are in-depth, reliable, and comparable, giving families a useful frame of reference for gauging a child's level in each language-skill area and the progress of language development. The results can help family members, teachers, and counselors work together with the family to build a specific plan for improving the examinee's language development.

4. Wide use: TOLD-3 has been used widely in schools around the world; it has appeared in well-known psychology, linguistics, and education journals and is regarded as an effective quantitative instrument. Its main strengths are an easy-to-use design, low cost, and results across multiple test areas that can be read at a glance, showing clearly where the examinee stands in a given area of development.

5. Age range: Unlike other common language-development tests, TOLD-3 is designed for the language development of children aged 3 to 21, and each test area provides assessment tools set for different age bands.

An Outline of Applied Linguistics, 3rd Edition, Chapter 3: Language Testing

(2) Experiments in the social sciences are carried out not in the laboratory but in the real world, so it is hard to keep laboratory conditions identical for all subjects, and hard to separate the experiment from real-life activity.
(3) A human being is a highly complex whole: even the same subject will show intellectual, physiological, and psychological differences from one testing occasion to the next under different external conditions, and this affects the test results.

Section 1: The Application of Experimental Methods in Linguistics
III. Measurement Reliability and Validity
(1) Measuring reliability
2. Methods of estimating reliability: the test-retest method, the parallel-forms method, the split-half method
3. The rater-reliability problem
When raters mark subjective items (such as the compositions and oral tests in a language test), error often occurs; this raises the question of rater reliability, which has two sides: the intra-rater reliability of a single rater and the inter-rater reliability across raters.
Section 1: The Application of Experimental Methods in Linguistics
II. The Nature of Language Testing
(2) The information a language test involves. A language test comprises two aspects: the language side and the testing side. Language testing therefore has to weigh many theoretical and practical factors, and pays close attention to three kinds of information: 1. information about language skills; 2. information about language development; 3. information about language knowledge.
Section 2: The Theoretical Framework of Modern Language Testing
Section 3: The Production Procedure for Test Items
II. Scoring the Items
(1) Scoring objective items: (1) manual scoring; (2) manual input plus machine scoring; (3) machine reading plus machine scoring
Large-scale tests should use methods (2) and (3) wherever possible.
(2) Scoring subjective items. Subjective scoring mainly concerns items testing productive skills and productive use. 1. Draw up a unified scoring standard. 2. Train the raters. 3. Improve the scoring procedure.
...monitors [teaching and learning], and provides experimental and investigative methods for the teaching and learning of language. The contribution of language testing to applied linguistics can be summed up in three points: (1) it turns the theoretical framework of applied linguistics into practical use; (2) it gives the design of syllabuses and teaching arrangements clear goals and standards; (3) it offers a methodological reference for research in applied linguistics. (From Section 2: The Theoretical Framework of Modern Language Testing)

Language Testing Study Materials 3

Chapter 3: The Reliability of Testing (测试的信度)

• The definition of reliability
• The reliability coefficient
• How to make tests more reliable

What is reliability?
Reliability refers to the trustworthiness and stability of candidates' test results. In other words, if a group of students were given the same test twice at different times, the more similar the scores, the more reliable the test is said to be.

How to establish the reliability of a test?
It is possible to quantify the reliability of a test in the form of a reliability coefficient. Reliability coefficients allow us to compare the reliability of different tests. The ideal reliability coefficient is 1. A test with a reliability coefficient of 1 is one which would give precisely the same results for a particular set of candidates regardless of when it happened to be administered. A test which had a reliability coefficient of zero would give sets of results quite unconnected with each other. It is between the two extremes of 1 and zero that genuine test reliability coefficients are found.

How high a coefficient should we expect for different types of language tests? Lado says: good vocabulary, structure and reading tests are usually in the 0.9 to 0.99 range, while auditory comprehension tests are more often in the 0.8 to 0.89 range. A reliability coefficient of 0.85 might be considered high for an oral production test but low for a reading test.

Ways to establish the reliability of a test:

1. Test-retest method
This means obtaining two sets of scores for comparison. The most obvious way of obtaining these is to get a group of subjects to take the same test twice.

2. Split-half method
In this method, the subjects take the test in the usual way, but each subject is given two scores: one score for one half of the test, the second score for the other half. The two sets of scores are then used to obtain the reliability coefficient as if the whole test had been taken twice. In order for this method to work, the test must be split into two halves which are really equivalent, through the careful matching of items (in fact, where items in the test have been ordered in terms of difficulty, a split into odd-numbered items and even-numbered items may be adequate).

3. Parallel forms method (the alternate forms method)
This means using two different forms of the same test to measure a group of students consecutively or within a very short time. However, alternate forms are often simply not available.

How to make tests more reliable
As we have seen, there are two components of test reliability: the performance of candidates from occasion to occasion, and the reliability of the scoring. Here we will begin by suggesting ways of achieving consistent performances from candidates and then turn our attention to scorer reliability.

1. Take enough samples of behavior
Other things being equal, the more items you have on a test, the more reliable that test will be. For example, if we wanted to know how good an archer someone was, we wouldn't rely on the evidence of a single shot at the target. That one shot could be quite unrepresentative of their ability. To be satisfied that we had a really reliable measure of the ability, we should want to see a large number of shots at the target. The same is true for language testing. It has been demonstrated empirically that the addition of further items will make a test more reliable.

The additional items should be independent of each other and of existing items. For example, a reading test asks the question: "Where did the thief hide the jewels?" If an additional item following that took the form "What was unusual about the hiding place?", would it make a full contribution to an increase in the reliability of the test? No. Why not? Because it is hardly possible for someone who got the original question wrong to get the supplementary question right. We do not get an additional sample of their behavior, so the reliability of our estimate of their ability is not increased. Each additional item should, as far as possible, represent a fresh start for the candidate.

Do you think the longer a test is, the more reliable it will be? It is important to make a test long enough to achieve satisfactory reliability, but it should not be made so long that the candidates become so bored or tired that the behavior they exhibit becomes unrepresentative of their ability.

2. Do not allow candidates too much freedom
In general, candidates should not be given a choice, and the range over which possible answers might vary should be restricted. Compare the following writing tasks:
a) Write a composition on tourism.
b) Write a composition on tourism in this country.
c) Write a composition on how we might develop the tourist industry in this country.
d) Discuss the following measures intended to increase the number of foreign tourists coming to this country:
i) More/better advertising and/or information (where? what form should it take?)
ii) Improved facilities (hotels, transportation, communication, etc.)
iii) Training of personnel (guides, hotel managers, etc.)
The successive tasks impose more and more control over what is written. The fourth task is likely to be a much more reliable indicator of writing ability than the first. But in restricting the students we must be careful not to distort too much the task that we really want to see them perform.

3. Write unambiguous items
It is essential that candidates should not be presented with items whose meaning is not clear, or to which there is an acceptable answer the test writer has not anticipated. The best way to arrive at unambiguous items is, having drafted them, to subject them to the critical scrutiny of colleagues, who should try as hard as they can to find alternative interpretations to the ones intended.

4. Provide clear and explicit instructions
This applies both to written and oral instructions. If it is possible for candidates to misinterpret what they are asked to do, then on some occasions some of them certainly will. A common fault of tests written for the students of a particular teaching institution is the supposition that the students all know what is intended by carelessly worded instructions. The frequency of the complaint that students are unintelligent, have been stupid, or have willfully misunderstood what they were asked to do reveals that the supposition is often unwarranted. Test writers should not rely on the students' powers of telepathy to elicit the desired behavior. The best means of avoiding problems is the use of colleagues to criticize drafts of instructions (including those which will be spoken). Spoken instructions should always be read from a prepared text in order to avoid introducing confusion.

5. Ensure that tests are well laid out and perfectly legible
Too often, institutional tests are badly typed (or handwritten), have too much text in too small a space, and are poorly reproduced. As a result, students are faced with additional tasks which are not ones meant to measure their language ability. Their variable performance on the unwanted tasks will lower the reliability of a test.

6. Candidates should be familiar with format and testing techniques
If any aspect of a test is unfamiliar to candidates, they are likely to perform less well than they otherwise would. For this reason, every effort must be made to ensure that all candidates have the opportunity to learn just what will be required of them. This may mean the distribution of sample tests (or of past test papers), or at least the provision of practice materials in the case of tests set within teaching institutions.

7. Provide uniform and non-distracting conditions of administration
The greater the differences between one administration of a test and another, the greater the differences one can expect between a candidate's performance on the two occasions. Great care should be taken to ensure uniformity. For example, timing should be specified and strictly adhered to, and the acoustic conditions should be similar for all administrations of a listening test. Every precaution should be taken to maintain a quiet setting with no distracting sounds or movements.

How to obtain scorer reliability

1. Use items that permit scoring which is as objective as possible
This may appear to be a recommendation to use multiple choice items, which permit completely objective scoring. It is not. While it would be mistaken to say that multiple choice items are never appropriate, it is certainly true that there are many circumstances in which they are quite inappropriate. What is more, good multiple choice items are notoriously difficult to write and always require extensive pretesting. An alternative to multiple choice is the open-ended item which has a unique, possibly one-word, correct response which the candidates produce themselves. This too should ensure objective scoring, but in practice problems with such matters as spelling, which may make a candidate's meaning unclear, often make demands on the scorer's judgment. The longer the required response, the greater the difficulties of this kind. One way of dealing with this is to structure the candidate's response by providing part of it. For example, the open-ended question "What was different about the results?" may be designed to elicit the response "Success was closely associated with high motivation." This is likely to cause problems for scoring. Greater scorer reliability will probably be achieved if the question is followed by: "_____ was more closely associated with _____."

2. Make comparisons between candidates as direct as possible
This reinforces the suggestion already made that candidates should not be given a choice of items and that they should be limited in the way they are allowed to respond. Scoring compositions all on one topic will be more reliable than if the candidates are allowed to choose from six topics, as has been the case in some well-known tests.

3. Provide a detailed scoring key
This should specify acceptable answers and assign points for partially correct responses. For high scorer reliability the key should be as detailed as possible in its assignment of points. It should be the outcome of efforts to anticipate all possible responses and should have been subjected to group criticism. (This advice applies only where responses can be classed as partially or totally 'correct', not in the case of compositions, for instance.)

4. Train scorers
This is especially important where scoring is more subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score compositions accurately from past administrations. After each administration, patterns of scoring should be analyzed. Individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.

5. Agree on acceptable responses and appropriate scores at the outset of scoring
A sample of scripts should be taken immediately after the administration of the test. Where there are compositions, archetypical representatives of different levels of ability should be selected. Only when all scorers are agreed on the scores to be given to these should real scoring begin. For short-answer questions, the scorers should note any difficulties they have in assigning points (the key is unlikely to have anticipated every relevant response), and bring these to the attention of whoever is supervising that part of the scoring. Once a decision has been taken as to the points to be assigned, the supervisor should convey it to all the scorers concerned.

6. Identify candidates by number, not name
Scorers inevitably have expectations of candidates that they know. Except in purely objective testing, this will affect the way that they score. Studies have shown that even where the candidates are unknown to the scorers, the name on a script (or a photograph) will make a significant difference to the scores given. For example, a scorer may be influenced by the gender or nationality of a name into making predictions which can affect the score given. Identifying candidates only by number will reduce such effects.

7. Employ multiple, independent scoring
As a general rule, and certainly where testing is subjective, all scripts should be scored by at least two independent scorers. Neither scorer should know how the other has scored a test paper. Scores should be recorded on separate score sheets and passed to a third, senior colleague, who compares the two sets of scores and investigates discrepancies.

Reliability and validity
To be valid a test must provide consistently accurate measurements. It must therefore be reliable. A reliable test, however, may not be valid at all. For example, as a writing test we might require candidates to write down the translation equivalents of 500 words in their own language. This could well be a reliable test, but it is unlikely to be a valid test of writing. In our efforts to make tests reliable, we must be wary of reducing their validity. This depends in part on what exactly we are trying to measure by setting the task. If we are interested in candidates' ability to structure a composition, then it would be hard to justify providing them with a structure in order to increase reliability. At the same time we would still try to restrict candidates in ways which would not render their performance on the task invalid. There will always be some tension between reliability and validity. The tester has to balance gains in one against losses in the other.
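As a rough illustration of the split-half method described above, here is a minimal sketch (not from the original text; the toy scores and function names are invented for illustration). It splits items into odd- and even-numbered halves, correlates the half scores, and applies the Spearman-Brown correction to estimate the reliability of the full-length test:

```python
# Minimal sketch of split-half reliability with the Spearman-Brown correction.
# item_scores[i][j] = score of candidate i on item j (1 = right, 0 = wrong).

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def split_half_reliability(item_scores):
    # Odd/even split, as suggested when items are ordered by difficulty.
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    r_half = pearson(odd, even)
    # Spearman-Brown: estimate the reliability of the full-length test.
    return 2 * r_half / (1 + r_half)

# Invented toy data: 5 candidates x 6 items.
scores = [
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
]
print(round(split_half_reliability(scores), 2))
```

The Spearman-Brown step is what distinguishes this from a plain correlation: the half-test correlation understates the reliability of the whole test, since (as the chapter argues) longer tests are more reliable.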

Language Testing 3
proficiency test
Designed to measure people's ability in a language regardless of any training they may have had; not syllabus-based, but built on a specification of what test candidates have to be able to do for a particular purpose (e.g. TOEFL, PETS, IELTS).
Lado
3. It is more effective to test grammar, which is limited, than situation, which is infinite. Hence the grammar items are designed without context. 4. Forms should be the points tested in language tests, because the two languages differ at the points where transfer occurs.
If the NR test is properly designed, the scores attained will typically be distributed in the shape of a "normal" bell-shaped curve. The items or parts will be selected according to how adequately they represent these ability levels or content domains.

Summary of Knowledge Points on Types of Language Tests

There are many kinds of language tests, such as written tests, oral tests, listening tests, and reading tests. When testing, the method and scoring criteria should be chosen according to the purpose of the test. Different testing tasks target different language skills, such as vocabulary, grammar, listening, speaking, reading, and writing. The knowledge points involved in these tests are introduced one by one below.

I. Vocabulary

Vocabulary is the basic building material of language and the foundation of language use. In language tests, vocabulary testing usually covers word meaning, part of speech, phrases and set expressions, and context. Test takers need a command of spelling, pronunciation, usage, and collocation.

1. Word meaning: the basic sense of a word, one of the key points of vocabulary testing. Test takers need to master basic meanings and know the various senses and uses of common words.

2. Part of speech: an important property of a word that determines its usage and collocations. Test takers need to master words of the various classes and understand their roles and uses in the language.

3. Phrases and set expressions: the fixed collocations common in a language, also a key point in testing. Test takers need to master common phrases and expressions and know their meanings and uses.

4. Context: an important basis for word use, which helps in understanding a word's meaning and usage. Test takers need to use words in different contexts and grasp their specific meanings and uses there.

II. Grammar

Grammar comprises the basic rules of a language, determining its structure and usage. In language tests, grammar testing usually covers sentence structure, tense, voice, mood, word order, subject-verb agreement, comparatives and superlatives of adjectives and adverbs, conjunctions, pronouns, and so on.

1. Sentence structure: one of the basic areas of grammar and the basic unit of expression. Test takers need to master the different sentence patterns and understand their composition and use.

2. Tense: a grammatical form expressing the time of an action, another key point in testing. Test takers need to master the various tenses and understand their differences and contexts of use.

3. Voice: a grammatical form expressing the relation between subject and predicate, also a key testing point. Test takers need to master the voices and understand their functions and distinctions within the sentence.

4. Mood: a grammatical form expressing the speaker's attitude and tone, likewise a key point in language tests.

Basic Concepts of Language Tests (complete version)

Comparison of norm-referenced and criterion-referenced tests:
- Nature: a norm-referenced test compares test-takers with one another (the score distribution); a criterion-referenced test compares performance against pre-specified content.
- Purpose of the score distribution: a norm-referenced test aims to discriminate among the abilities of all test-takers; a criterion-referenced test looks at how much of the taught content has been mastered.
- Content (content validity): in a norm-referenced test the content is unknown or little known to test-takers in advance; in a criterion-referenced test the content is fully known.

...and the scoring procedures are all identical and may not be altered at will; third, they have all been piloted, after a large amount of empirical research.
Lecture 2: Basic Concepts of Language Tests
❖ The functions and purposes of language tests ❖ The types of language tests ❖ The quality standards of language tests

The functions and purposes of language tests
• Function: to measure learners' language ability scientifically
• Purposes: selection, diagnosis, evaluation, prediction, research

Types of tests
• Classified by use (purpose):
• proficiency test (ability or level test),

What are the functions and purposes of language tests?
Function: to measure learners' language ability scientifically

facility value (难易度)

• Classified by frame of reference:
• norm-referenced test: a norm is the distribution of test scores in a standardization sample

• content validity (内容效度) • criterion-related validity (效标关联效度) • concurrent validity (共时效度) • predictive validity (预测效度) • construct validity (构想效度) • face validity (表面效度)

Language Testing Study Materials

Chapter 5: Test Techniques and Measuring Overall Ability (测试的技巧和测试综合能力)

What are test techniques?
They are means of eliciting behavior from candidates which will tell us about their language abilities.

What techniques do we need?
We need techniques which:
1. will elicit behavior which is a reliable and valid indicator of the ability in which we are interested;
2. will elicit behavior which can be reliably scored;
3. are as economical of time and effort as possible;
4. will have a beneficial backwash effect.

In this chapter we'll discuss one technique, multiple choice. Then we'll examine the use of techniques which may be used to test 'overall ability'.

Multiple choice:
(1) a stem: Enid has been here ______ half an hour.
(2) a number of options: A. during B. for C. while D. since
(3) the key
(4) the distractors

What is the most obvious advantage of multiple choice?
• Scoring can be perfectly reliable.
• Scoring should also be rapid and economical.
• It can include more items than would otherwise be possible in a given period of time.

The difficulties with multiple choice are as follows:

1) The technique tests only recognition knowledge
If there is a lack of fit between at least some candidates' productive and receptive skills, then performance on a multiple choice test may give a quite inaccurate picture of those candidates' ability. A multiple choice grammar test score, for example, may be a poor indicator of someone's ability to use grammatical structures. The person who can identify the correct response in the item above may not be able to produce the correct form when speaking or writing. This is in part a question of construct validity: whether or not grammatical knowledge of the kind that can be demonstrated in a multiple choice test underlies the productive use of grammar. Even if it does, there is still a gap to be bridged between knowledge and use; if use is what we are interested in, that gap will mean that test scores are at best giving incomplete information.

2) Guessing may have a considerable but unknowable effect on test scores
The chance of guessing the correct answer in a three-option multiple choice item is one in three, or roughly thirty-three percent. On average we would expect someone to score 33 on a 100-item test purely by guesswork. We would expect some people to score fewer than that by guessing, others to score more. The trouble is that we can never know what part of any particular individual's score has come about through guessing. Attempts are sometimes made to estimate the contribution of guessing by assuming that all incorrect responses are the result of guessing, and by further assuming that the individual has had average luck in guessing. Scores are then reduced by the number of points the individual is estimated to have obtained by guessing. However, neither assumption is necessarily correct, and we cannot know that the revised score is the same as (or very close to) the one an individual would have obtained without guessing. While other testing methods may also involve guessing, we would normally expect the effect to be much less, since candidates will usually not have a restricted number of responses presented to them (with the information that one of them is correct).

3) The technique severely restricts what can be tested
The basic problem here is that multiple choice items require distractors, and distractors are not always available. In a grammar test, it may not be possible to find three or four plausible alternatives to the correct structure. The result is that command of what may be an important structure is simply not tested. An example would be the distinction in English between the past tense and the present perfect. Certainly for learners at a certain level of ability, in a given linguistic context, there are no other alternatives that are likely to distract. The argument that this must be a difficulty for any item that attempts to test for this distinction is difficult to sustain, since other items that do not overtly present a choice may elicit the candidate's usual behavior, without the candidate resorting to guessing.

4) It is very difficult to write successful items
A further problem with multiple choice is that, even where items are possible, good ones are extremely difficult to write. Professional test writers reckon to have to write many more items than they actually need for a test, and it is only after pretesting and statistical analysis of performance on the items that they can recognize the ones that are usable. It is some teachers' experience that multiple choice tests produced for use within institutions are often shot through with faults. Common amongst these are: more than one correct answer; no correct answer; clues in the options as to which is correct (for example, the correct option may be different in length from the others); ineffective distractors. The amount of work and expertise needed to prepare good multiple choice tests is so great that, even if one ignored other problems associated with the technique, one would not wish to recommend it for regular achievement testing (where the same test is not used repeatedly) within institutions. Savings in time for administration and scoring will be outweighed by the time spent on successful test preparation. It is true that the development and use of item banks, from which a selection can be made for particular versions of a test, makes the effort more worthwhile, but great demands are still made on time and expertise.

5) Backwash may be harmful
It should hardly be necessary to point out that where a test which is important to students is multiple choice in nature, there is a danger that practice for the test will have a harmful effect on learning and teaching. Practice at multiple choice items (especially when, as happens, as much attention is paid to improving one's educated guessing as to the content of the items) will not usually be the best way for students to improve their command of a language.

6) Cheating may be facilitated
The fact that the responses on a multiple choice test (a, b, c, d) are so simple makes them easy to communicate to other candidates nonverbally. Some defence against this is to have at least two versions of the test, the only difference between them being the order in which the options are presented.

All in all, the multiple choice technique is best suited to relatively infrequent testing of large numbers of candidates. This is not to say that there should be no multiple choice items in tests produced regularly within institutions. In setting a reading comprehension test, for example, there may be certain tasks that lend themselves very readily to the multiple choice format, with obvious distractors presenting themselves in the text. There are real-life tasks (say, a shop assistant identifying which one of four dresses a customer is describing) which are essentially multiple choice. The simulation in a test of such a situation would seem to be perfectly appropriate.
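The correction for guessing sketched in point 2) above (subtract the points estimated to have come from guessing) is conventionally written as the following standard formula; this is general testing practice, not a formula quoted from this text. With R right answers, W wrong answers, and k options per item:

```latex
\text{corrected score} = R - \frac{W}{k-1}
```

For the three-option case in the text, a candidate with 33 right and 67 wrong out of 100 items gets 33 - 67/2 = -0.5, i.e. roughly zero, which matches the expectation that pure guesswork should earn nothing after correction.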
What the reader is being urged to avoid is the excessive, indiscriminate, and potentially harmful use of the technique.

Cloze, C-Test, and dictation: measuring overall ability
The three techniques have in common the fact that they seem to offer economical ways of measuring overall ability in a language. The cloze technique has in addition been recommended as a means of measuring reading ability.

Varieties of cloze procedure
In its original form, the cloze procedure involves deleting a number of words in a passage, leaving blanks, and requiring the person taking the test to attempt to replace the original words. After a short unmutilated 'lead-in', it is usually about every seventh word which is deleted.

e.g. What is a college?
Confusion exists concerning the real purposes, aims, and goals of a college. What are these? What should a college be?
Some believe that the chief function 1.____ even a liberal arts college is 2.____ vocational one. I feel that the 3.____ function of a college, while important, 4.____ nonetheless secondary. Others profess that the 5.____ purpose of a college is to 6.____ paragons…

The cloze procedure seemed very attractive. Cloze tests were easy to construct, administer and score. Reports of early research seemed to suggest that it mattered little which passage was chosen or which words were deleted; the result would be a reliable and valid test of candidates' underlying language abilities. Unfortunately, cloze could not deliver all that was promised on its behalf. For one thing, even if some underlying ability is being measured through the procedure, it is not possible to predict accurately from this what people's ability is with respect to the variety of separate skills (speaking, writing, etc.) in which we are usually interested. Further, it turned out that different passages gave different results, as did the deletion of different sets of words in the same passage. Another matter for concern was the fact that intelligent and educated native speakers varied quite considerably in their ability to predict the missing words. What is more, some of them did less well than many non-native speakers. The validity of the procedure, even as a very general measure of overall ability, was thus brought into question.

There seems to be fairly general agreement now that the cloze procedure cannot be depended upon automatically to produce reliable, useful tests. There is need for careful selection of texts and some pretesting. The fact that deletion of every nth word almost always produces problematical items (for example, items where it is impossible to predict the missing word) points to the advisability of a careful selection of words to delete, from the outset.

The following cloze passage is constructed according to the above advice.
e.g. Choose the best word to fill each of the numbered blanks in the passage below. Write your answers in the space provided in the right-hand margin. Write only ONE word for each blank.
The earth's vegetation is _(1)_ of a web of life in which there are intimate and essential relations between plants and the earth, between plants and _(2)_ plants, between plants and animals. Sometimes we have no _(3)_ but to disturb _(4)_ relationships, but we should _(5)_ so thoughtfully, with full awareness that _(6)_ we do may _(7)_ consequences remote in time and place.

The deletions in the above passage were chosen to provide 'interesting' items. Most of them we might be inclined to regard as testing 'grammar', but to respond to them successfully more than grammatical ability is needed; processing of various features of context is usually necessary. Another feature is that native speakers of the same general academic ability as the students for whom the test was intended could be expected to provide acceptable responses to all of the items. The acceptable responses are themselves limited in number. Scores on cloze passages of this kind in the Cambridge Proficiency Examination have correlated very highly with performance on the test as a whole. It is this kind of cloze that experts would recommend for measuring overall ability.

It may reasonably be thought that cloze procedures, since they produce purely pencil-and-paper tests, cannot tell us anything about the oral component of overall proficiency. However, some research has explored the possibility of using cloze passages based on tape-recordings of oral interaction to predict oral ability.

Family reunion
Mother: I love that dress, Mum.
Grandmother: Oh, it's M and S.
Mother: Is it?
Grandmother: Yes, five pounds.
Mother: My goodness, it's not, Mum.
Grandmother: But it's made of that T-shirt stuff, so I don't think it'll wash very _________(1), you know, they go all…
Mother: sort _______(2)… I know the kind, yes…
Grandmother: Yes.

Advice on creating cloze-type passages
1. The chosen passages should be at a level of difficulty appropriate to the people who are to take the test. If there is doubt about the level, a range of passages should be selected for pretesting. Indeed, it is always advisable to pretest a number of passages, as their behavior is not always predictable.
2. The text should be of a style appropriate to the kind of language ability being tested.
3. After a couple of sentences of uninterrupted text, deletions should be made at about every eighth or tenth word (the so-called pseudo-random method of deletion). Individual deletions can then be moved a word or two to left or right, to avoid problems or to create interesting 'items'.
4. The passage should then be tried out on a good number of comparable native speakers and the range of acceptable responses determined.
5. Clear instructions should be devised. In particular, it should be made clear what is to be regarded as a word (with examples of isn't etc., where possible). Students should be assured that no one can possibly replace all the original words exactly. They should be encouraged to begin by reading the passage right through to get an idea of what is being conveyed (the correct responses early in the passage may be determined by later content).
6. The layout of the test should facilitate scoring. Scorers are given a card with the acceptable responses written in such a way as to lie opposite the candidates' responses.
7. Anyone who is to take a cloze test should have had several opportunities to become familiar with the technique. The more practice they have had, the more likely it is that their scores will represent their true ability in the language.
8. Cloze test scores are not directly interpretable. In order to be able to interpret them we need to have some other measure of ability. If a series of cloze passages is to be used as a placement test, then the obvious thing would be to have all students currently in the institution complete the passages. Their cloze scores could then be compared with the level at which they are studying in the institution.
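A cloze passage of the mechanical every-nth-word kind described above is easy to generate programmatically. The sketch below is an illustration only: the function name and lead-in length are arbitrary choices, and real test construction would still need the manual selection of deletions and the pretesting the advice insists on. It blanks every seventh word after an unmutilated lead-in and keeps the deleted words as the scorer's key:

```python
# Minimal sketch: build a fixed-ratio cloze passage by blanking every
# nth word after an intact lead-in, returning the gapped text and the key.

def make_cloze(text, n=7, lead_in=10):
    words = text.split()
    gapped, key = [], []
    for i, word in enumerate(words):
        # Leave the lead-in intact; afterwards blank every nth word.
        if i >= lead_in and (i - lead_in) % n == n - 1:
            key.append(word)
            gapped.append(f"__({len(key)})__")
        else:
            gapped.append(word)
    return " ".join(gapped), key

passage = ("Confusion exists concerning the real purposes, aims, and goals "
           "of a college. Some believe that the chief function of even a "
           "liberal arts college is a vocational one.")
gapped, key = make_cloze(passage)
print(gapped)
print(key)  # the original words, for the scorer's card
```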
Information from teachers as to which students could be in a higher (or lower) class would also be useful. Once a pattern was established between cloze test scores and class level, the cloze passages could be used as at least part of the placement procedure.

The C-Test
What is the C-Test?
The C-Test is really a variety of cloze, which its originators claim is superior to the kind of cloze described above. Instead of whole words, it is the second half of every second word which is deleted.
e.g. There are usually five men in the crew of a fire engine. One o______ them dri____ the eng________. The lea____ sits bes_____ the dri______. The ot_______ firemen s_______ inside t________ cab o_______ the f________ engine. T____ leader h______ usually be________ in t_____ Fire Ser_____ for ma____ years. H_____ will kn_____ how t_____ fight diff_____ sorts o_____ fires. S______, when t______ firemen arr____ at a fire, it is always the leader who decides how to fight a fire. He tells each fireman what to do.

What are the advantages of the C-Test?
The advantages of the C-Test over the more traditional cloze procedure are that only exact scoring is necessary (native speakers effectively score 100 per cent) and that shorter (and so more) passages are possible. This last point means that a wider range of topics, styles, and levels of ability is possible. The deletion of elements smaller than the word is also said to result in a representative sample of parts of speech being affected. By comparison with cloze, a C-Test of 100 items takes little space and not nearly so much time to complete (candidates do not have to read so much text).

What are the disadvantages of the C-Test?
It is harder to read than a cloze passage, and correct responses can often be found in the surrounding text. Thus the candidate who adopts the right puzzle-solving strategy may be at an advantage over a candidate of similar foreign-language ability. However, research would seem to indicate that the C-Test functions well as a rough measure of overall ability in a foreign language. The advice given above on the development of cloze tests applies equally to the C-Test.

Dictation
Research revealed high correlations between scores on dictation and scores on much longer and more complex tests. Examination of performance on dictation tests made it clear that words and word order were not really given; the candidate heard only a stream of sound which had to be decoded into a succession of words, stored, and recreated on paper. The ability to identify words from context was now seen as a very desirable ability, one that distinguished between learners at different levels.

Dictation tests give results similar to those obtained from cloze tests. In predicting overall ability they have the advantage of involving listening ability. That is probably the only advantage. Certainly they are as easy to create. They are relatively easy to administer, though not as easy as the paper-and-pencil cloze. But they are certainly not easy to score. It is recommended that the score should be the number of words appearing in their original sequence (misspelled words being regarded as correct as long as no phonological rule is broken). This works quite well when performance is reasonably accurate, but is still time-consuming. With poorer students, scoring becomes tedious.

Because of this scoring problem, partial dictation may be considered as an alternative. In this, part of what is dictated is already printed on the candidate's answer sheet. The candidate has simply to fill in the gaps. It is then clear just where the candidate is up to, and scoring is likely to be more reliable.

Like cloze, dictation may prove a useful technique where estimates of overall ability are needed. The same considerations should guide the choice of passages as with the cloze procedure. The passage has to be broken down into stretches that will be spoken without a break. These should be fairly long, beyond rote memory, so that the candidates will have to decode, store, and then re-encode what they hear. It is usual, when administering the dictation, to begin by reading the entire passage straight through. Then the stretches are read out, not too slowly, one after the other, with enough time for the candidates to write down what they have heard. (It is recommended that the reader silently spell the stretch twice as a guide to writing time.)

In summary, dictation and the varieties of cloze procedure discussed above provide neither direct information on the separate skills in which we are usually interested nor any easily interpreted diagnostic information. With careful application, however, they can prove useful to non-professional testers for purposes where great accuracy is not called for.
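The recommended dictation score (the number of words appearing in their original sequence) is exactly the length of the longest common subsequence between the original passage and the candidate's transcript. A minimal sketch of that scoring rule follows; it is an illustration, not part of the original text, and it ignores the misspelling allowance, which would need a separate phonological check:

```python
# Minimal sketch: score a dictation as the number of words of the original
# that appear in the transcript in their original order (an LCS length).

def dictation_score(original, transcript):
    a = original.lower().split()
    b = transcript.lower().split()
    # Classic dynamic-programming longest-common-subsequence table.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if wa == wb
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

original = "the leader decides how to fight a fire"
transcript = "the leader decide how to fight fire"
print(dictation_score(original, transcript))  # 6 of 8 words in sequence
```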

Language Testing

Test method: direct tests (Direct Test) vs. indirect tests (Indirect Test)
Form of measurement: discrete-point tests (Discrete-point Test) vs. integrative tests (Integrative Test)
Interpretation of scores: norm-referenced tests (Norm-referenced Test) vs. criterion-referenced tests (Criterion-referenced Test)
Description (statistical charts) and interpretation (results and their causes)

II. Categories of Language Tests
By purpose: proficiency tests (Proficiency Test), achievement tests (Achievement Test), scholastic aptitude tests (Scholastic Aptitude Test), placement tests (Placement Test), diagnostic tests (Diagnostic Test)

I. Functions of Language Testing
2. The research function
Research questions and hypotheses (Questions & Hypotheses); research subjects and sampling (Objects & Sampling); research methods and procedures (Methods & Procedures): experimental design, measurement instruments, variables and their types, methods of analysis; results and discussion (Results & Discussions)
[Slide residue from a worked example on standard scores: means μ of 72, 50, 70, 3 and standard deviations σ of 8, 2, 10, 1; standard scores for candidates A (甲) and B (乙) of -.25, -.38, 2.5, 1.5, 1.9, 2.5, with totals 4.15 and 3.62; cumulative proportions F(z)甲 = .75 and F(z)乙 = .47, e.g. (1 - .47)/2 × 100 = 26; plus significance levels α = 0.05, 0.01, 0.001.]
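The standard-score calculation the slide residue points at is straightforward to reproduce. A minimal sketch follows (illustrative values only, echoing figures visible above); it converts a raw score to a z-score and then to a cumulative proportion F(z) with the normal CDF:

```python
import math

# Minimal sketch: raw score -> standard score z = (x - mean) / sd,
# then F(z), the proportion of a normal population scoring below x.

def z_score(x, mean, sd):
    return (x - mean) / sd

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Illustrative values echoing the slide: mean 72, sd 8.
z = z_score(70, mean=72, sd=8)      # -0.25, as in the residue above
print(round(z, 2), round(normal_cdf(z), 2))
```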

Linguistics Coursebook 3: Test Questions and Answers

I. Multiple choice (2 points each, 20 points in total)
1. What is the main object of study of linguistics? A. The historical development of language B. The structural system of language C. The social functions of language D. The geographical distribution of language. Answer: B
2. Which of the following is not a property of language? A. Arbitrariness B. Linearity C. Discreteness D. Continuity. Answer: D
3. What does phonetics mainly study? A. The grammatical structure of language B. The vocabulary system of language C. The pronunciation patterns of language D. The written form of language. Answer: C
4. What is the object of study of grammar? A. The sound system of language B. The vocabulary system of language C. The grammatical structure of language D. The semantic content of language. Answer: C
5. What does pragmatics mainly study? A. Rules of pronunciation B. Rules of grammar C. The contexts in which language is used D. Rules of writing. Answer: C
6. What is the smallest meaningful unit of language? A. The phoneme B. The word C. The morpheme D. The sentence. Answer: C
7. Which of the following is a communicative function of language? A. Expressing thought B. Conveying information C. Entertainment D. Instruction. Answer: B
8. What factors mainly influence language change? A. Social change B. Geographical isolation C. Cultural contact D. All of the above. Answer: D
9. What are cognates? A. Words derived from the same root B. Words close in meaning C. Words identical in form and meaning D. Words different in both form and meaning. Answer: A
10. Which of the following is studied by sociolinguistics? A. Sound change B. Vocabulary change C. The relation between language and society D. Grammatical change. Answer: C

II. Fill in the blanks (2 points each, 20 points in total)
1. Linguistics is the science that studies ________. Answer: human language
2. The arbitrariness of language means there is no necessary connection between its ________ and its ________. Answer: form; meaning
3. The linearity of language means that language unfolds ________ in time. Answer: continuously
4. The discreteness of language means that the units of language are ________. Answer: finite
5. Phonetics is the discipline that studies the ________ patterns of human language. Answer: pronunciation
6. Grammar is the discipline that studies the ________ and ________ of language. Answer: structure; rules
7. Pragmatics is the discipline that studies the use of language in ________. Answer: context

Putonghua Examination Review Materials: Compilation

Putonghua Proficiency Test, 2018 edition: rules, practice materials and test questions; essentials of the nationally applicable syllabus.

I. Rules. The Putonghua Proficiency Test Grading Standards issued by the State Language Commission are the unified national standard for grading Putonghua proficiency. Putonghua proficiency is divided into three levels and six grades: Levels 1, 2 and 3, each subdivided into grades A and B; Level 1-A is the highest and Level 3-B the lowest. A candidate's Putonghua level is determined by the score obtained in the test.

The grading standards are as follows.

Level 1-A: In reading aloud and free conversation, pronunciation is standard, vocabulary and grammar are error-free, intonation is natural and expression fluent. Total error rate within 3%.

Level 1-B: In reading aloud and free conversation, pronunciation is standard, vocabulary and grammar are error-free, intonation is natural and expression fluent, with occasional slips in the pronunciation or tone of individual characters. Total error rate within 8%.

Level 2-A: In reading aloud and free conversation, initials, finals and tones are basically standard, intonation is natural and expression fluent. Occasional errors occur on a few difficult sounds (retroflex vs. dental sibilants, front vs. back nasal finals, l vs. n, etc.). Vocabulary and grammar errors are rare. Total error rate within 13%.

Level 2-B: In reading aloud and free conversation, some tone values are inaccurate and the articulation of some initials and finals falls short of the target. Difficult sounds cause frequent errors (retroflex vs. dental sibilants, front vs. back nasal finals, l vs. n, fu - hu, z - zh - j, aspirated vs. unaspirated, failure to distinguish i - ü, retained voiced stops and affricates, dropped medials, monophthongized diphthongs, etc.). Dialectal intonation is not obvious, but dialect words and dialect grammar are sometimes used. Total error rate within 20%.

Level 3-A: In reading aloud and free conversation, errors in initials and finals are frequent, difficult sounds go beyond the common range, and many tone values are inaccurate. Dialectal intonation is obvious. There are vocabulary and grammar errors. Total error rate within 30%.

Level 3-B: In reading aloud and free conversation, errors in initials, finals and tones are numerous and dialect features are prominent. Dialectal intonation is obvious. Vocabulary and grammar errors are frequent. Listeners from other regions sometimes cannot understand the candidate. Total error rate within 40%.

Putonghua proficiency is divided into three levels, each with two grades: a score of 97 or above is Level 1-A; 92 or above but below 97 is Level 1-B; 87 or above but below 92 is Level 2-A; 80 or above but below 87 is Level 2-B; 70 or above but below 80 is Level 3-A; 60 or above but below 70 is Level 3-B.
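These score bands map mechanically onto grades, so they are easy to encode. A minimal sketch (the function name is an arbitrary choice; the cut-offs are the ones listed above):

```python
# Minimal sketch: map a Putonghua test score to its level and grade,
# using the cut-offs listed above (97 / 92 / 87 / 80 / 70 / 60).

def putonghua_grade(score):
    bands = [
        (97, "Level 1-A"),
        (92, "Level 1-B"),
        (87, "Level 2-A"),
        (80, "Level 2-B"),
        (70, "Level 3-A"),
        (60, "Level 3-B"),
    ]
    for cutoff, grade in bands:
        if score >= cutoff:
            return grade
    return "Below Level 3-B"

print(putonghua_grade(93.5))  # Level 1-B
```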

A Coursebook of Language Testing, Chapter 3

The interactional/ability approach (交际/能力法)

Three points of emphasis:
Face validity: a test built on the "real-life approach" must give an impression of authenticity.
Predictive validity: the test must be able to predict, from a student's performance in a simulated situation, his or her later language performance in other situations.
Content validity: the content selected for the test must be necessary for the simulated situation in question. Examples: interview-style oral tests; CET-4 and CET-6.

Thank you!
Chapter 3
Ren Yalan, Class 1202

3.3 Authenticity
3.4 Interactiveness

3.3 Authenticity: the meaning of the term
Lyle F. Bachman (the American applied linguist) and Adrian S. Palmer define authenticity as the degree of correspondence between a given language test task and a target language-use task in terms of their characteristics.

Diagram: characteristics of the language test task ↔ authenticity ↔ characteristics of the target language-use task

Task characteristics: observable (formal) characteristics, such as mode, content, and situation; non-observable (qualitative) characteristics, such as the nature of language use and language ability.
Degree of correspondence: correspondence in form; correspondence in nature.

The two main schools in today's language-testing world on test authenticity:
The real-life approach: achieving authenticity by designing tests that replicate concrete language-use situations.
The interactional/ability approach: its hallmark is an emphasis on the salient features of communicative language use, such as the interaction among "language user, situation, and discourse", rather than on any one concrete act of language use. Authenticity need not be reflected in a real-life language-use setting; rather, it should show in whether the test genuinely measures the features that make up language ability. What this approach attends to is language ability, the ability construct, in other words construct validity.

Review Materials for the Putonghua Proficiency Test

The Putonghua Proficiency Test is an important assessment of spoken-language ability for many people. Whether for job-hunting, for further study, or simply to improve one's own command of the language, thorough preparation is essential. The review materials and methods are set out in detail below.

I. Know the content and requirements of the test. The Putonghua Proficiency Test has four parts: reading monosyllabic words, reading polysyllabic words, reading a passage aloud, and speaking on an assigned topic.

1. Reading monosyllabic words mainly tests accurate pronunciation of Putonghua initials, finals, and tones. Pronunciation must be clear and accurate, and the tones fully realized.

2. Reading polysyllabic words tests, beyond the pronunciation of individual characters, tone sandhi in connected speech, the neutral tone, rhotacization (erhua), and the like.

3. Reading a passage aloud tests pronunciation and intonation, fluency, and comprehension of the text. Attention should be paid to pausing and linking, stress, and tone of voice.

4. Speaking on an assigned topic is the harder part of the test: the candidate must speak on a given topic within a set time. It mainly tests how standard the language is, fluency, and logical thinking.

II. Choosing review materials

1. Textbooks: choose authoritative Putonghua Proficiency Test textbooks, which usually introduce the content, requirements, and review methods systematically and come with plenty of exercises and model audio recordings.

2. Online courses: many online platforms now offer courses for the test, including video explanations and mock tests, convenient for studying anytime and anywhere.

3. Mobile apps: some dedicated Putonghua-learning apps offer pronunciation practice, test-style assessment, and other functions, letting candidates use scattered free moments for review.

4. Audio materials: listening to broadcasts, news, and audiobooks in standard Putonghua helps cultivate a feel for the language.

III. Specific review methods

1. Pronunciation practice: drill the initials, finals, and tones one by one. Imitate model pronunciation, watch your mouth shape in a mirror, and record yourself for self-checking to improve accuracy.

2. Vocabulary building: learn the readings of common polyphonic characters, easily mispronounced characters, and rare characters, and attend to correct usage and collocation.

3. Reading-aloud training: practice with suitable passages, controlling speed and intonation and conveying the feeling of the text. Listen to a model reading first, imitate it, and gradually form your own reading style.

4. Speaking practice: prepare ideas and outlines for the assigned-topic section in advance. When practicing, keep the language standard, avoid dialect words and dialect grammar, and aim for clear, well-organized expression.

Putonghua Learning Materials

Putonghua, the standard form of Chinese used throughout the country, is China's official language and an important tool of international communication. A good command of Putonghua helps with communication, career development, and cultural exchange alike. To help learners, some high-quality learning materials are recommended below.

I. Textbooks and workbooks

1. The Putonghua Proficiency Test Syllabus and Implementation Measures: the test syllabus issued by the national education authorities, containing the detailed requirements and scoring standards; very helpful for understanding the rules and requirements of the test.

2. A Course in Standard Putonghua Training: a textbook compiled by well-known linguists and education specialists that explains Putonghua pronunciation, intonation, and sound changes comprehensively, with abundant examples and exercises to consolidate the basic norms.

3. A Practical Course in Standard Putonghua: suitable for beginners; rich dialogues, situational practice, and listening exercises help learners get started quickly and use the language flexibly in daily life.

4. The Speak Putonghua series: a set of textbooks written specially for learners from non-Chinese-speaking backgrounds, guiding them through the essentials with lively stories, practical dialogues, and pronunciation training.

II. Online learning resources

1. The national Putonghua Proficiency Test website offers online mock questions and past papers, so learners can gauge and raise their level by working through them.

2. Putonghua oral-training platforms help learners improve spoken expression through simulated situational dialogues and recording comparisons, and provide personalized feedback and suggestions.

3. The national public education resource platform hosts all kinds of Putonghua learning materials, such as textbooks, courseware, and instructional videos, covering learners of different ages and levels.

III. Listening and speaking training

1. Online Putonghua pronunciation tests let learners upload recordings for assessment, identify their pronunciation problems, and correct them.

2. Putonghua radio and news programs improve listening comprehension and phonetic awareness while also conveying information about Chinese society and culture.

3. Short-video tutorials on the major video platforms cover pronunciation, intonation, and oral practice; learners can choose the videos that suit their needs.

An Introduction to Linguistics 3, Chapter 3: Test Questions

Chapter 3 test questions. Define the following terms ( ___ items, 2 points each, ___ points in total):
1. 语音 (speech sound) 2. 音节 (syllable) 3. 音位 (phoneme) 4. 元音 (vowel) 5. 音质 (sound quality) 6. 音高 (pitch) 7. 音强 (intensity) 8. 音长 (duration) 9. 辅音 (consonant) 10. 语音四要素 (the four acoustic properties of speech sounds) 11. 国际音标 (the International Phonetic Alphabet) 12. 音位变体 (allophone) 13. 音素 (phone) 14. 音标 (phonetic symbol) 15. 语流音变 (sound change in connected speech)

Transcribe the following Chinese characters in IPA (10 points in total; this question is compulsory):
1. 春岸桃花水,云帆枫树林。

偷生长避地,适远更沾襟。

(杜甫《南征》)2、清晨入古寺,初日照高林。

竹径通幽处,禅房花木深。

(常建《题破山寺后禅院》)3、闻道船中病,似忧亲弟兄。

信来从水路,身去到柴城。

(贾岛《寄李存穆》)4、楼倚霜树外,镜天无一毫。

南山与秋色,气势两相高。

(杜牧《长安秋望》)5、向晚意不适,驱车登古原。

夕阳无限好,只是近黄昏。

(李商隐《登乐游原》)6、移舟泊烟渚,日暮客愁新。

野旷天低树,江清月近人。

(孟浩然《宿建德江》)7、稍怜公事退,复遇夕阳时。

北朔霜凝竹,南山水入篱。

(贾岛《酬鄠县李廓少府见寄》)8、小树开朝径,长茸湿夜烟,柳花惊雪浦,麦雨涨溪田。

(李贺《南园十三首》)9、少孤为客早,多难识君迟。

掩泪空相向,风尘何处期?(卢纶《送李端》)10、偶来松树下,高枕石头眠。

山中无历日,寒尽不知年。

(李膺《隐逸》)11、岭外音书绝,经冬复立春。

近乡情更怯,不敢问来人。

(李频《渡汉江》)12、君自故乡来,应知故乡事。

来日绮窗前,寒梅著花未?(王维《杂诗》)13、千山鸟飞绝,万径人踪灭。

孤舟蓑笠翁,独钓寒江雪。

(柳宗元《江雪》)14、独坐幽篁里,弹琴复长啸。

深林人不知,明月来相照。

(王维《竹里馆》)15、炉火照天地,红星乱紫烟。

赧郎明月夜,歌曲动寒川。

(李白《秋浦歌》)16、水面细风生,菱歌慢慢声。

客亭临小市,灯火夜妆明。

(选自王建《江馆》)单项选择题(在每小题列出的四个选项中只有一个选项是符合题目要求的,请将正确选项前的字母填在题后的括号内。

Multiple choice (choose the one correct option from the four given and write its letter in the brackets; ___ items, 1 point each, ___ points in total). 1. Which place of articulation is absent from the sound system of Putonghua? ( )

Language Testing and Its Methods: Review Outline

Passive vocabulary refers to the words students should be able to recognize when reading.
2. The validity, reliability, and discrimination of a vocabulary test rest mainly on how representative the sampled words are and how they are graded by level.
3. Item types for vocabulary testing: matching, substitution, gap-filling. Word use involves three factors: meaning, collocation, and grammar.
4. Common item types for grammar testing: multiple choice, error identification, gap-filling, sentence transformation, matching.
5. Methods of testing reading ability: true/false judgment, sentence completion, short answers, ordering sentences into a paragraph, multiple choice, cloze.
Requirements for writing multiple choice items: the language should be correct, idiomatic, appropriate, and concise; avoid bias in the items; ensure the options are compatible with the stem; keep the options as similar to one another as possible; neither the stem nor the distractors should give clues to the answer; avoid patterned items.
4. Gap-filling items measure the ability to use language rather than merely to recognize it, so their validity is high. The integrative gap-filling type is also known as cloze.
3. Drawing up a test specification table: it covers the weighting of the test content, the item types, the number of items, and the allocation of testing time.
The third generation: communicative language testing; Bachman; CLA (communicative language ability).
2. Bachman's model of language testing.
Characteristics: a fuller and deeper view of language ability; it spells out the relation between the testing instrument and the context of the target language.
Components: language competence; strategic competence; psychophysiological mechanisms.
Language competence: organizational competence (grammatical competence, textual competence); language use competence (semantic competence, functional competence, sociolinguistic competence).
Strategic competence: assessment strategies; goal-setting strategies; planning strategies; plan-execution strategies.
3. Scoring methods for oral tests: the analytic method and the holistic method.
Chapter 11: How to Design Writing Tests
1. The greatest strength of a writing test is its high validity: it examines not only candidates' productive skills but also their receptive skills, it can test all levels and domains of language, and it has a good backwash effect on teaching. Its weakness is low reliability, since it is a subjective test and scoring cannot be made fully objective.
2. Scoring methods for writing tests: the mechanical method, the impression method, the analytic method.
The steeper the curve, the more concentrated the scores; the flatter the curve, the more spread out.

Language Testing

I. True or false
1. The cloze test is the classic item type of the integrative approach to testing. (True)
2. An end-of-term examination is a norm-referenced test (Norm-Referenced Test). (True)
3. For ordinal data, the mean can be computed but ratios cannot. (False)
4. When a set of data is normally distributed, the probability of falling within two standard deviations of the mean is about 95%. (True)
5. When the mean is smaller than the mode, the distribution is positively skewed (tail to the left). (False)
6. According to information theory, the higher the degree of certainty, the greater the amount of information. (False)
7. The cognitive-processing view of discourse information holds that, because already-known content cannot remove the hearer's uncertainty, it carries information value. (False)
8. A test specification is a programmatic document that describes the requirements and stipulations of a test in general terms at the macro level. (True)
9. The holistic view of validity holds that test validity can be expressed as a single abstract number, the validity coefficient. (False)
10. The cumulative view of validity holds that test validity is the sum of the validities of each stage. (True)
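Item 4's figure is easy to verify numerically: for a standard normal variable, P(|Z| ≤ 2) = erf(2/√2) ≈ 0.954. A one-line check, sketched with only the standard library:

```python
import math

# P(|Z| <= 2) for a standard normal variable: erf(2 / sqrt(2)) ~ 0.9545.
print(math.erf(2 / math.sqrt(2)))  # ~0.954, i.e. about 95%
```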

II. Fill in the blanks
1. The "three-P model" of washback takes its name from three English initials: perception, procedure, product.
2. The full English name of the TOEFL test is Test of English as a Foreign Language.
3. The full English name of the American ETS is Educational Testing Service.
4. From the viewpoint of cybernetics, animals, machines, and societies are all self-regulating systems, all of which maintain a state of equilibrium through information feedback.
5. The cognitive-processing view of discourse information proposes defining the semantic categories of discourse information in terms of the attributes, relations, and processes of things.
6. In the period of the single view of validity, test validity was called criterion-related validity, and validation consisted mainly of checking the degree of correlation between the current test and each external test.
7. In the period of the classificatory view of validity, validity fell into three main types: criterion-related validity, content validity, and construct validity.
8. The holistic view of validity denies that validity comes in types; it stresses the integrated, multidimensional character of validity and holds that validity is the degree to which evidence and theory support the interpretations entailed by the uses of a test.
9. In the three major models of test argumentation (the interpretive argument, evidence-centered design, and the test-use argument), the "claim" is in fact not a statement but a hypothesis.
10. The presence or absence of a qualifier is the fundamental difference between a Toulmin claim and the conclusion of a syllogism: the former has one and is therefore a reasoned conclusion; the latter does not and is therefore an absolute assertion.

1. Analysis 2. Hypothesis 3. Qualifier 4. Construct 5. Specification 6. Task 7. Response 8. Score 9. Consequence 10. Criterion

III. Practical task (30 points). The passage below is excerpted from an article about "homeschooling".

Language Testing Study Materials 2

Chapter 2: The Validity of Language Testing (语言测试的效度)

What is validity?
A test is said to be valid if it measures accurately what it is intended to measure. Validity has a number of aspects:
• Content validity (内容效度)
• Criterion-related validity (效标关联效度)
• Construct validity (构念效度)
• Face validity (表面效度)
• The use of validity (效度的用途)

Content validity
A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned. For example, a grammar test must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity: the test would have content validity only if it included a proper sample of the relevant structures. What the relevant structures are will depend upon the purpose of the test. In order to judge whether or not a test has content validity, we need a specification of the skills or structures, etc. that it is meant to cover. Such a specification should be made at a very early stage in test construction. It is not to be expected that everything in the specification will always appear in the test; there may simply be too many things for all of them to appear in a single test. But it will provide the test constructor with the basis for making a principled selection of elements for inclusion in the test. A comparison of test specification and test content is the basis for judgments as to content validity. Ideally these judgments should be made by people who are familiar with language teaching and testing but who are not directly concerned with the production of the test in question.

What is the importance of content validity?
First, the greater a test's content validity, the more likely it is to be an accurate measure of what it is supposed to measure. A test in which major areas identified in the specification are under-represented, or not represented at all, is unlikely to be accurate. Secondly, such a test is likely to have a harmful backwash effect: areas which are not tested are likely to become areas ignored in teaching and learning. Too often the content of tests is determined by what is easy to test rather than what is important to test. The best safeguard against this is to write full test specifications and to ensure that the test content is a fair reflection of these.

Discussion
Case 1: Do you think an achievement test for intermediate learners that contains just the same set of structures as one for advanced learners has content validity? No.
Case 2: About 20 years ago, candidates in a university entrance examination in America were given the composition topic: "Is photography an art or a science? Discuss." Do you think this test has validity? No.
Case 3: "The intention of other people concerned, such as the Minister of Defense, to influence the government leaders to adapt their policy to fit in with the demands of the right wing, cannot be ignored." What is the subject of "cannot be ignored"? A. the intention B. other people concerned C. the Minister of Defense D. the demands of the right wing. What does this item want to measure: reading comprehension or sentence structure?

Criterion-related validity
Another approach to test validity is to see how far results on the test agree with those provided by some independent and highly dependable assessment of the candidate's ability. This independent assessment is thus the criterion measure against which the test is validated. There are essentially two kinds of criterion-related validity: concurrent validity (共时效度) and predictive validity (预测效度).

What is concurrent validity?
Concurrent validity is established when the test and the criterion are administered at about the same time. For example, course objectives call for an oral component as part of the final achievement test. The objectives may list a large number of 'functions' which students are expected to perform orally, to test all of which might take 45 minutes per student. This could well be impractical, and perhaps it is felt that only ten minutes can be devoted to each student for the oral component. The question then arises: can such a ten-minute session give a sufficiently accurate estimate of the student's ability with respect to the functions specified in the course objectives? Is it a valid measure? From the point of view of content validity, this will depend on how many of the functions are tested in the component, and how representative they are of the complete set of functions included in the objectives. Every effort should be made when designing the oral component to give it content validity. Once this has been done, however, we can go further: we can attempt to establish the concurrent validity of the component.

How is this done? We would choose at random a sample of all the students taking the test. These students would then be subjected to the full 45-minute oral component necessary for coverage of all the functions, using perhaps four scorers to ensure reliable scoring. This would be the criterion test against which the shorter test would be judged. The students' scores on the full test would be compared with the ones they obtained on the ten-minute session, which would have been conducted and scored in the usual way, without knowledge of their performance on the longer version. If the comparison between the two sets of scores reveals a high level of agreement, then the shorter version of the oral component may be considered valid, inasmuch as it gives results similar to those obtained with the longer version. If, on the other hand, the two sets of scores show little agreement, the shorter version cannot be considered valid; it cannot be used as a dependable measure of achievement with respect to the functions specified in the objectives. Of course, if ten minutes really is all that can be spared for each student, then the oral component may be included for the contribution that it makes to the assessment of students' overall achievement and for its backwash effect. But it cannot be regarded as an accurate measure in itself.

'A high level of agreement'; 'little agreement': how is the level of agreement measured? There are standard procedures for comparing sets of scores, which yield a 'validity coefficient', a mathematical measure of similarity. Perfect agreement between two sets of scores will result in a validity coefficient of 1; total lack of agreement will give a coefficient of zero. To interpret a coefficient, it is best to square it. A coefficient of 0.7 between the two oral tests, squared, gives 0.49; converted to a percentage, 49 per cent. On the basis of this, we can say that the scores on the short test predict 49 per cent of the variation in scores on the longer test. In broad terms, there is almost 50 per cent agreement between one set of scores and the other. A coefficient of 0.5 would signify 25 per cent agreement; a coefficient of 0.8 would indicate 64 per cent agreement. It is important to note that a 'level of agreement' of 50 per cent does not mean that 50 per cent of the students would each have equivalent scores on the two versions. We are dealing with an overall measure of agreement that does not refer to the individual scores of students.

What is predictive validity?
Predictive validity concerns the degree to which a test can predict candidates' future performance. For example, how well could a proficiency test predict a student's ability to cope with a graduate course at a British university? The choice of criterion measure raises interesting issues. Should we rely on the subjective and untrained judgments of supervisors? How helpful is it to use final outcome as the criterion measure when so many factors other than ability in English (such as subject knowledge, intelligence, motivation, health and happiness) will have contributed to every outcome? Where outcome is used as the criterion measure, a validity coefficient of around 0.4 (only 16 per cent agreement) is about as high as one can expect. This is partly because of the other factors, and partly because those students whose English the test predicted would be inadequate are not normally permitted to take the course, and so the test's (possible) accuracy in predicting problems for those students goes unrecognized. As a result, a validity coefficient of this order is generally regarded as satisfactory.
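The squaring rule used above is just the coefficient of determination: the proportion of score variance the two measures share. Restated in the notation of the examples (a standard statistical identity, not a formula quoted from the text):

```latex
r^2 = 0.7^2 = 0.49 \approx 49\%,\qquad
0.5^2 = 0.25 = 25\%,\qquad
0.8^2 = 0.64 = 64\%
```

This is also why the predictive-validity figure of 0.4 corresponds to only 16 per cent agreement.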
It would then be a matter of comparing the number of misplacements (and their effect on teaching and learning) with the cost of developing and administering a test which would place students more accurately.What criterion measure should we choose?Should we choose an assessment of the student’s English as perceived by his or her supervisor at the university, or the outcome of the course (pass/fail etc.)?Construct validityA test, part of a test, or a testing technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure.The word ‘construct’ refers to any underlying ability (or trait) which is hypothesized in a theory of language ability.One might hypothesize, for example, that the ability to read involves a number of sub-abilities, such as the ability to guess the meaning ofunknown words from the context in which they are met.It would be a matter of empirical research to establish whether or not such a distinct ability existed and could be measured. If we attempted to measure that ability in a particular test,It would be a matter of empirical research to establish whether or not such a distinct ability existed and could be measured. If we attempted to measure that ability in a particular test, then that part of the test would have construct validity only if we were able to demonstrate that we were indeed measuring just that ability.The direct measurement of writing ability should not cause us too much concern. Once we try to measure such an ability indirectly, however, we can no longer take for granted what we are doing. We need to look to a theory of writing ability for guidance as to the form an indirect test should take, its content and techniques.Construct validation is a research activity, the means by which theories are put to the test and are confirmed, modified, or abandoned. It is through construct validation that language testing can be put on a sounder, more scientific footing.But it will not all happen overnight; there is a long way to go. In the meantime, the practical language tester should try to keep abreast of what is known. When in doubt, where it is possible, direct testing of abilities is recommended.Face validityA test is said to have face validity if it looks as if it measures what it is supposed to measure.e.g.A test which pretended to measure pronunciation ability but which did not require the candidate to speak (and there have been some) might be thought to lack face validity.What’s the importance of face validity?A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers.It may simply not be used; and if it is used, the candidates’ reaction to it may mean that they do not perform on it in a way that truly reflects their ability.The use of validityEvery effort should be made in constructing tests to ensure content validity. Where possible, the tests should be validated empirically against some criterion. Particularly where it is intended to use indirect testing, reference should be made to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be used (this may often result in disappointment --- another reason for favoring direct testing!).Any published test should supply details of its validation, withoutwhich its validity (and suitability) can hardly be judged by a potential purchaser. 
Construct validity

A test, part of a test, or a testing technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure. The word 'construct' refers to any underlying ability (or trait) which is hypothesized in a theory of language ability.

One might hypothesize, for example, that the ability to read involves a number of sub-abilities, such as the ability to guess the meaning of unknown words from the context in which they are met. It would be a matter of empirical research to establish whether or not such a distinct ability existed and could be measured. If we attempted to measure that ability in a particular test, then that part of the test would have construct validity only if we were able to demonstrate that we were indeed measuring just that ability.

The direct measurement of writing ability should not cause us too much concern. Once we try to measure such an ability indirectly, however, we can no longer take for granted what we are doing. We need to look to a theory of writing ability for guidance as to the form an indirect test should take, its content and techniques.

Construct validation is a research activity, the means by which theories are put to the test and are confirmed, modified, or abandoned. It is through construct validation that language testing can be put on a sounder, more scientific footing. But it will not all happen overnight; there is a long way to go. In the meantime, the practical language tester should try to keep abreast of what is known, and when in doubt, where it is possible, direct testing of abilities is recommended.

Face validity

A test is said to have face validity if it looks as if it measures what it is supposed to measure.

e.g.

A test which pretended to measure pronunciation ability but which did not require the candidate to speak (and there have been some) might be thought to lack face validity.

What is the importance of face validity?

A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers. It may simply not be used; and if it is used, the candidates' reaction to it may mean that they do not perform on it in a way that truly reflects their ability.

The use of validity

Every effort should be made in constructing tests to ensure content validity. Where possible, the tests should be validated empirically against some criterion. Particularly where it is intended to use indirect testing, reference should be made to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be used (this may often result in disappointment --- another reason for favoring direct testing!).

Any published test should supply details of its validation, without which its validity (and suitability) can hardly be judged by a potential purchaser. Tests for which validity information is not available should be treated with caution.
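As a closing illustration of the kind of empirical evidence construct validation draws on, one elementary first step is to examine how scores on a hypothesized sub-ability pattern against scores on related subtests. The sketch below is purely illustrative: the subtest names and scores are invented, and a serious construct validation study would go well beyond inspecting a correlation matrix.

import numpy as np

# Hypothetical scores of eight candidates on three reading subtests
guessing_from_context = [8, 6, 9, 4, 7, 5, 8, 6]
skimming = [7, 7, 8, 5, 6, 4, 9, 5]
vocabulary = [9, 5, 8, 3, 7, 6, 8, 5]

# Inspect the pattern of inter-subtest correlations: if guessing meaning
# from context were a distinct sub-ability, its correlations with the
# other subtests should be noticeably weaker than theirs with each other.
matrix = np.corrcoef([guessing_from_context, skimming, vocabulary])
print(np.round(matrix, 2))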


Chapter 3 (第三章) The Reliability of Testing (测试的信度)

How to make tests more reliable

1. Take enough samples of behavior
To be satisfied that we had a really reliable measure of an archer's ability, we should want to see a large number of shots at the target, not just one. The same is true for language testing: it has been demonstrated empirically that the addition of further items will make a test more reliable. (A formula for quantifying this effect is sketched just before the section on scorer reliability below.)

The additional items, however, should be independent of each other and of the existing items.

e.g.

A reading test asks the question: "Where did the thief hide the jewels?" Suppose an additional item following that took the form: "What was unusual about the hiding place?" Would it make a full contribution to an increase in the reliability of the test?

No. Because it is hardly possible for someone who got the original question wrong to get the supplementary question right, we do not get an additional sample of their behavior, and the reliability of our estimate of their ability is not increased. Each additional item should as far as possible represent a fresh start for the candidate.

Does this mean that the longer a test is, the more reliable it will be?

It is important to make a test long enough to achieve satisfactory reliability, but it should not be made so long that the candidates become so bored or tired that the behavior they exhibit becomes unrepresentative of their ability.

2. Do not allow candidates too much freedom

In general, candidates should not be given a choice, and the range over which possible answers might vary should be restricted.

Compare the following writing tasks:

a) Write a composition on tourism.
b) Write a composition on tourism in this country.
c) Write a composition on how we might develop the tourist industry in this country.
d) Discuss the following measures intended to increase the number of foreign tourists coming to this country:
i) more/better advertising and/or information (where? what form should it take?);
ii) improved facilities (hotels, transportation, communication etc.);
iii) training of personnel (guides, hotel managers etc.).

The successive tasks impose more and more control over what is written; the fourth task is likely to be a much more reliable indicator of writing ability than the first. But in restricting the students we must be careful not to distort too much the task that we really want to see them perform.

3. Write unambiguous items

It is essential that candidates should not be presented with items whose meaning is not clear or to which there is an acceptable answer which the test writer has not anticipated. The best way to arrive at unambiguous items is, having drafted them, to subject them to the critical scrutiny of colleagues, who should try as hard as they can to find alternative interpretations to the ones intended.
4. Provide clear and explicit instructions

This applies both to written and to oral instructions. If it is possible for candidates to misinterpret what they are asked to do, then on some occasions some of them certainly will. A common fault of tests written for the students of a particular teaching institution is the supposition that the students all know what is intended by carelessly worded instructions. The frequency of the complaint that students are unintelligent, have been stupid, or have willfully misunderstood what they were asked to do reveals that the supposition is often unwarranted. Test writers should not rely on the students' powers of telepathy to elicit the desired behavior. Here too, the best means of avoiding problems is the use of colleagues to criticize drafts of instructions (including those which will be spoken). Spoken instructions should always be read from a prepared text in order to avoid introducing confusion.

5. Ensure that tests are well laid out and perfectly legible

Too often, institutional tests are badly typed (or handwritten), have too much text in too small a space, and are poorly reproduced. As a result, students are faced with additional tasks which are not ones meant to measure their language ability. Their variable performance on these unwanted tasks will lower the reliability of the test.

6. Make sure candidates are familiar with the format and testing techniques

If any aspect of a test is unfamiliar to candidates, they are likely to perform less well than they would otherwise. For this reason, every effort must be made to ensure that all candidates have the opportunity to learn just what will be required of them. This may mean the distribution of sample tests (or of past test papers), or at least the provision of practice materials in the case of tests set within teaching institutions.

7. Provide uniform and non-distracting conditions of administration

The greater the differences between one administration of a test and another, the greater the differences one can expect between a candidate's performance on the two occasions. Great care should be taken to ensure uniformity.

e.g.

Timing should be specified and strictly adhered to. The acoustic conditions should be similar for all administrations of a listening test, and every precaution should be taken to maintain a quiet setting with no distracting sounds or movements.
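Before turning to scorer reliability, the first suggestion above (take enough samples of behavior) can be quantified with the Spearman-Brown prophecy formula, which predicts the reliability of a lengthened (or shortened) test from its current reliability, on the assumption that the added items are independent and of the same kind as the existing ones. The function below is a minimal sketch; the starting figures are invented.

def spearman_brown(reliability, length_factor):
    """Predict test reliability after changing its length by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# e.g. a 50-item test with reliability 0.80, doubled to 100 independent items
print(round(spearman_brown(0.80, 2), 2))  # prints 0.89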
How to obtain scorer reliability

1. Use items that permit scoring which is as objective as possible

This may appear to be a recommendation to use multiple choice items, which permit completely objective scoring. It is not. While it would be mistaken to say that multiple choice items are never appropriate, there are many circumstances in which they are quite inappropriate; what is more, good multiple choice items are notoriously difficult to write and always require extensive pretesting.

An alternative to multiple choice is the open-ended item which has a unique, possibly one-word, correct response which the candidates produce themselves. This too should ensure objective scoring, but in practice problems with such matters as spelling, which can make a candidate's meaning unclear, often make demands on the scorer's judgment. The longer the required response, the greater the difficulties of this kind. One way of dealing with this is to structure the candidate's response by providing part of it.

e.g.

The open-ended question "What was different about the results?" may be designed to elicit the response "Success was closely associated with high motivation." This is likely to cause problems for scoring. Greater scorer reliability will probably be achieved if the question is followed by:

_____ was more closely associated with _____.

2. Make comparisons between candidates as direct as possible

This reinforces the suggestion already made that candidates should not be given a choice of items and that they should be limited in the way that they are allowed to respond. Scoring compositions written on a single topic will be more reliable than scoring compositions for which candidates were allowed to choose from six topics, as has been the case in some well-known tests.

3. Provide a detailed scoring key

This should specify acceptable answers and assign points for partially correct responses. For high scorer reliability the key should be as detailed as possible in its assignment of points. It should be the outcome of efforts to anticipate all possible responses and should have been subjected to group criticism. (This advice applies only where responses can be classed as partially or totally 'correct', not in the case of compositions, for instance.)

4. Train scorers

This is especially important where scoring is more subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score compositions from past administrations accurately. After each administration, patterns of scoring should be analyzed; individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.

5. Agree acceptable responses and appropriate scores at the outset of scoring

A sample of scripts should be taken immediately after the administration of the test. Where there are compositions, archetypal representatives of different levels of ability should be selected, and only when all scorers are agreed on the scores to be given to these should real scoring begin. For short-answer questions, the scorers should note any difficulties they have in assigning points (the key is unlikely to have anticipated every relevant response) and bring these to the attention of whoever is supervising that part of the scoring. Once a decision has been taken as to the points to be assigned, the supervisor should convey it to all the scorers concerned.

6. Identify candidates by number, not name

Scorers inevitably have expectations of candidates that they know, and except in purely objective testing this will affect the way that they score. Studies have shown that even where the candidates are unknown to the scorers, the name on a script (or a photograph) will make a significant difference to the scores given.

e.g.

A scorer may be influenced by the gender or nationality of a name into making predictions which can affect the score given. Identifying candidates only by number will reduce such effects.

7. Employ multiple, independent scoring

As a general rule, and certainly where testing is subjective, all scripts should be scored by at least two independent scorers. Neither scorer should know how the other has scored a test paper. Scores should be recorded on separate score sheets and passed to a third, senior colleague, who compares the two sets of scores and investigates discrepancies.
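One simple way to operationalize the last suggestion is to compare the two scorers' marks script by script and flag any pair differing by more than an agreed tolerance, for the senior colleague to adjudicate. The marks and the tolerance in the sketch below are invented for illustration.

# Hypothetical marks given by two independent scorers to eight scripts
scorer_a = [14, 11, 16, 9, 13, 15, 10, 12]
scorer_b = [15, 10, 12, 9, 14, 15, 13, 11]
TOLERANCE = 2  # hypothetical maximum acceptable difference

for script, (a, b) in enumerate(zip(scorer_a, scorer_b), start=1):
    if abs(a - b) > TOLERANCE:
        print(f"script {script}: scorer A gave {a}, scorer B gave {b} -- investigate")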
Reliability and validity

To be valid, a test must provide consistently accurate measurements; it must therefore be reliable. A reliable test, however, may not be valid at all. For example, as a writing test we might require candidates to write down the translation equivalents of 500 words in their own language. This could well be a reliable test, but it is unlikely to be a valid test of writing.

In our efforts to make tests reliable, we must be wary of reducing their validity. This depends in part on what exactly we are trying to measure by setting the task. If we are interested in candidates' ability to structure a composition, then it would be hard to justify providing them with a structure in order to increase reliability. At the same time, we would still try to restrict candidates in ways which would not render their performance on the task invalid.

There will always be some tension between reliability and validity. The tester has to balance gains in one against losses in the other.
