Language Testing Week 3 Reliability
Applied Linguistics: Key Points of Language Testing Theory

In applied linguistics, language testing theory is an important branch. It is crucial for assessing learners' language ability, guiding teaching practice, and advancing language education. The following organizes some of its key points.
I. Definition and Purposes of Language Testing
A language test is a means of measuring and evaluating a learner's language ability. Its main purposes include:
1. Providing a basis for educational decisions, e.g. students' promotion, retention, or graduation.
2. Evaluating instruction, helping teachers gauge the effectiveness of their methods and students' progress.
3. Giving students feedback on their proficiency and weaknesses, so that they can improve their learning strategies.
II. Types of Language Tests
1. Proficiency tests measure a candidate's overall command of a language, regardless of prior learning history or any particular course content. Familiar examples are IELTS and TOEFL.
2. Achievement tests check the language knowledge and skills mastered in a particular course or stage of study and are closely tied to what has been taught, e.g. school final exams and unit quizzes.
3. Diagnostic tests identify the specific problems and weak points in a candidate's language learning, so that subsequent teaching and learning can be targeted accordingly.
4. Aptitude tests predict a candidate's potential and capacity for learning a language, rather than assessing current proficiency.
III. Criteria for Evaluating Test Quality
1. Validity is the degree to which a test accurately measures the language ability or knowledge it is intended to measure. Validity includes content validity, construct validity, and predictive validity, among others.
- Content validity: whether the test content covers the language skills and knowledge points to be examined.
- Construct validity: whether the test results are consistent with the theoretical structure of language ability.
- Predictive validity: whether test scores effectively predict candidates' future performance in language learning or real language use.
2. Reliability reflects the stability and consistency of test results. It includes test-retest reliability, parallel-forms reliability, and split-half reliability, among others.
- Test-retest reliability: the correlation between the results of the same test administered to the same group of candidates at two different times.
- Parallel-forms reliability: the correlation between the results obtained when two similar but not identical forms of a test are administered to the same group of candidates.
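Both of these estimates come down to correlating two sets of scores from the same candidates, most commonly with the Pearson product-moment coefficient. A minimal sketch in Python, using invented scores purely for illustration:

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Hypothetical: the same five candidates sit the same test twice.
first_sitting = [72, 55, 88, 43, 66]
second_sitting = [75, 51, 90, 40, 70]
print(round(pearson_r(first_sitting, second_sitting), 2))  # test-retest estimate
```

The same function applied to the scores from two parallel forms gives the parallel-forms estimate.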
Test of Language Development, Third Edition (TOLD-3)

The Test of Language Development, Third Edition (TOLD-3) is an instrument for measuring children's language development. Its design covers a broad range of language skills, including spoken language development, auditory processing, and oral expression. The assessment generally uses the item formats most common in psychological testing, so that examinees find the procedure easy to understand and engaging.

I. Overview
TOLD-3 was developed by Phyllis Newcomer and Donald Hammill to measure the developmental level of children's language skills. It can be used to diagnose developmental language difficulties and to detect abnormalities in language development. TOLD-3 can also be extended with an analysis of the child's home language environment, providing a more comprehensive survey of language development.
II. Assessment method
TOLD-3 typically consists of two parts. The first is report-based: teacher and parent questionnaires grounded in the child's home, social, and cultural environment. The second is direct testing, consisting of examiner-child activities and measures of language and grammar. Beyond the report forms, TOLD-3 draws on language-assessment techniques that have been shown to be valid, such as vocabulary tasks, structured games, and simulated speech-behavior tasks.

III. Results
TOLD-3 yields results that are detailed, dependable, and comparable, giving families a useful frame of reference for gauging a child's level in each language-skill area and tracking progress in language development. The results can help family members, teachers, and counselors work with the family to build a specific plan for improving the examinee's language development.

IV. Wide use
TOLD-3 has been widely used in schools around the world; it has been reported in well-known psychology, linguistics, and education journals and is regarded as an effective quantitative measure. Its main strengths are that it is easy to use and inexpensive, and that results across its multiple test areas can be seen at a glance, showing clearly where the examinee stands in a given area of development.

V. Age range
TOLD-3 differs from some other common language-development tests in providing assessment tools targeted at different age bands, covering roughly ages 4 through 12 across its primary and intermediate forms.
Language Testing Theory

Language assessment: definition
Anastasi (1982) holds that a test is essentially "an objective, standardized measure of a particular ability of the person tested". Carroll regards a test as "a procedure designed to elicit certain behavior from which one can make inferences about certain characteristics of an individual".
II. English Language Testing Theory
(1) Types of English language tests
Hughes (1989: 9-19) classifies English tests into five broad categories according to test purpose, testing method and mode, item type, the standard by which scores are interpreted, and the scoring method.

1. Classification by test purpose
(1) Proficiency tests: a language proficiency test is designed to test people's ability in a language.
(2) Achievement tests: an achievement test examines how successful candidates have been at a given stage, or the final stage, of learning English.
(3) Diagnostic tests: a diagnostic test identifies students' strengths and weaknesses, in order to determine what kind of teaching is needed.
(4) Aptitude tests: an aptitude test is not based on any syllabus; its purpose is to check whether the candidate has the potential to learn a given language.

2. Classification by testing method and mode
(1) Direct testing: a test that directly examines a particular aspect of students' language ability.
(2) Indirect testing: an indirect test reveals an aspect of students' language ability by testing an underlying skill associated with it.

3. Classification by item type: discrete-point tests test one point at a time; each item tests one particular grammatical structure or the like, and such testing is indirect.

4. Classification by the standard for interpreting scores
(1) Norm-referenced testing.
(2) Criterion-referenced testing: a test that judges performance against some specified standard of language ability.
Common English Terms in Software Testing

I. By test type: smoke testing 冒烟测试; functional testing 功能测试; user interface (UI) testing UI测试; performance testing 性能测试; automated testing 自动化测试; stress testing 压力测试; load testing 负载测试; concurrency testing 并发测试; unit testing 单元测试; integration testing 集成测试; system testing 系统测试; acceptance testing 验收测试; regression testing 回归测试; alpha testing alpha测试 (testing by users from outside the company in a simulated environment at the developer's site); beta testing beta测试 (testing by users in the real environment of use, with developers absent; also called field testing); black-box testing 黑盒测试; white-box testing 白盒测试; gray-box testing 灰盒测试; ad-hoc testing 随机测试; compatibility testing 兼容性测试; localization testing 本地化测试; internationalization testing 国际化测试; portability testing 可移植性测试; pilot testing 引导测试; installation testing 安装测试; documentation testing 文档测试; configuration testing 配置测试; reliability testing 可靠性测试; volume testing 容量测试; security testing 安全性测试; exploratory testing 探索性测试; incremental testing 增量测试; interface testing 接口测试; interoperability testing 互操作性测试; maintenance testing 维护测试; robustness testing 健壮性测试; static testing 静态测试; agile testing 敏捷测试; bottom-up testing 自底向上测试; exhaustive testing 穷尽测试; confirmation testing 确认测试; conformance testing 一致性测试.

II. By test process: software requirement specification 需求规格说明; test specification 测试规格说明; phase test plan 阶段测试计划; test plan 测试计划; test suite 测试套件; statement coverage 语句覆盖; decision coverage 判定覆盖; test case 测试案例; requirement tracking matrix 需求矩阵; entry criteria 入口准则; exit criteria 出口准则; expected outcome 预期结果; actual outcome 实际结果; formal review 正式评审; informal review 非正式评审; incident logging 事件日志; input 输入; output 输出; outcome 结果; baseline 基线; module 模块; operational environment 运行环境; priority 优先级; deliverable 交付物; reviewer 评审人; test cycle 测试周期; test data 测试数据; test environment 测试环境; test execution 测试执行; test item 测试项; test monitoring 测试监控; test object 测试对象; test report 测试报告; test script 测试脚本; test strategy 测试策略; client 客户端; server 服务器; browser 浏览器.

III. Bug-related: bug 缺陷; bug report 缺陷报告; error 错误; code 代码; condition 条件; defect tracking 缺陷跟踪; pass 通过; failed 失败; memory leak 内存泄漏; path 路径; risk 风险; crash 崩溃; debug 调试; deployment 部署; exception 异常.

IV. Tool-related: replay 回放; cause-effect graph 因果图; compiler 编译器; configuration management tool 配置管理工具; daily build 每日构建; error guessing 错误推测; structured query language (SQL) 结构化查询语句.

V. Other: capability maturity model (CMM) 能力成熟度模型; quality control 质量控制; quality assurance 质量保证.
Important English Vocabulary (Ocean University of China, Advanced English Testing: Theory and Practice)
valid 有效的; reliable 可靠的; reliability 可靠性; miserable 痛苦的; backwash (the effect of testing on teaching and learning) 反拨效应; interaction 相互作用; ambiguous 模糊不清的, 引起歧义的; feedback 反馈; stakeholders 利益相关者; practical 实际的; beneficial backwash.

Tests categorized in terms of purposes/uses:
- Proficiency tests: to measure language proficiency, i.e. people's ability in a language.
- Achievement tests: to discover how successful students have been in achieving the objectives of a course of study (e.g. final exams).
- Diagnostic tests: to diagnose students' strengths and weaknesses, to identify what they know and what they don't know.
- Placement tests (分级考试): to assist the placement of students by identifying the stage, level, or part of a teaching program most appropriate to their ability.

Categorized in terms of approaches to test construction or testing techniques:
- Direct vs. indirect testing. Direct tests require the candidates to perform precisely the skill we wish to measure, e.g. asking them to write when we want to test writing skills; an indirect test might instead ask them to identify errors in a sentence. Direct testing is preferred, and we should sample widely, e.g. two compositions in TOEFL and IELTS.
- Discrete-point vs. integrative testing. A discrete-point test "chops" language into smaller units, such as pronunciation, grammar, vocabulary, etc., and is more likely to use multiple choice; an integrative test requires candidates to use many elements of the language to complete a task, such as translation, composition, dictation, or a cloze test.

Categorized in terms of the interpretation of scores:
- Norm-referenced vs. criterion-referenced testing (reference: n. the state of being related or referred; v. to supply references to). In norm-referenced testing, one candidate's performance is related to that of other candidates and/or a norm from pre-tested people of similar background, e.g. the 60th percentile. In criterion-referenced testing, an individual's performance is not compared to that of other candidates, but judged against criteria.

Categorized in terms of scoring methods/process: objective (客观) vs. subjective (主观) testing. Also: computer adaptive testing; communicative language testing.

Types of validity:
- Construct validity
- Content validity (内容效度): a test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc.
with which it is meant to be concerned. For example, a grammar test must be made up of items relating to the knowledge or control of grammar.
- Criterion-related validity (效标关联效度): the degree to which results on the test agree with those provided by some independent and highly dependable assessment of the candidate's ability. Kinds of criterion-related validity:
  - Concurrent validity (同期效度): established when the test and the criterion are administered at about the same time. (Correlation coefficient, 效度系数: a mathematical measure of similarity.)
  - Predictive validity (预测效度): the degree to which a test can predict candidates' future performance.
- Construct validity (构念效度): a test, part of a test, or a testing technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure. "Construct" refers to any underlying (潜在的) ability (or trait) which is hypothesized (假定, 设定) in a theory of language ability.
- Validity in scoring: if a test is to have validity, not only the items but also the way in which the responses are scored must be valid.
- Face validity: a test is said to have face validity if it looks as if it measures what it is supposed to measure.

How to make tests more valid:
- Write explicit specifications for the test, stating the constructs that are to be measured.
- Whenever feasible, use direct testing.
- Make sure that the scoring of responses relates directly to what is being tested.

The use of validity:
1. Every effort should be made in constructing tests to ensure content validity.
2. Any published test should supply details of its validation, without which its validity and suitability can hardly be judged by a potential purchaser.

Factors that can lower validity: unclear directions; difficult reading vocabulary and sentence structure; ambiguity in statements; inadequate time limits; inappropriate level of difficulty; poorly constructed test items; test items inappropriate for the outcomes being measured; tests that are too short; improper arrangement of items (complex to easy?); identifiable patterns of answers; teaching; administration and scoring; students; nature of the criterion.
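Criterion-related validity is likewise usually reported as a correlation: scores on the test are correlated with scores from the independent, dependable assessment. A minimal sketch with invented numbers (the "criterion" column stands in for, say, ratings from an extended oral interview):

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

test_scores = [61, 74, 58, 83, 69, 77, 52, 90]  # scores on the new test
criterion = [60, 70, 55, 85, 65, 80, 50, 88]    # independent, dependable assessment

print(round(correlation(test_scores, criterion), 2))  # concurrent validity estimate
```

Strictly, a Pearson coefficient can range from -1 to 1; validity coefficients in practice are interpreted in the positive part of that range.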
Language Testing Materials 3
Chapter 3: The Reliability of Testing (测试的信度)
- The definition of reliability
- The reliability coefficient
- How to make tests more reliable

What is reliability?
Reliability refers to the trustworthiness and stability of candidates' test results. In other words, if a group of students were given the same test twice at different times, the more similar the two sets of scores, the more reliable the test is said to be.

How to establish the reliability of a test?
It is possible to quantify the reliability of a test in the form of a reliability coefficient. Reliability coefficients allow us to compare the reliability of different tests. The ideal reliability coefficient is 1: a test with a reliability coefficient of 1 would give precisely the same results for a particular set of candidates regardless of when it happened to be administered. A test with a reliability coefficient of zero would give sets of results quite unconnected with each other. It is between the two extremes of 1 and zero that genuine test reliability coefficients are found.

How high should we expect coefficients to be for different types of language tests? Lado says that good vocabulary, structure, and reading tests are usually in the 0.90 to 0.99 range, while auditory comprehension tests are more often in the 0.80 to 0.89 range. A reliability coefficient of 0.85 might be considered high for an oral production test but low for a reading test.

Ways to establish the reliability of a test:
1. Test-retest method: obtain two sets of scores for comparison. The most obvious way of obtaining these is to have a group of subjects take the same test twice.
2. Split-half method: the subjects take the test in the usual way, but each subject is given two scores, one for each half of the test. The two sets of scores are then used to obtain the reliability coefficient as if the whole test had been taken twice. For this method to work, the test must be split into two halves which are genuinely equivalent, through careful matching of items (in fact, where the items have been ordered in terms of difficulty, a split into odd-numbered and even-numbered items may be adequate).
3. Parallel-forms method (alternate-forms method): use two different forms of the same test to measure a group of students consecutively or within a very short time. However, alternate forms are often simply not available.

How to make tests more reliable
As we have seen, there are two components of test reliability: the consistency of candidates' performance from occasion to occasion, and the reliability of the scoring. We begin by suggesting ways of achieving consistent performances from candidates, and then turn to scorer reliability.

1. Take enough samples of behavior
Other things being equal, the more items on a test, the more reliable the test will be. For example, if we wanted to know how good an archer someone was, we wouldn't rely on the evidence of a single shot at the target. That one shot could be quite unrepresentative of their ability.
To be satisfied that we had a really reliable measure of the ability, we should want to see a large number of shots at the target. The same is true for language testing: it has been demonstrated empirically that the addition of further items will make a test more reliable.

The additional items should be independent of each other and of the existing items. For example, a reading test asks: "Where did the thief hide the jewels?" If an additional item took the form "What was unusual about the hiding place?", would it make a full contribution to an increase in the reliability of the test? No. Why not? Because it is hardly possible for someone who got the original question wrong to get the supplementary question right. We do not get an additional sample of their behavior, so the reliability of our estimate of their ability is not increased. Each additional item should, as far as possible, represent a fresh start for the candidate.

Is it true that the longer a test is, the more reliable it will be? It is important to make a test long enough to achieve satisfactory reliability, but it should not be made so long that the candidates become so bored or tired that the behavior they exhibit becomes unrepresentative of their ability.

2. Do not allow candidates too much freedom
In general, candidates should not be given a choice, and the range over which possible answers might vary should be restricted. Compare the following writing tasks:
a) Write a composition on tourism.
b) Write a composition on tourism in this country.
c) Write a composition on how we might develop the tourist industry in this country.
d) Discuss the following measures intended to increase the number of foreign tourists coming to this country:
   i) more/better advertising and/or information (where? what form should it take?);
   ii) improved facilities (hotels, transportation, communication, etc.);
   iii) training of personnel (guides, hotel managers, etc.).
The successive tasks impose more and more control over what is written. The fourth task is likely to be a much more reliable indicator of writing ability than the first. But in restricting the students we must be careful not to distort too much the task that we really want to see them perform.

3. Write unambiguous items
It is essential that candidates not be presented with items whose meaning is unclear, or to which there is an acceptable answer the test writer has not anticipated. The best way to arrive at unambiguous items is, having drafted them, to subject them to the critical scrutiny of colleagues, who should try as hard as they can to find alternative interpretations to the ones intended.
4. Provide clear and explicit instructions
This applies both to written and to oral instructions. If it is possible for candidates to misinterpret what they are asked to do, then on some occasions some of them certainly will. A common fault of tests written for the students of a particular teaching institution is the supposition that the students all know what is intended by carelessly worded instructions. The frequency of the complaint that students are unintelligent, have been stupid, or have willfully misunderstood what they were asked to do reveals that the supposition is often unwarranted. Test writers should not rely on the students' powers of telepathy to elicit the desired behavior. The best means of avoiding problems is the use of colleagues to criticize drafts of instructions (including those which will be spoken). Spoken instructions should always be read from a prepared text in order to avoid introducing confusion.

5. Ensure that tests are well laid out and perfectly legible
Too often, institutional tests are badly typed (or handwritten), have too much text in too small a space, and are poorly reproduced. As a result, students are faced with additional tasks which are not meant to measure their language ability. Their variable performance on these unwanted tasks will lower the reliability of the test.

6. Make sure candidates are familiar with the format and testing techniques
If any aspect of a test is unfamiliar to candidates, they are likely to perform less well than they would otherwise. For this reason, every effort must be made to ensure that all candidates have the opportunity to learn just what will be required of them. This may mean the distribution of sample tests (or of past test papers), or at least the provision of practice materials, in the case of tests set within teaching institutions.

7. Provide uniform and non-distracting conditions of administration
The greater the differences between one administration of a test and another, the greater the differences one can expect in a candidate's performance on the two occasions. Great care should be taken to ensure uniformity. For example, timing should be specified and strictly adhered to; the acoustic conditions should be similar for all administrations of a listening test; and every precaution should be taken to maintain a quiet setting with no distracting sounds or movements.

How to obtain scorer reliability
1. Use items that permit scoring which is as objective as possible
This may appear to be a recommendation to use multiple-choice items, which permit completely objective scoring. It is not. While it would be mistaken to say that multiple-choice items are never appropriate, there are certainly many circumstances in which they are quite inappropriate. What is more, good multiple-choice items are notoriously difficult to write and always require extensive pretesting. An alternative to multiple choice is the open-ended item which has a unique, possibly one-word, correct response which the candidates produce themselves. This too should ensure objective scoring, but in practice problems with such matters as spelling, which may make a candidate's meaning unclear, often make demands on the scorer's judgment.
The longer the required response, the greater the difficulties of this kind. One way of dealing with this is to structure the candidate's response by providing part of it. For example, the open-ended question "What was different about the results?" may be designed to elicit the response "Success was closely associated with high motivation." This is likely to cause problems for scoring. Greater scorer reliability will probably be achieved if the question is followed by: "_____ was more closely associated with _____."

2. Make comparisons between candidates as direct as possible
This reinforces the suggestion already made that candidates should not be given a choice of items and that they should be limited in the way they are allowed to respond. Scoring compositions all written on one topic will be more reliable than if candidates are allowed to choose from six topics, as has been the case in some well-known tests.

3. Provide a detailed scoring key
This should specify acceptable answers and assign points for partially correct responses. For high scorer reliability, the key should be as detailed as possible in its assignment of points. It should be the outcome of efforts to anticipate all possible responses and should have been subjected to group criticism. (This advice applies only where responses can be classed as partially or totally 'correct', not in the case of compositions, for instance.)

4. Train scorers
This is especially important where scoring is more subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score compositions from past administrations accurately. After each administration, patterns of scoring should be analyzed. Individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.

5. Agree on acceptable responses and appropriate scores at the outset of scoring
A sample of scripts should be taken immediately after the administration of the test. Where there are compositions, archetypal representatives of different levels of ability should be selected. Only when all scorers agree on the scores to be given to these should real scoring begin. For short-answer questions, the scorers should note any difficulties they have in assigning points (the key is unlikely to have anticipated every relevant response) and bring these to the attention of whoever is supervising that part of the scoring. Once a decision has been taken as to the points to be assigned, the supervisor should convey it to all the scorers concerned.

6. Identify candidates by number, not name
Scorers inevitably have expectations of candidates that they know. Except in purely objective testing, this will affect the way they score. Studies have shown that even where the candidates are unknown to the scorers, the name on a script (or a photograph) will make a significant difference to the scores given. For example, a scorer may be influenced by the gender or nationality of a name into making predictions which can affect the score given. Identifying candidates only by number will reduce such effects.

7. Employ multiple, independent scoring
As a general rule, and certainly where scoring is subjective, all scripts should be scored by at least two independent scorers. Neither scorer should know how the other has scored a test paper. Scores should be recorded on separate score sheets and passed to a third, senior colleague, who compares the two sets of scores and investigates discrepancies.
Reliability and validity
To be valid, a test must provide consistently accurate measurements; it must therefore be reliable. A reliable test, however, may not be valid at all. For example, as a writing test we might require candidates to write down the translation equivalents of 500 words in their own language. This could well be a reliable test, but it is unlikely to be a valid test of writing.

In our efforts to make tests reliable, we must be wary of reducing their validity. This depends in part on what exactly we are trying to measure by setting the task. If we are interested in candidates' ability to structure a composition, then it would be hard to justify providing them with a structure in order to increase reliability. At the same time, we would still try to restrict candidates in ways which would not render their performance on the task invalid. There will always be some tension between reliability and validity; the tester has to balance gains in one against losses in the other.
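The split-half procedure described in this chapter lends itself to a short computational sketch. The following Python illustration assumes a small invented matrix of dichotomously scored items (1 = correct, 0 = wrong): it splits the items into odd- and even-numbered halves, correlates the half scores, and applies the Spearman-Brown correction to estimate the reliability of the full-length test.

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

# Hypothetical item responses: rows = candidates, columns = items (1/0).
items = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
]

odd = [sum(row[0::2]) for row in items]   # score on odd-numbered items
even = [sum(row[1::2]) for row in items]  # score on even-numbered items

r_half = correlation(odd, even)           # reliability of a half-length test
r_full = 2 * r_half / (1 + r_half)        # Spearman-Brown full-length estimate
print(round(r_half, 2), round(r_full, 2))
```

Because the half-test correlation reflects a test of only half the length, the Spearman-Brown step-up is what makes the figure comparable with coefficients quoted for whole tests.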
Basic Theory and Evaluation Principles of Communicative Language Testing
Hong Liyan (College English Teaching Department, School of Foreign Studies, Huanggang Normal University, Huanggang, Hubei 438000) (2007, no. 9)

Abstract: Where there is teaching there is testing. This paper discusses the development of communicative language testing, the content of communicative competence, and the evaluation principles of communicative tests, offering a useful reference for foreign language teachers in China.
Keywords: communicative language testing; reliability; validity; authenticity

Language testing measures the degree to which students have mastered a language or the level they have reached. As a discipline, it studies the principles, regularities, content, design, evaluation, and result analysis of language tests, and it rests on a multidisciplinary foundation. For a language teacher to design test items scientifically and reasonably, some knowledge of basic testing theory and its evaluation criteria is necessary.

1. The evolution of language testing systems
Looking at the history of foreign language testing, Li Xiaoju distinguishes, from a developmental perspective, three generations of testing systems: the pre-scientific system, the structuralist system, and the communicative system. The first generation is embodied in pre-scientific language testing; its underlying view was that language is a body of knowledge. In the 1940s, the structuralist school represented by the American linguists Bloomfield (1933), Fries (1945), and Lado (1957) absorbed the scientific methods of psychometrics from psychology, giving rise to psychometric-structuralist testing. From the 1970s, linguists such as Savignon (1972) and Widdowson (1972) proposed a new system of language teaching: communicative language teaching. On this view, learning a language is not merely learning pronunciation, grammar, and vocabulary, nor merely training the skills of manipulating formal symbols, but acquiring the ability to communicate with other people.

2. The basic theory of communicative language testing: communicative language ability
The core of communicative testing theory is the notion of communicative competence, and different linguists' accounts of it mark three stages in the development of communicative testing. In 1972, Hymes first proposed that communicative competence comprises four aspects: possibility, feasibility, appropriateness, and actual performance. Canale and Swain (1980) held that communicative competence comprises knowledge and skill in four areas: grammatical competence, sociolinguistic competence, discourse competence, and strategic competence. In the early 1990s, Bachman proposed a new theoretical model of communicative ability. For Bachman, communicative language ability is the ability to combine language knowledge with the features of the language-use situation; it consists of three parts: language competence, strategic competence, and psychophysiological mechanisms. Bachman's theory of communicative testing not only covers these issues but also raises the question of a test's "degree of authenticity", treating it as a criterion for developing and evaluating a test.

3. Evaluation principles of communicative language testing: reliability, validity, and authenticity
With the development of testing theory and practice in recent years, Bachman and Palmer (1996) proposed the principle of "usefulness" for test design and evaluation. Here we mainly discuss reliability, validity, and authenticity.

3.1 Reliability
The reliability of a language test refers to the dependability and stability of its results. A test's reliability is constrained by three factors: the quantity and quality of the items, the administration of the test, and the scoring. The items must be sufficient in number, with good discrimination and a level of difficulty suited to the test population; the administration conditions should be identical for all test-takers; and the scoring standards require both inter-rater consistency and intra-rater consistency. Test administrators can also verify reliability by the following methods: (1) the split-half method: after the test, divide the items into odd- and even-numbered halves and compute the correlation between the rank orders of the two half scores; (2) the retest method: have the same group of test-takers sit the same paper again shortly after the formal administration, and compute the correlation between the two rank orders of scores; (3) the score-rescore method: have two teachers score the same paper under the same standard, or one teacher score the paper two or more times; (4) reliability-coefficient formulas: estimates of the degree of consistency among the items and components of the test.

3.2 Validity
Validity refers to whether a test measures what its designers intend it to measure. (1) Content validity: whether the test content is representative and comprehensive, that is, whether it tests what ought to be tested. Content validity is generally established not by statistical means but by item writers or reviewers rigorously analyzing the content, difficulty, and discrimination of the paper. (2) Construct validity: whether the test is based on a valid view of language (including views of language learning and language use). The construct validity of a test is the extent to which its results can be interpreted in terms of candidates' language ability and the psychological traits related to it. (3) Predictive validity and concurrent validity: predictive validity concerns whether the test results predict effectively; a paper with good predictive validity should correctly predict students' future behavior. Concurrent validity compares a new test with an established, recognized test, in order to demonstrate the validity of the new test.

3.3 Authenticity
Bachman (1991) proposed defining test authenticity in two respects: (1) situational authenticity, the degree to which the features of the test method correspond to the features of some target language-use situation; (2) interactional authenticity, i.e. which aspects of the candidate's language ability are engaged in completing a given test task, and to what degree. The criterion of authenticity helps us open up our thinking when designing items, gives a new angle for evaluating them, and raises the authenticity and credibility of the test. Bachman also proposed four measures for raising the interactional authenticity of a test: first, state requirements (the task design can specify which strategies the candidate must use to complete it); second, provide opportunities (give candidates sufficient time and the necessary information and tools); third, set tasks of appropriate difficulty (tasks that are too hard inhibit candidates' use of strategies); fourth, make tasks interesting (raising a task's situational authenticity can raise its interest).

The communicative approach is so far the most scientific and complete approach to foreign language testing, and communicative testing will become the mainstream of foreign language testing in the 21st century. Foreign language teachers should take part in testing reform, drawing on and developing communicative testing theory from abroad in the light of China's own teaching and testing practice, so that testing truly serves teaching and keeps raising the quality of language teaching.

References:
[1] Bachman, Lyle F. Fundamental Considerations in Language Testing. Oxford: Oxford University Press, 1990.
[2] Bachman, L. F. & Adrian S. Palmer. Language Testing in Practice. Oxford: Oxford University Press, 1996.
[3] 刘润清, 韩宝成. 语言测试和它的方法. 北京: 外语教学与研究出版社, 2000.
[4] 李筱菊. 语言测试科学与艺术. 湖南: 湖南教育出版社, 2001.
[5] 徐强. 交际法英语教学和考试评估. 上海: 上海外语教育出版社, 2000.
[6] 邹申. 英语语言测试理论与操作. 上海: 上海外语教育出版社, 1998.
English Teaching and Testing: Language Testing
A test is a procedure for eliciting certain behaviors, with the aim of inferring from those behaviors certain characteristics of the individual.
Anastasi (1962): "A test is essentially an objective and standardized measure of a sample of behavior." Three elements:
- a sample of behavior
- objective measurement
- standardized measurement
(Liu Runqing, p. 4)
Measurement: Bachman (1999):
2. Approaches to language testing
(1) J. B. Heaton:
- the essay-translation approach (写作翻译法)
- the structuralist approach (结构主义法)
- the integrative approach (综合法)
- the communicative approach (交际法)
[Slide diagrams: the components of Bachman's communicative language ability, and Bachman's model of language use. Recovered outline:
- Communicative language ability rests on knowledge structures (knowledge of the world) and language competence (knowledge of the language), connected through strategic competence, psychophysiological mechanisms, and the context of situation.
- Language competence divides into organizational competence and pragmatic competence. Organizational competence includes grammatical competence (syntax, morphology, phonology) and textual competence (rhetorical organization, cohesion). Pragmatic competence includes semantic competence (semantic properties, literal meaning, implied meaning), functional competence (ideational, manipulative, heuristic, imaginative), and sociolinguistic competence (sensitivity to dialects and varieties, sensitivity to register differences, the ability to interpret cultural references and figures of speech, sensitivity to naturalness).
- In the model of language use, strategic competence involves assessment of the situation and goal-setting; a planning process (retrieving material from the language knowledge base and organizing it toward the communicative goal); and execution (neural and physiological processes that produce or comprehend the utterance).]
Reliability of Language Tests (slides)

1. Test-retest reliability
- Definition: the degree of consistency between the results obtained when the same test is administered twice to the same group of examinees. Test-retest reliability shows whether the results change between the two administrations, and thus reflects the stability of test scores.
- Applicable situations: relatively stable psychological traits, such as personality.
- Advantage: very convenient for the test constructor.
- Drawbacks: computing test-retest reliability rests on fairly strict assumptions:
  - the trait being measured is itself stable;
  - examinees' practice effects and forgetting effects cancel each other out;
  - between the two administrations, there are no differences in what examinees have learned.
The Reliability of Language Tests

Outline:
I. Definition of reliability
II. Types of reliability
III. Parameters for estimating reliability
IV. Computing the various reliability coefficients with SPSS
V. Factors affecting reliability
VI. Ways to improve reliability
- Weighing things with a broken spring scale: are the results stable? Are they credible?
- A steel ruler vs. a cloth tape measure: if the same 20 people measure the length of a table, with which tool will the 20 measurements be more consistent and stable?
Psychological testing; language testing.

I. What is reliability?
Put simply, reliability is the degree of stability and consistency of a measurement.
Why are measurement results unstable and inconsistent? Observed score = true score + error.
II. Types of reliability
1. Test-retest reliability (test-retest method)
2. Alternate-form reliability
3. Split-half reliability
4. Homogeneity (internal consistency) reliability
5. Scorer reliability, e.g. the grading of college entrance exam essays, or job interviews
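Scorer reliability can itself be estimated by having two raters mark the same scripts independently and checking how well their scores agree. A minimal sketch with invented essay ratings, reporting a correlation plus a simple exact-agreement rate:

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

# Hypothetical: two raters independently score the same ten essays (0-15 scale).
rater_a = [12, 9, 14, 7, 10, 13, 8, 11, 6, 12]
rater_b = [11, 9, 15, 6, 11, 12, 8, 10, 7, 13]

inter_rater_r = correlation(rater_a, rater_b)
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(round(inter_rater_r, 2), exact_agreement)
```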
Exercise
A teacher has designed a comprehensive mathematics test for community college admission. The test contains multiple-choice items along the following dimensions: reading formulas, mathematical operations, and solving word problems. The teacher can administer the test only once, but wants to know how reliable it is. What should he do?
Language Testing (Test Usefulness)
interpretation, we need to provide evidence that the test scores reflect the area(s) of language ability we want to measure. For our purposes, we can consider a construct to be the specific definition of an ability that provides the basis for a given test or test task and for interpreting the scores derived from that task. The term construct validity is therefore used to refer to the extent to which we can interpret a given test score as an indicator of the ability(ies), or construct(s), we want to measure.

2) Relationship between reliability and construct validity
The primary purpose of a language test is to provide an interpretation of test scores as an indicator of an individual's language ability. The two measurement qualities, reliability and construct validity, are thus essential to the usefulness of any language test. Reliability is a necessary condition for construct validity, and hence for usefulness.

Item types:
1. forced-choice items (强制选择类)
2. constructed-response items (构建答案类)
3. pure items vs. hybrid items (单纯类和混合类)
4. discrete items vs. integrative items (分立式和综合式)
5. gap-filling items vs. completion items (语篇填空题和语句填空题)
6. discrete items vs. text-based items (分离试题和语篇依附试题)
1. Forced-choice items, also called selected-response or fixed-choice items, require test-takers to choose an answer from a number of given options.
The Validity and Reliability of Language Tests

3. Validity considerations in the post-test stage
In the design and use of language tests, validity, i.e. that our interpretations and uses of test scores are valid, is regarded as an indispensable index for guaranteeing test quality. In the post-test stage, therefore, test scores should be given descriptive analysis and interpretation, including the setting of an appropriate pass mark and an explanation of the abilities represented by each score band, so that decision-making bodies can use the scores correctly. Decisions based on scores have a broad impact: on government departments selecting personnel, on companies and enterprises recruiting staff, and on individual test-takers; scores are a genuine "double-edged sword". In large-scale, high-stakes selection examinations such as the national university entrance examination, the reasonable and valid use of scores makes it possible to select people useful to the country's political and economic development. Otherwise, relying on scores that lack validity evidence, or using scores improperly, defeats the purpose of selection and causes the country heavy losses. For companies that use test scores in employment decisions, valid scores help them recruit people useful to their business; the consequences of the opposite are obvious. For individual test-takers, the impact is also great: they may realize their ambitions and set out on a bright path in life, or they may be wrongly eliminated, at enormous cost in money, time, and morale. In the post-test stage, then, the interpretation and use of scores must be treated with great caution: the scores must be compared and analyzed against previous results, the opinions of item writers, testing experts, and candidates sought, and multiple methods used to analyze the test's validity. Only in this way can the test be responsible to score users and to individual test-takers, and a guarantee of high validity run through the whole testing process.
I. Reliability
Lyle F. Bachman defines reliability as "consistency of measurement". In other words, a reliable test should yield consistent results whenever and wherever it is administered. A test's reliability can be estimated by comparing the results of two test forms or their contents: if the results are close or consistent, the test's reliability is assured. The consistency of test results can be evaluated in a number of ways. For example, in classical true score measurement theory there are three ways of estimating reliability, each addressing a different source of error: internal consistency estimates are concerned chiefly with error arising from the test content and the scoring process; stability estimates indicate the consistency of results when the same test is given to the same group of test-takers after an interval of time; and equivalence estimates provide a reliability coefficient for the degree of consistency between the results of two test forms. The dependability of test results is also affected by many other factors, however. Although scores depend to a great extent on test-takers' language proficiency, they are also influenced by the test method; by test-takers' personal characteristics, such as cognitive style, background knowledge, affective factors, gender, and ethnicity; and by many unforeseeable factors.
Reliability Testing (信度检验) in English

Example 1:
A reliability test evaluates the stability and consistency of a measurement instrument or questionnaire by statistical analysis. In psychology, education, medicine, and other fields, reliability testing is widely applied in evaluating all kinds of measurement instruments, to ensure that they reflect the characteristics or states of the objects of study in a stable way.

Common statistical methods for reliability testing include Cronbach's alpha and the Kuder-Richardson coefficients. These coefficients help researchers evaluate an instrument's internal consistency, that is, the degree of association among the items it measures. Cronbach's alpha is typically used to evaluate the internal consistency among the items of a questionnaire; it takes values between 0 and 1, and the closer the value is to 1, the higher the questionnaire's reliability.
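A compact way to see what Cronbach's alpha measures is to compute it directly from the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). A minimal Python sketch on a small invented questionnaire (rows are respondents, columns are items):

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance

def cronbach_alpha(matrix):
    """matrix: list of respondents, each a list of item scores."""
    k = len(matrix[0])                                    # number of items
    items = [[row[i] for row in matrix] for i in range(k)]
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in matrix])    # variance of total scores
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical Likert-type responses (1-5) from six respondents to five items.
data = [
    [4, 5, 4, 4, 5],
    [2, 3, 2, 3, 2],
    [5, 5, 4, 5, 5],
    [3, 3, 3, 2, 3],
    [4, 4, 5, 4, 4],
    [1, 2, 2, 1, 2],
]
print(round(cronbach_alpha(data), 2))  # closer to 1 = higher internal consistency
```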
Besides internal consistency, reliability testing can also evaluate an instrument's test-retest reliability: the variation in the scores of the same measured object tested at different times or under different conditions. By measuring the same object repeatedly, researchers can evaluate how reliably the instrument performs in different circumstances, and so further confirm its dependability.

When carrying out a reliability test, researchers should keep several points in mind: ensure that the instrument is well designed and structured, so that the items it measures are highly correlated; choose appropriate statistical methods for the reliability test, so that the results are credible and accurate; and interpret and analyze the results, to determine whether the instrument's reliability meets the needs of the study.

Reliability testing is an important means of evaluating the dependability of measurement instruments. It helps researchers ensure an instrument's stability and accuracy, and so raises the credibility and scientific value of research results. In actual research, researchers should give full weight to reliability testing, to ensure that their conclusions are valid and dependable.
Example 2:
A reliability test is a statistical method for evaluating the stability and consistency of a measurement or assessment instrument. Reliability testing is a very important part of a researcher's work, because an instrument with high reliability improves the dependability and accuracy of research results. At the research design stage, researchers need to run reliability checks on the instruments they use, to ensure that the results obtained reflect the real situation of the objects of study.
Language Testing
Reflection on Language Testing

This reflection is arranged in eight parts, covering the definition, design, procedure, rating process, validity, measurement, social character, and new directions of language testing.

Part 1: What is language testing?
Testing is a universal feature of social life. In fact, as so often happens in the modern world, this process, which so much affects our lives, becomes the province of experts, and we become dependent on them. There are several reasons why people working in the field of language study need knowledge of language testing. Firstly, language tests play a powerful role in many people's lives, acting as gateways at important transitional moments in education, in employment, and in moving from one country to another. Secondly, you may be working with language tests in your professional life as a teacher or administrator: teaching to a test, administering tests, or relying on information from tests to make decisions on the placement of students in particular courses. Finally, if you are conducting research in language study, you may need measures of the language proficiency of your subjects.

Not all language tests are of the same kind. They differ with respect to how they are designed and what they are for: in other words, with respect to test method and test purpose. In terms of method, we can broadly distinguish traditional paper-and-pencil language tests from performance tests. Paper-and-pencil tests take the form of the familiar examination question paper. In performance-based tests, language skills are assessed in an act of communication.

Language tests also differ according to their purpose. In fact, the same form of test may be used for differing purposes, although in other cases the purpose may affect the form. The most familiar distinction in terms of test purpose is that between achievement and proficiency tests. Achievement tests are associated with the process of instruction and accumulate evidence during, or at the end of, a course of study, in order to see whether and where progress has been made in terms of the goals of learning. Alternative assessment stresses the need for assessment to be integrated with the goals of the curriculum and to have a constructive relationship with teaching and learning.

Testing is about making inferences; this essential point is obscured by the fact that some testing procedures, particularly in performance assessment, appear to involve direct observation. There are a number of other limits to the authenticity of tests, which force us to recognize an inevitable gap between the test and the criterion. Validity is another important facet of language testing, and it involves two things: one is understanding how, in principle, performance on the test can be used to infer performance in the criterion; the other is using empirical data from test performances to investigate the defensibility of that understanding, and hence of the interpretations (the judgments about test-takers) that follow from it. Testing thus necessarily involves interpretation of the data of test performance as evidence of knowledge or ability of one kind or another.

Part 2: Communication and the design of language tests
Essential to the activities of designing tests and interpreting the meaning of test scores is the view of language and language use embodied in the test. The term test construct refers to those aspects of knowledge or skill possessed by the candidate which are being measured.
Defining the test construct involves being clear about what knowledge of language consists of, and how that knowledge is deployed in actual performance (language use).

There was once a tendency to atomize and decontextualize the knowledge to be tested, and to test aspects of knowledge in isolation. The practice of testing separate, individual points of knowledge, known as discrete-point testing, was reinforced by theory and practice within psychometrics, the emerging science of the measurement of cognitive abilities. This stressed the need for certain properties of measurement, particularly reliability, or consistency of estimation of candidates' abilities.

The discrete-point tradition of testing came to be seen as focusing too exclusively on knowledge of the formal linguistic system for its own sake, rather than on the way such knowledge is used to achieve communication. The American John Oller offered a new view of language and language use underpinning tests, focusing less on knowledge of language and more on the psycholinguistic processing involved in language use. Language use was seen as involving two factors: (1) the on-line processing of language in real time (for example, in naturalistic speaking and listening activities), and (2) a "pragmatic mapping" component, that is, the way formal knowledge of the systematic features of language is drawn on for the expression and understanding of meaning in context.

From the early 1970s, a new theory of language and language use began to exert a significant influence on language teaching and, potentially, on language testing. This was Hymes's theory of communicative competence, which greatly expanded the scope of what was covered by an understanding of language and the ability to use language in context, particularly in terms of the social demands of performance. Hymes saw that knowing a language was more than knowing its rules of grammar: there were culturally specific rules of use which related the language used to features of the communicative context. Communicative language tests ultimately came to have two features: (1) they were performance tests, requiring assessment to be carried out while the learner or candidate was engaged in an extended act of communication, either receptive or productive, or both; and (2) they paid attention to the social roles candidates were likely to assume in real-world settings, and offered a means of specifying the demands of such roles in detail.

Various aspects of knowledge or competence were specified in the early 1980s by Michael Canale and Merrill Swain in Canada: (1) grammatical or formal competence, which covered the kind of knowledge (of systematic features of grammar, lexis, and phonology) familiar from the discrete-point tradition of testing; (2) sociolinguistic competence, or knowledge of rules of language use in terms of what is appropriate with different types of interlocutors, in different settings, and on different topics; (3) strategic competence, or the ability to compensate in performance for incomplete or imperfect linguistic resources in a second language; and (4) discourse competence, or the ability to deal with extended use of language in context.

Part 3: The testing cycle
Designing and introducing a new test is a little like getting a new car on the road: it involves a design stage, a construction stage, and a try-out stage before the test is finally operational. But that suggests a linear process, whereas in fact test development involves a cycle of activity, because the actual operational use of the test generates evidence about its own qualities.

Before they begin thinking in detail about the design of a test, test developers need to get the lay of the land, that is, to establish the constraints under which they are working, and under which the test will be administered. Establishing test content involves careful sampling from the domain of the test, that is, the set of tasks or the kinds of behavior in the criterion setting, as informed by our understanding of the test construct. The next thing to consider in test design is the way in which candidates will be required to interact with the test materials. An alternative to grappling with the dilemma of authenticity of response involves accepting, to a greater or lesser degree, the artificiality of the test situation, and using a range of conventional and possibly inauthentic test formats.

The result of the design process, in terms of test content and test method, is the creation of test specifications. The specifications will include information on such matters as the length and structure of each part of the test, the type of materials with which candidates will have to engage, the source of such materials if authentic, the extent to which authentic materials may be altered, the response format, the test rubric, and how responses are to be scored. The fourth stage is trialing, or trying out, the test materials and procedures prior to their use under operational conditions. Throughout the testing cycle, data for the investigation of test qualities are generated automatically in the form of responses to test items.

Part 4: The rating process
There are three main aspects to the validation of rating procedures: establishing rating protocols; exploring differences between individual raters and mitigating their effects; and understanding interactions between raters and other features of the rating process.

Establishing a rating procedure involves three main steps. First, there is agreement about the conditions (including the length of time) under which the person's performance or behavior is elicited and/or attended to by the rater. Second, certain features of the performance are agreed to be critical, and the criteria for judging these are determined and agreed. Finally, raters who have been trained to an agreed understanding of the criteria characterize a performance by allocating a grade or rating.

There are also problems with raters: a rating reflects not only the quality of the performance, but also the qualities, as a rater, of the person who has judged it. Rating is essentially reduced to a process of recognizing objective signs, with classification following automatically. In establishing a rating procedure, we need to consider the criteria by which performances at a given level will be recognized, and then decide how many different levels of performance we wish to distinguish. The answers to these questions determine the basic framework or orientation for the rating process.

Deciding which of these orientations best fits a particular assessment setting will depend on the context and purpose of the assessment. Most often, frameworks for rating are expressed as scales, as this allows the greatest flexibility to users, who may want to use the multiple distinctions available from a scale, or who may choose to focus on only one cut-point or region of the scale.

Performances are complex. Judging a performance involves balancing perceptions of a number of its different features. Raters may also differ in relative severity: a consistent tendency to see a particular performance as narrowly demonstrating, or narrowly failing to demonstrate, achievement at a particular performance level. Usually, the psychological pressure of embarrassment over having given ratings out of line with those of others is sufficient to get raters to reduce their differences. Ongoing monitoring of rater performance is clearly necessary to ensure fairness in the testing process.

Part 5: Validity: testing the test
Testing is a matter of using data to establish evidence of learning. But evidence does not occur concretely in the natural state, so to speak; it is an abstract inference, a matter of judgment. There are two stages in language testing to ensure the defensibility and fairness of interpretations based on test performance. There are certain differences between the two contexts (testing and the law). First, legal cases usually involve an individual accused, whereas test validation looks at the procedures as a whole, for all the candidates affected by them. Secondly, in the case of a crime, the picture being formed in the minds of the police concerns something that has already happened; that is, it is retrospective.

Test content is one issue: the extent to which the test content forms a satisfactory basis for the inferences to be made from test performance. The issues arising in such contexts are issues of what is known as content-related validity or, more traditionally, content validity. Judgments as to the relevance of content are often quite complex, and the validation effort is accordingly elaborate.

We saw that the most commonly used methods involve considerable compromise on the authenticity of the test, so that the gap between test performance and performance in the criterion may, on the face of it, appear quite wide. In general, tests may introduce factors that are irrelevant to the aspects of ability being measured (construct-irrelevant variance), or they may require too little of the candidate (construct under-representation). In recent decades, a renewed theory of test validation has expanded the scope of validation research to include the changes that may occur as a consequence of a test's introduction.

Part 6: Measurement
Measurement investigates the quality of the process of assessment by looking at scores. Two main steps are involved: (1) quantification, that is, the assigning of numbers or scores to various outcomes of assessment (the set of scores available for analysis when data are gathered from a number of test-takers is known as a data matrix); and (2) checking for various kinds of mathematical and statistical patterning within the matrix, in order to investigate the extent to which necessary properties (for example, consistency of performance by candidates, or by judges) are present in the assessment. As the investigation of rater agreement depends on the comparison of ratings, the first step involves careful data collection.

Mathematical methods for establishing predictable numerical relations of this kind originated in the rather prosaic field of agriculture, in order to explore the predictive relationship between varying amounts of fertilizer and the associated crop yields.

Item analysis usually provides two kinds of information on items: item facility, which helps us decide whether test items are at the right level for the target group, and item discrimination, which allows us to see whether individual items provide information on candidates' abilities consistent with that provided by the other items on the test. Item facility expresses the proportion of the people taking the test who got a given item right. Analysis of item discrimination addresses a different target: consistency of performance by candidates across items.

An alternative approach, which does not use comparison between individuals as its frame of reference, is known as criterion-referenced measurement. New measurement approaches continually emerge; the most significant of them is known by the general name of Item Response Theory (IRT), which represents a new approach to item analysis.

Part 7: The social character of language tests
The individualized and individual focus of the traditional approaches described so far is really rather surprising when we consider the inherently institutional character of assessment. When test reforms are introduced within the educational system, they are likely to figure prominently in the press and become matters of public concern.

Language tests have a long history of use as instruments of social and cultural exclusion. One of the earliest recorded instances is the shibboleth test, mentioned in the Old Testament: following a decisive military battle between two neighboring ethnic groups, members of the vanquished group attempted to escape by blending in with their culturally and linguistically very similar victors. In international education, tests are used to control access to educational opportunities; typically, international students need to meet a standard on a test of language for academic purposes before they are admitted to the university of their choice.

These policies and practices throw up a host of questions about fairness, and about the policy issues surrounding testing practice. They also raise the question of the responsibilities of language testers. Those who advocate the position of socially responsible language testing reject the view that language testing is merely a scientific and technical activity; this expanded sense of responsibility sees ethical testing practice as involving test developers in taking responsibility for the effects of tests. In contrast to those advocating the direct social responsibility of the tester, a more traditional approach limits the social responsibility of language testers to questions of the professional ethics of their practice.

Part 8: New directions, and dilemmas?
We live in a time of contradictions. The speed and impressiveness of technological advance suggest an era of great certainty and confidence; yet at the same time, current social theories undermine our certainties and have engendered a profound questioning of existing assumptions about the self and its social construction.

Rapid developments in computer technology have had a major impact on test delivery. Already, many important national and international language tests, including TOEFL, are moving to computer-based testing (CBT). The proponents of computer-based testing can point to a number of advantages. First, scoring of fixed-response items can be done automatically, and candidates can be given a score immediately. Second, the computer can deliver tests that are tailored to the particular abilities of the candidate. Tape recorders can be used in the administration of speaking tests: candidates are presented with a prompt on tape and are asked to respond as if they were talking to a person, the response being recorded on tape.

The speed of the technological advances affecting language testing sometimes gives an impression of a field confidently moving ahead, notwithstanding the issues raised. But concomitantly, the change in perspective from the individual to the social nature of testing performance has provoked something of an intellectual crisis in the field. Language testing remains a complex and perplexing activity. While insights from evolving theories of communication may be disconcerting, it is necessary to grasp them fully, and the challenge they pose, if our assessments are to have any chance of having the meaning we intend them to have. Language testing is an uncertain and approximate business at the best of times, even if to the outsider this may be camouflaged by its impressive, even daunting, technical (and technological) trappings, not to mention the authority of the institutions whose goals tests serve. Every test is vulnerable to good questions: about language and language use, about measurement, about test procedures, and about the uses to which the information in tests is to be put. In particular, a language test is only as good as the theory of language on which it is based, and it is within this area of theoretical inquiry into the essential nature of language and communication that we need to develop our ability to ask the next question.

After reading this book, I find it very important for us, as future English teachers, to master some knowledge of language testing. Besides, I have gained a great deal about judging others from the relevant chapters, such as those on the rating process and measurement. In a word, language testing is so useful that we should absorb its essence and explore the subject further.
On the Reliability and Validity of English Language Tests

Abstract: Reliability and validity are two important criteria for evaluating English language tests, and key factors in judging whether a test is effective and dependable. Reliability refers to the dependability, credibility, and stability of test results; validity refers to the scientific soundness and effectiveness of a test, that is, the degree to which the test achieves its intended purpose. This paper examines the two concepts in depth and further explains the relationship between them.
Keywords: English language testing; reliability; validity

Language testing is a comprehensive science bound up with language teaching; it applies a set of scientific yet practical methods to evaluate students' ability to use the language objectively. The criteria for language tests include reliability, validity, authenticity, discrimination, and practicality. Among these criteria, reliability and validity are two especially important dimensions, and both must be applied in any English language test.

The two concepts were first introduced into the field of language testing in the 1930s. Structuralist testers represented by Lado systematically expounded and argued for reliability and validity, holding that language testing had formed a scientific system and become an independent discipline. On the whole, language testing leans toward reliability and validity in both theory and practice.

In addition, reliability and validity are important bases for evaluating achievement tests. The relationship between them is a fundamental issue in such testing, whose ultimate goal is to serve language teaching. The value of the two criteria therefore lies in whether they bear importantly on English teaching, whether they can support it, whether they help achieve teaching objectives, and whether they fit the learning process. Language tests can not only check students' ability and level in mastering what they have learned, but also uncover latent problems in their learning, and provide effective guidance and help for teachers' subsequent teaching. In view of this, this paper explores the two concepts in depth and further explains the relationship between them.
1. Reliability and validity in language testing
Reliability refers to the dependability, credibility, and stability of test results: the results should not be distorted by the particular group of test-takers or the particular items, so that they reflect the test-takers' true language behavior. In short, test results should be objective and truthful, unaffected by other factors. If an English test paper loses its reliability, it cannot reflect test-takers' language behavior objectively and fairly, and the paper loses its practical value. Accordingly, if the same test administered on different occasions yields largely consistent results, its reliability is relatively high (Feng Tong, 2003).
The Validity of Language Tests

On the Validity of Language Tests
[Abstract] Reliability and validity are the two basic requirements of language testing, and the relationship between them is its fundamental issue. The validity of a test refers to the degree to which it measures what it is intended to measure; reliability refers to the dependability of test results. This paper focuses on the meaning of validity, and discusses in detail methods of measuring validity and its relationship with reliability.
[Keywords] language testing; validity; reliability

[Abstract] As a branch of applied linguistics, language testing has developed into a relatively independent subject. Validity and reliability are the two most important criteria of language testing, and the relationship between them is the ultimate issue. This article comments on the two criteria in detail: validity is concerned with whether a test measures accurately what it is intended to measure, while reliability means the consistency and dependability of test results. The article puts emphasis on validity, and also explains the methods of testing validity as well as the relation between validity and reliability.
[Key words] language testing; validity; reliability

1. Introduction
As a branch of applied linguistics, language testing has developed into a relatively independent discipline.
A Study of the "Consistency" and "Usefulness" of the CATTI Level 3 Interpreting Test from the Perspective of Communicative Language Testing Theory

Abstract: Taking Lyle F. Bachman's communicative language testing theory as its basis, this paper applies the theory's evaluation principles of "consistency" and "usefulness" to analyze and evaluate the China Accreditation Test for Translators and Interpreters (CATTI). Drawing on past CATTI English Level 3 interpreting papers, it analyzes the reasonableness and scientific soundness of the interpreting test model from the angles of reliability, validity, authenticity, interactiveness, washback, and practicality, and offers constructive comments and suggestions on CATTI's washback effect on the teaching and learning of interpreting.
Keywords: communicative language testing theory; CATTI; consistency; usefulness

1. Introduction
Interpreting tests are tests of communicative language ability, so the first principle they must follow is that of communicative language testing theory (Bao Xiaoying, 2009). Drawing on past CATTI English Level 3 interpreting papers, this paper analyzes the reasonableness and scientific soundness of the interpreting test model from the angles of reliability, construct validity, authenticity, interactiveness, washback, and practicality in communicative language testing theory. It makes a preliminary study of feasible strategies for optimizing and enhancing the situational and interactional authenticity of the CATTI English Level 3 interpreting test, and offers comments and suggestions on improving the reliability of interpreting test scoring and enhancing CATTI's positive washback on the teaching and learning of interpreting.

2. Communicative language testing theory
The distinguished American applied linguist Lyle F. Bachman set out the theoretical basis for the design of communicative English tests in Fundamental Considerations in Language Testing (1990). Bachman (1990) holds that communicative language ability comprises three parts: language competence, strategic competence, and psychophysiological mechanisms.

2.1 Features of communicative language testing
In Language Testing in Practice, Bachman and Palmer propose the features of communicative language tests: authenticity, functionality, interactiveness, situationality, and integrativeness.
Cao Linlin
Faculty of English Language & Culture
Lecture 2
Reliability
Scores on test A

Student    Score obtained    Score obtained on the following day
1          68                82
2          46                28
3          19                34
4          89                67
5          43                63
6          56                59
7          43                35
8          27                23
9          76                62
10         62                49

Reliability = .74

(To compute in Excel: copy in the data, then choose Tools, Data Analysis, Correlation, select the data range, and confirm.)

Scores on test B

Student    Score obtained    Score obtained on the following day
1          65                69
2          48                52
3          23                21
4          85                90
5          44                39
6          56                59
7          38                35
8          19                16
9          67                62
10         52                57

Reliability = .99

Which one is more reliable? Test A or B?
[Diagram]
Definition

[Diagram: a test score is affected by communicative language ability, test method facets, personal attributes, and random factors.]

Reliability is concerned with answering the question "How much of an individual's test performance is due to measurement error, or to factors other than the language ability we want to measure?" and with minimizing the effects of these factors on test scores.
Issues with the test-retest method:
- practice effects can be large but inconsistent (ability tests);
- the trait itself may be unstable (e.g. mood);
- WAIS-III: most scales report test-retest / split-half coefficients.
Estimating split-half reliability:
- A Pearson r between the two halves underestimates the reliability of the full-length test.
- Spearman-Brown formula: r_SB = 2r_hh / (1 + r_hh)
(see the Excel formulas)
Reliability based on item variances
- Kuder-Richardson reliability coefficient (KR-20):

  r_xx' = [k / (k - 1)] * [1 - (sum(pq) / s_x^2)]

  where k = the number of items on the test; sum(pq) = the sum of the item variances, with p = the proportion of individuals who answer the item correctly, q = the proportion of individuals who answer the item incorrectly, and pq = the variance of a dichotomous item, i.e. the product of these two proportions; and s_x^2 = the total test score variance.
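A minimal Python sketch of KR-20, applying the formula above to a small invented matrix of dichotomous responses. One assumption to flag: the population form of the variance (dividing by n) is used for both the item variances (p times q) and the total-score variance, so the two parts of the formula match; textbook conventions vary on this point.

```python
def kr20(matrix):
    """KR-20 for dichotomous (1/0) item responses; rows = examinees."""
    n = len(matrix)
    k = len(matrix[0])
    # Sum of item variances: p * q for each item (population form).
    pq_sum = 0.0
    for i in range(k):
        p = sum(row[i] for row in matrix) / n
        pq_sum += p * (1 - p)
    # Variance of total scores (population form, to match p * q).
    totals = [sum(row) for row in matrix]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n
    return k / (k - 1) * (1 - pq_sum / var_t)

# Hypothetical responses: 6 examinees x 8 items.
resp = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1, 0, 0],
]
print(round(kr20(resp), 2))
```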
Stability (test-retest reliability)
Definition: stability reliability is consistency of measurement across time.
Procedure: test-retest, with the same test administered at different times, provides the estimate of stability reliability. The reliability coefficient is the correlation coefficient between the scores of the two test administrations.

Split-half estimates, by contrast, have the drawback of a lack of precision: the result depends on how the items are selected for the two halves.
Split-half results for a personality questionnaire

Subject    Total score    Odd-item score    Even-item score
1          55             28                27
2          49             26                23
3          76             34                42
4          37             18                19
5          44             23                21
6          50             30                20
7          57             30                27
8          62             33                29
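From these figures, the split-half coefficient is just the correlation between the odd-item and even-item scores, stepped up to full length with the Spearman-Brown formula given earlier. A quick sketch (even-item scores taken as total minus odd throughout):

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

odd = [28, 26, 34, 18, 23, 30, 30, 33]   # odd-item scores, subjects 1-8
even = [27, 23, 42, 19, 21, 20, 27, 29]  # even-item scores (total minus odd)

r_hh = correlation(odd, even)  # half-test correlation
r_sb = 2 * r_hh / (1 + r_hh)   # Spearman-Brown step-up to full length
print(round(r_hh, 3), round(r_sb, 3))
```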
Why use split-half reliability instead of test-retest? Mainly cost and practicality: the test is administered only once. The major challenges are how to split the items into halves (by difficulty level? odd- versus even-numbered items?) and how to compute the coefficient.
Stability (test-retest reliability)
Stability is measured by correlating test scores obtained from the same individuals over a period of time. If individuals respond consistently from one test administration to another, the correlation between scores will be high.
Reliability as internal consistency (内部一致性系数):
- Split-half reliability (分半信度)
- Coefficient alpha (α系数)
- Kuder-Richardson Formula 20 (KR-20)

Split-half reliability (分半信度):
Within the CTS model, there are three approaches to estimating reliability, each of which addresses different sources of error. Stability estimates indicate how consistent test scores are over time (the correlation between scores on the same test taken twice), equivalence estimates provide an indication of the extent to which scores on alternate forms of a test are equivalent (the correlation between scores on two or more forms of the same test taken concurrently), and internal consistency estimates are concerned primarily with sources of error from within the test and scoring procedures (the correlation among the items on a single test). The estimates of reliability that these approaches yield are called reliability coefficients.
[Scatter diagram comparing two sets of scores: 53, 44, 26, 32, 28, 38, 39 paired with 49, 46, 28, 34, 25, 34, 36; r = 0.931]

Comparability between different forms of the Guangdong college entrance examination English listening and speaking test.
Equivalence (parallel forms reliability)
Definition: Equivalence is determined by constructing two or more forms of an examination and administering them to the same persons at about the same time.
The correlation coefficient ranges over (-1, 1): the higher, the better.
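The two reliability figures quoted at the top of this lecture (.74 for test A and .99 for test B) can be reproduced directly from the score tables, as an alternative to the Excel Data Analysis route. A minimal sketch:

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

test_a_day1 = [68, 46, 19, 89, 43, 56, 43, 27, 76, 62]
test_a_day2 = [82, 28, 34, 67, 63, 59, 35, 23, 62, 49]
test_b_day1 = [65, 48, 23, 85, 44, 56, 38, 19, 67, 52]
test_b_day2 = [69, 52, 21, 90, 39, 59, 35, 16, 62, 57]

print(round(correlation(test_a_day1, test_a_day2), 2))  # about .74
print(round(correlation(test_b_day1, test_b_day2), 2))  # about .99
```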
Test-retest results for a personality questionnaire

Subject    First administration    Second administration
1          23                      27
2          44                      38
3          35                      37
4
5
6
7
8
9
10