Lecture 8: Translation System Evaluation
• Evaluation metrics
– subjective judgments by human evaluators
– automatic evaluation metrics
– task-based evaluation, e.g.:
  – how much post-editing effort?
  – does information come across?
• Evaluators are more consistent:
Evaluation type    P(A)   P(E)   K
Fluency            .400   .2     .250
Adequacy           .380   .2     .226
Sentence ranking   .582   .333   .373
• Basic strategy
– given: machine translation output
– given: human reference translation
– task: compute similarity between them
Chapter 8 Evaluation
Statistical Machine Translation
Evaluation
• How good is a given machine translation system?
• Hard problem, since many different translations acceptable → semantic equivalence / similarity
• Precision

$$ \text{precision} = \frac{\text{correct}}{\text{output-length}} = \frac{3}{6} = 50\% $$

• Recall

$$ \text{recall} = \frac{\text{correct}}{\text{reference-length}} = \frac{3}{7} = 43\% $$

• F-measure

$$ \text{f-measure} = \frac{\text{precision} \times \text{recall}}{(\text{precision} + \text{recall})/2} = \frac{.5 \times .43}{(.5 + .43)/2} = 46\% $$
• Levenshtein distance

$$ \text{wer} = \frac{\text{substitutions} + \text{insertions} + \text{deletions}}{\text{reference-length}} $$
Example
[Screenshot of the annotation tool: the system outputs "Israeli officials responsibility of airport safety" and "airport security Israeli officials are responsible" are each judged for adequacy and fluency on 1–5 scales]

(from WMT 2006 evaluation)
Measuring Agreement between Evaluators
• Kappa coefficient

$$ K = \frac{p(A) - p(E)}{1 - p(E)} $$
Ten Translations of a Chinese Sentence
Israeli officials are responsible for airport security.
Israel is in charge of the security at this airport.
The security work for this airport is the responsibility of the Israel government.
Israeli side was in charge of the security of this airport.
Israel is responsible for the airport’s security.
Israel is responsible for safety work at this airport.
Israel presides over the security of the airport.
Israel took charge of the airport security.
The safety of this airport is taken charge of by Israel.
This airport’s security is the responsibility of the Israeli security officials.
Word Error Rate
• Minimum number of editing steps to transform output to reference
– match: words match, no cost
– substitution: replace one word with another
– insertion: add word
– deletion: drop word
• Metrics
– Adequacy: Does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted?
– Fluency: Is the output good fluent English? This involves both grammatical correctness and idiomatic word choices.

Edit distance matrices for the two system outputs (rows: reference words; columns: hypothesis word positions 0–6; the bottom-right cell is the total number of edits):

System A (Israeli officials responsibility of airport safety):

              0  1  2  3  4  5  6
Israeli       1  0  1  2  3  4  5
officials     2  1  0  1  2  3  4
are           3  2  1  1  2  3  4
responsible   4  3  2  2  2  3  4
for           5  4  3  3  3  3  4
airport       6  5  4  4  4  3  4
security      7  6  5  5  5  4  4

System B (airport security Israeli officials are responsible):

              0  1  2  3  4  5  6
Israeli       1  1  2  2  3  4  5
officials     2  2  2  3  2  3  4
are           3  3  3  3  3  2  3
responsible   4  4  4  4  4  3  2
for           5  5  5  5  5  4  3
airport       6  5  6  6  6  5  4
security      7  6  5  6  7  6  5
Precision and Recall of Words
SYSTEM A: Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
Other Evaluation Criteria
When deploying systems, considerations go beyond quality of translations
– Speed: we prefer faster machine translation systems
– Size: fits into memory of available machines (e.g., handheld devices)
– Integration: can be integrated into existing workflow
– Customization: can be adapted to user’s needs
SYSTEM B: airport security Israeli officials are responsible
Metric      System A   System B
precision   50%        100%
recall      43%        100%
f-measure   46%        100%
flaw: no penalty for reordering
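For comparison, here is a minimal Python sketch of this word-overlap scoring (the helper name is my own, not from the lecture); it reproduces the 50% / 43% / 46% figures for System A and shows why the scrambled word order of System B goes unpunished:

```python
from collections import Counter

def precision_recall_f(hypothesis, reference):
    """Word-overlap precision, recall and f-measure as defined above."""
    hyp, ref = hypothesis.split(), reference.split()
    # "correct" = number of hypothesis words that also occur in the reference
    # (counts are clipped, so a word is not rewarded more often than it appears)
    correct = sum((Counter(hyp) & Counter(ref)).values())
    precision = correct / len(hyp)
    recall = correct / len(ref)
    f_measure = precision * recall / ((precision + recall) / 2)
    return precision, recall, f_measure

reference = "Israeli officials are responsible for airport security"
# System A: 3 of its 6 words are correct -> 50% precision, 3/7 = 43% recall, 46% f-measure
print(precision_recall_f("Israeli officials responsibility of airport safety", reference))
# System B: all six of its words occur in the reference, so the metric
# does not penalize the reordering at all
print(precision_recall_f("airport security Israeli officials are responsible", reference))
```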
Fluency and Adequacy: Scales
Adequacy              Fluency
5  all meaning        5  flawless English
4  most meaning       4  good English
3  much meaning       3  non-native English
2  little meaning     2  disfluent English
1  none               1  incomprehensible
Annotation Tool
Evaluators Disagree
• Histogram of adequacy judgments by different human evaluators
– p(A): proportion of times that the evaluators agree
– p(E): proportion of times that they would agree by chance
(5-point scale → p(E) = 1/5)
• Example: inter-evaluator agreement in the WMT 2007 evaluation campaign
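For instance, plugging the fluency figures from that campaign (P(A) = .400, P(E) = .2) into the kappa formula gives the value reported in the agreement table:

$$ K = \frac{0.400 - 0.2}{1 - 0.2} = \frac{0.200}{0.800} = 0.250 $$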
Automatic Evaluation Metrics
• Goal: computer program that computes the quality of translations
• Advantages: low cost, tunable, consistent
Precision and Recall

SYSTEM A: Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
Evaluation type    P(A)   P(E)   K
Fluency            .400   .2     .250
Adequacy           .380   .2     .226
Ranking Translations
• Task for evaluator: Is translation X better than translation Y? (choices: better, worse, equal)
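With three possible answers, two evaluators agree by chance one third of the time, so P(E) = 1/3; using the observed agreement P(A) = .582 from the WMT 2007 figures:

$$ K = \frac{0.582 - 0.333}{1 - 0.333} \approx 0.373 $$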
Goals for Evaluation Metrics
– Low cost: reduce time and money spent on carrying out evaluation
– Tunable: automatically optimize system performance towards metric
– Meaningful: score should give intuitive interpretation of translation quality
– Consistent: repeated use of metric should give same results
– Correct: metric must rank better systems higher
(the ten translations of the Chinese sentence above are a typical example from the 2001 NIST evaluation set)
Adequacy and Fluency
• Human judgement
– given: machine translation output
– given: source and/or reference translation
– task: assess the quality of the machine translation output