Common Corpus Terms Explained
Frank Liang
Common Terms in Corpus Linguistics
Special corpus (专用语料库): a corpus assembled for a specific purpose; special corpora vary in size and composition according to that purpose. They are not balanced (except within the scope of their given purpose) and, if used for other purposes, give a distorted view of the language segment. Their main advantage is that texts can be selected so that the phenomena under study occur much more frequently than in a balanced corpus. A corpus enriched in this way can be much smaller than a balanced corpus providing the same data.
A warm welcome to the teachers from all over the country!
Corpus (语料库; the word's other dictionary sense is "body, corpse"):
– (pl. corpora or corpuses): a collection of texts, now usually in machine-readable form, compiled to be representative of a particular kind of language and often provided with some kind of annotation (标注).
The corpus approach is based on real language use: facts speak louder than words.
A corpus can be analyzed using software tools, much like those used to find key words on the Internet, but with greater sophistication. By evaluating the results of these searches, it is possible to see how language is really used, and to find answers to questions like these:
Why build corpora? Why use corpus methods to study language and apply them to language learning?
Example: start or begin? Which is more common in spoken English?
Our teachers often say things like "Let's begin!", don't they?
Yet searches of corpora such as the BNC have found that start is the more common of the two in spoken language.
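A comparison of this kind can be sketched in a few lines of Python. The sample text below is a hypothetical toy example, not data from the BNC, and the inflected forms are listed by hand rather than produced by a lemmatizer; the point is only to show how raw counts are normalized to a rate per million tokens so that corpora of different sizes can be compared.

```python
import re
from collections import Counter

# Hypothetical toy "spoken" sample; NOT taken from the BNC.
SAMPLE = (
    "Shall we start now? I started the car and it began to rain. "
    "Let's begin. When does the film start? She starts work at nine."
)

# Inflected forms grouped under each lemma (hand-listed assumption).
FORMS = {
    "start": {"start", "starts", "started", "starting"},
    "begin": {"begin", "begins", "began", "begun", "beginning"},
}

def lemma_frequencies(text, forms=FORMS):
    """Return {lemma: (raw count, rate per million tokens)}."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    result = {}
    for lemma, variants in forms.items():
        raw = sum(counts[v] for v in variants)
        per_million = raw / total * 1_000_000 if total else 0.0
        result[lemma] = (raw, per_million)
    return result

freqs = lemma_frequencies(SAMPLE)
for lemma, (raw, per_million) in freqs.items():
    print(f"{lemma}: {raw} hits, {per_million:.0f} per million tokens")
```

Matching on word forms alone also counts "beginning" used as a noun; an annotated corpus would allow such cases to be filtered out.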
How many words must a learner know in order to participate in everyday conversation?
Materials developed with a corpus can therefore be more authentic and can illustrate language as it is really used.
Types of corpora
Annotated (标注) corpus, or tagged corpus: a corpus enhanced with various types of linguistic information. An annotated corpus may be considered a repository of linguistic information, because information that was implicit in the plain text has been made explicit through concrete annotation (“added value”, 附加值).
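As a minimal illustration of implicit information made explicit, the sketch below stores one sentence both as plain text and in a word_TAG format. The format and the tag names (NOUN, VERB, PREP, DET) are assumptions chosen for the example, not a standard tag set.

```python
# Plain text: part-of-speech information is only implicit.
plain = "Time flies like an arrow"

# The same sentence in a hypothetical word_TAG annotation format.
tagged = "Time_NOUN flies_VERB like_PREP an_DET arrow_NOUN"

def parse_tagged(line):
    """Split 'word_TAG' items into (word, tag) pairs."""
    return [tuple(item.rsplit("_", 1)) for item in line.split()]

pairs = parse_tagged(tagged)

# With explicit tags, a query such as "all nouns" becomes trivial.
nouns = [word for word, tag in pairs if tag == "NOUN"]
print(nouns)
```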
Parallel (平行), or aligned, corpus: a multilingual corpus in which texts in one language and their translations into other languages are aligned, preferably phrase by phrase.
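One simple way to picture alignment is as a list of source/translation pairs. The sketch below assumes sentence-level alignment and uses invented English-Chinese pairs; the function name `aligned_hits` is hypothetical.

```python
# Hypothetical aligned pairs: (English segment, Chinese translation).
ALIGNED = [
    ("Let's begin.", "我们开始吧。"),
    ("The film starts at nine.", "电影九点开始。"),
    ("Thank you.", "谢谢。"),
]

def aligned_hits(pairs, keyword):
    """Return the pairs whose source side contains the keyword."""
    keyword = keyword.lower()
    return [(src, tgt) for src, tgt in pairs if keyword in src.lower()]

hits = aligned_hits(ALIGNED, "start")
for src, tgt in hits:
    print(src, "->", tgt)
```

Because each source segment carries its translation with it, a search on one language immediately shows how the item was rendered in the other.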
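– A collection of electronic texts gathered according to defined sampling criteria and able to represent a language, or a variety or genre of a language.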
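Corpus Linguistics (语料库语言学): grounded in large amounts of authentic language data, it draws its conclusions from systematic and exhaustive observation and generalization of corpora, mainly by means of probabilistic and statistical methods. In essence, it is empirical.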
Comparable (reference, 参照) corpus: a corpus used for comparison of different (types of) languages. Comparable corpora often follow the same composition pattern. If comparable corpora are annotated, annotation schemes for the corpora are often similar.
Monolingual (单语) corpus: a corpus which contains texts in a single language.
Multilingual (多语) corpus: a corpus made up of small collections of individual monolingual corpora (or subcorpora), which use the same or similar sampling procedures and categories for each language but contain completely different texts in those several languages.
What are the most frequent words and phrases in English?
Which tenses do people use most often?
What prepositions follow particular verbs?
How do people use words like can, may, and might?
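Questions like these translate fairly directly into counting tasks. The following is a minimal sketch over a hypothetical toy sample; the preposition list, the verb forms, and the helper names are assumptions made for the example, and a real study would use a large corpus and proper tokenization.

```python
import re
from collections import Counter

# Hypothetical toy sample, far too small for real conclusions.
SAMPLE = (
    "You can rely on me. It depends on the weather. "
    "She may rely on her notes. We might look at it later. "
    "They look at the data and rely on it."
)
PREPOSITIONS = {"on", "at", "in", "to", "for", "with", "of"}
MODALS = {"can", "may", "might"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def following_prepositions(tokens, verb_forms):
    """Count prepositions that immediately follow any of the verb forms."""
    return Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:])
        if cur in verb_forms and nxt in PREPOSITIONS
    )

tokens = tokenize(SAMPLE)
top_words = Counter(tokens).most_common(3)      # frequency list
rely_preps = following_prepositions(tokens, {"rely", "relies", "relied"})
look_preps = following_prepositions(tokens, {"look", "looks", "looked"})
modal_counts = Counter(t for t in tokens if t in MODALS)

print(top_words)
print(rely_preps, look_preps)
print(modal_counts)
```

Counting forms shows how often can, may, and might occur, but not how they are used; for that, the matching lines would need to be read in context, as in a concordance.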