Corpus linguistics: types of corpora
Synchronic vs. diachronic corpora
• Synchronic corpora (共时语料库): materials from one specific period of time.
• Diachronic corpora (历时语料库): materials spanning a longer period of time.
Types of corpora
• General vs. specialized corpora
• Written vs. spoken corpora
• Synchronic vs. diachronic corpora
• Monolingual vs. multilingual corpora
• Comparable vs. parallel corpora
• Native vs. learner corpora
• Sample vs. monitor corpora
• Raw vs. annotated corpora
• …
Monolingual vs. multilingual corpora
• Monolingual corpora (单语语料库): texts in one language.
• Multilingual corpora (多语语料库): texts in several different languages.
Comparable vs. parallel corpora
• Comparable corpora (可比语料库): texts in two or more languages that are similar in genre, topic, register, etc., but do not contain the same content.
• Parallel corpora (平行语料库), also called translation corpora (翻译语料库): original texts in one language together with their translations into one or more other languages. They are used to explore how the same content is expressed in two languages.
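The idea of a parallel corpus can be made concrete with a small sketch. It assumes the corpus is stored as sentence-aligned pairs. The `AlignedPair` class, the `find_renderings` helper and the three sentence pairs are all invented for illustration and are not taken from any real corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One original sentence and its translation (hypothetical structure)."""
    source: str
    target: str


def find_renderings(pairs, term):
    """Return the target-side sentences whose source side contains `term`.

    Plain case-insensitive substring matching; a real study would use
    tokenization and word-level alignment instead.
    """
    term = term.lower()
    return [p.target for p in pairs if term in p.source.lower()]


# Invented English-Chinese sentence pairs, for illustration only.
parallel = [
    AlignedPair("This corpus is large.", "这个语料库很大。"),
    AlignedPair("We annotate the corpus.", "我们标注语料库。"),
    AlignedPair("Language changes over time.", "语言随时间变化。"),
]

for rendering in find_renderings(parallel, "corpus"):
    print(rendering)
```

A comparable corpus could not be queried this way, because its texts share genre and topic but not content, so there are no aligned pairs to look up.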
Sample vs. monitor corpora
• Sample corpora (样本语料库): as opposed to a monitor corpus, a sample corpus is of finite size and consists of text segments selected to provide a static picture of the language.
• Monitor corpora (监控语料库): open-ended and regularly updated, so that they can be used to monitor language change.
General vs. specialized corpora
• General corpora (通用语料库) or reference corpora (参考语料库): a wide coverage of different text categories or registers, representing language for general purposes. They are usually very large (millions of words), e.g. the British National Corpus (BNC) and the Bank of English (BOE).
• Specialized corpora (专用语料库): texts from a particular variety of a language, e.g. from a particular dialect or from a particular subject area.
Raw vs. annotated corpora
• Raw corpora (生语料库): plain text in its raw state, without annotation.
• Annotated corpora (标注语料库): some external information is added to the corpus, e.g. information identifying the origin and nature of the text; tagging to show the word class of each word; parsing to show the sentence structure and the function of different elements in a sentence.
One specific example: "gives" is a third person singular present tense verb. In an annotated corpus, the form "gives" may appear as "gives_VVZ", where VVZ marks it as a third person singular present tense (Z) form of a lexical verb (VV). Such annotation makes it quicker and easier to retrieve and analyze information about the language contained in the corpus.
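The retrieval benefit of tagging can be sketched in a few lines. The sketch assumes tokens are written as word_TAG, as in "gives_VVZ" above. The sample sentence, the function names, and every tag other than VVZ are illustrative assumptions, not taken from a real corpus.

```python
from collections import defaultdict


def parse_tagged(text, sep="_"):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains "_" still keeps its tag.
        word, _, tag = token.rpartition(sep)
        if not word:  # no usable separator: treat the token as untagged
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


def index_by_tag(pairs):
    """Group word forms by tag so that they can be retrieved directly."""
    index = defaultdict(list)
    for word, tag in pairs:
        index[tag].append(word)
    return index


# Invented sample; only the VVZ tag comes from the example above.
sample = "She_PPHS1 gives_VVZ lectures_NN2 and_CC writes_VVZ books_NN2"
index = index_by_tag(parse_tagged(sample))
print(index["VVZ"])  # ['gives', 'writes']
```

In a raw corpus the same query would need some way to tell the verb "gives" apart from every other word ending in -s; the tag makes that distinction explicit.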
Written vs. spoken corpora
• Written corpora (笔语语料库): contain only written materials (the more common type).
• Spoken corpora (口语语料库): contain transcribed texts of spoken language (less common).
Native vs. learner corpora
• Native-speaker corpora (本族语语料库): texts from native speakers.
• Learner corpora (学习者语料库): texts from language learners.
Sample vs. monitor corpora