Corpus Linguistics
Metadata
Definition: “data about data”
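Importance: metadata are critical to a corpus because they help it meet the standards of representativeness, balance and homogeneity.
Corpus linguistics is mainly concerned with the collection, storage, retrieval, statistical analysis, grammatical tagging, and syntactic and semantic analysis of machine-readable natural-language texts.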
Types of Corpora
Specialised corpus: texts that belong to a particular type, e.g. academic prose.
General corpus: different types of texts assembled to serve as a reference resource for linguistic research or to produce reference materials such as dictionaries.
4. Extraction of multiword units or clusters of items in a text.
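Extraction of multiword units can be sketched as counting recurring n-word sequences. The snippet below is a minimal illustration, not the method of any particular corpus tool; the function name `clusters`, its parameters and the sample sentence are assumptions made for this example.

```python
from collections import Counter

def clusters(tokens, n=2, min_freq=2):
    """Return n-word sequences that occur at least `min_freq` times."""
    # Slide a window of length n over the token list.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in ngrams)
    # most_common() yields the sequences from most to least frequent.
    return [(gram, freq) for gram, freq in counts.most_common() if freq >= min_freq]

tokens = "you know what i mean you know".split()
print(clusters(tokens))  # [('you know', 2)]
```

Raising `n` to 3 or 4 returns longer clusters, and `min_freq` filters out sequences that occur only once.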
Chapter II: Analyzing Corpus Data
Word Lists
Definition: a list of the words or phrases in a text, ordered by their frequency of occurrence.
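A word list of this kind can be sketched in a few lines. The tokenisation rule (lowercased alphabetic strings, with an optional internal apostrophe) and the sample sentence are assumptions for illustration. Real corpus tools apply more careful tokenisation.

```python
import re
from collections import Counter

def word_list(text):
    """Return (word, frequency) pairs, most frequent first."""
    # Lowercase the text and keep alphabetic runs as tokens.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

sample = "The cat sat on the mat. The dog sat too."
for word, freq in word_list(sample):
    print(f"{word}\t{freq}")
```

In this sample, "the" heads the list with three occurrences, followed by "sat" with two.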
Lemma (headword form)
SAY: say, says, said, saying
Applications in ELT
Historical corpora: texts from different periods of time, which allow the study of language change when compared with corpora from other periods.
Monitor corpora: focus on current changes in the language.
Parallel corpora: texts in at least two languages that have either been directly translated or been produced in different languages for the same purpose.
Editorial metadata: providing information about the relationship between corpus components and their original source.
Analytic metadata: providing information about the way in which corpus components have been interpreted and analysed.
Descriptive metadata: providing classificatory information derived from internal or external properties of the corpus components.
Administrative metadata: providing documentary information about the corpus itself, such as its title, its availability, its revision status, etc.
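Definition: concordancing software is used to search a corpus for instances of a word or phrase, and all matching instances are then listed together with their context.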
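The search-and-list procedure described here is often displayed as a keyword-in-context (KWIC) view. The following is a minimal sketch; the function name `concordance`, the `width` parameter and the sample text are assumptions for illustration and do not reproduce any specific concordancer.

```python
import re

def concordance(text, keyword, width=30):
    """Return one keyword-in-context line for every match of `keyword`."""
    lines = []
    # \b restricts matches to whole words, so "say" does not match "saying".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword forms a central column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

text = "I say what I mean. They say it often."
for line in concordance(text, "say", width=10):
    print(line)
```

Aligning the search word in a central column makes recurring patterns to its left and right easier to spot.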
COCA
The Concordance output
Collocation (habitual word combinations)
Colligation
Semantic preference
Semantic prosody
Learner corpora: texts produced by learners of a language.
History of corpus design
A distinction is made between two periods: the 1950s-1970s, and the 1980s onwards.
1950s-1970s:
1) London-Lund Corpus of Spoken English (LLC)
2) Brown Corpus, based on written American English
3) Lancaster-Oslo/Bergen Corpus, based on written British English
Focus of Corpora
The corpora above mainly focus on the collection of general English in use.
Specialised corpora represent a particular mode of discourse, e.g.:
1) Bergen Corpus of London Teenage Language (COLT)
Academic discourse is prominent among specialised corpora, e.g.:
2) Michigan Corpus of Academic Spoken English (MICASE)
3) British Academic Spoken English corpus (BASE)
Another category of corpora captures the language use of language learners, e.g.:
1) Cambridge Learner Corpus
2) Longman Learners’ Corpus
3) International Corpus of Learner English (ICLE)
4) Vienna-Oxford International Corpus of English (VOICE)
5) English as a Lingua Franca in Academic Settings (ELFA)
Chapter I: Introduction
What is a corpus?
Formal: a large number of articles, books, magazines, etc. that have been deliberately collected together for some purpose.
Corpus Linguistics
Presented by: Song Chao Wang Zeyu Li Zhanyu
Outline
Chapter I: Introduction
Chapter II: Analyzing Corpus Data
Chapter III: Current Issues in Corpus Linguistics
Corpus linguistics: tools and methods
Functionalities of corpus data:
1. Generation of frequency counts according to specified criteria;
2. Comparisons of frequency information in different texts;
3. Different formats of concordance outputs;
SAY: word forms and frequencies
1. say: 20
2. says: 15
3. said: 9
4. saying: 2
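The frequency of the lemma SAY is the sum of the frequencies of its word forms (20, 15, 9 and 2 in the figures given here). A minimal sketch of that grouping follows. The lookup table `LEMMA_OF` is a hypothetical hand-made mapping for this example, whereas real lemmatisers rely on much larger dictionaries or taggers.

```python
from collections import Counter

# Hypothetical lookup table mapping each word form to its lemma.
LEMMA_OF = {"say": "SAY", "says": "SAY", "said": "SAY", "saying": "SAY"}

def lemma_frequencies(form_counts, lemma_of=LEMMA_OF):
    """Sum word-form frequencies under their lemma."""
    totals = Counter()
    for form, freq in form_counts.items():
        # Forms missing from the table are counted under themselves.
        totals[lemma_of.get(form, form)] += freq
    return totals

form_counts = {"say": 20, "says": 15, "said": 9, "saying": 2}
print(lemma_frequencies(form_counts))  # Counter({'SAY': 46})
```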
Keywords and Key sequences
Comparing; Frequency; Extracting
Reference corpus
A transcript of a medical consultation (spoken)
vs. solely written texts
Telephone health advice service vs. CANCODE (a five-million-word corpus of casual conversation)
Medical dialogue (telephone health advice) vs. CANCODE (everyday English conversation)
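Keyword extraction compares each word's frequency in the target text with its frequency in a reference corpus. The slides do not name a statistic, so the sketch below assumes the log-likelihood measure commonly used for keyness. The function names and the two toy token lists are invented for illustration and are not taken from CANCODE or from any health advice data.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in c tokens versus b times in d tokens."""
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

def keywords(target_tokens, reference_tokens, top=5):
    """Rank words that are relatively more frequent in the target text."""
    # Both token lists are assumed to be non-empty.
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference[word]
        if a / c > b / d:  # keep only words over-represented in the target
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: -pair[1])[:top]

consultation = "the pain is in the chest and the pain is sharp".split()
casual = "the weather is nice and the tea is hot today".split()
print(keywords(consultation, casual))
```

With these toy lists, "pain" ranks first because it occurs twice in the consultation sample and never in the reference sample.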
The Concordance output
Technical: a large collection of written or spoken language that is used for studying the language.
What is corpus linguistics?
• Corpus linguistics: the study of machine-readable spoken and written language samples that have been assembled in a principled way for the purpose of linguistic research. It is concerned with language use in real contexts.
1980s onwards:
1) Collins and Birmingham University International Language Database (COBUILD) ← Bank of English
2) British National Corpus (BNC)
(COBUILD and the BNC are two major corpora.)
Many publishing houses developed their own corpora:
1) Cambridge International Corpus (CIC)
2) Longman Corpus Network
3) Oxford English Corpus
Another large corpus project: International Corpus of English (ICE)
Recently:
1) American National Corpus (ANC)
2) Corpus of Contemporary American English (COCA)
Categories:
1. Editorial metadata
2. Analytic metadata
3. Descriptive metadata
4. Administrative metadata
Categories of Metadata
Collocation (e.g. I and am)
“Collocation refers to the habitual cooccurrence of words and will be discussed in more detail below. ”
A term used to refer to the combination of words that have a certain mutual expectancy, i.e. words that regularly keep company with certain other words. When a collocation occurs more frequently than chance would predict, it is called a significant collocation.
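One way to check whether a pair occurs "more frequently than chance" is to compare its observed frequency with the frequency expected if the two words were independent. The sketch below uses the mutual information (MI) score for adjacent word pairs. This is one common choice among several association measures, and the function name and sample sentence are assumptions for illustration.

```python
import math
from collections import Counter

def mi_score(tokens, node, collocate):
    """Mutual information for `collocate` immediately following `node`."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    observed = bigrams[(node, collocate)]
    if observed == 0:
        return float("-inf")  # the pair never occurs
    # Expected co-occurrences if the two words were independent.
    expected = unigrams[node] * unigrams[collocate] / n
    return math.log2(observed / expected)

tokens = "i am here and i am happy but you are not".split()
print(mi_score(tokens, "i", "am"))  # positive: more frequent than chance
```

A positive score means the pair occurs more often than independence would predict. In practice a threshold and a minimum frequency are usually applied before a pair is treated as a significant collocation.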