Hu Zhuanglin, Linguistics: A Course Book, Chapter 10 slides (PPT)

Corpus Linguistics


Corpus (plural corpora): a collection of linguistic data, compiled either as written texts or as a transcription of recorded speech. The main purpose of a corpus is to verify a hypothesis about language, for example, to determine how the usage of a particular sound, word, or syntactic construction varies.


There are also problems of practicality with corpus linguistics. How can one imagine searching through an 11-million-word corpus using nothing more than one's eyes? Despite the criticisms, corpus linguistics continued to develop, especially once the computer gradually became its mainstay.
3.1 Corpus Linguistics


Corpus linguistics deals with the principles and practice of using corpora in language study. A computer corpus is a large body of machine-readable texts.
Chapter Ten Language and the Computer
Corpus Linguistics



Definition
Criticisms and the revival of corpus linguistics
Concordance
Text encoding and annotation
The roles of corpus data
Criticisms and the revival of corpus linguistics

Chomsky turned linguistics away from empiricism and toward rationalism.
1. The corpus could never be a useful tool for the linguist, as the linguist must seek to model language competence rather than performance.
The roles of corpus data



Speech research
Lexical studies
Semantics
Sociolinguistics
Psycholinguistics
Speech research

A spoken corpus provides a broad sample of speech, extending over a wide selection of variables such as speaker gender, speaker age, speech class, genre, etc. Because the corpus is as wide and as representative as possible, generalizations can be made about spoken language. It also allows variation within a given spoken language to be studied, and it provides a sample of naturalistic speech rather than speech elicited under artificial conditions.
Concordance
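A computer can search a text for a particular word, a sequence of words, or even a part of speech. It can also retrieve every instance of a word and count how many times the word occurs, thereby gathering information about the word's frequency. The data can then be sorted in some way.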
[Figure: concordance of "poor" in A Tale of Two Cities, Book 1]
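
The search, retrieval, counting, and sorting steps described above can be sketched in a few lines of Python. This is only an illustration, not the concordancing program behind the slide: the function names `tokenize` and `concordance`, the `width` parameter, and the sample sentence are all assumptions made up for this example (the sample is not a quotation from Dickens).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every hit of `keyword`.

    `width` is the number of tokens kept on each side of the hit.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text, used only to exercise the functions.
sample = "The poor man gave the poor child a coin. Poor as he was, he gave."
tokens = tokenize(sample)

freq = Counter(tokens)                      # frequency of every word form
hits = concordance(tokens, "poor", width=2)  # every instance, with context
by_right = sorted(hits, key=lambda h: h[2])  # sort hits by right-hand context

for left, kw, right in by_right:
    print(f"{left:>12} [{kw}] {right}")
print("occurrences of 'poor':", freq["poor"])
```

Sorting by the right-hand context is one possible choice; sorting by the left-hand context or by position in the text would serve equally well, depending on which pattern is being studied.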
Text encoding and annotation

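The form "gives" carries implicit part-of-speech information, namely "third person singular present tense verb". In normal reading we can retrieve this only by drawing on our pre-existing knowledge of English grammar. In an annotated corpus, however, the form "gives" might appear as "gives-VVZ", where the code "VVZ" indicates that it is the third person singular present tense (Z) form of a lexical verb (VV). Annotation of this kind makes it quicker and easier to retrieve and analyse information about the language contained in the corpus.
Leech (1993) describes seven maxims that should apply to the annotation of text corpora.
1. It should be possible to remove the annotation from an annotated corpus and revert to the raw corpus.
2. It should be possible to extract the annotations by themselves from the text.
3. The annotation scheme should be based on guidelines that are available to the end user.
4. It should be made clear how and by whom the annotation was carried out.
5. The end user should be made aware that the corpus annotation is not infallible, but simply a potentially useful tool.
6. Annotation schemes should be based as far as possible on widely agreed and theory-neutral analyses.
7. No annotation scheme has the a priori right to be considered the standard.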
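
Maxims 1 and 2, together with the "gives-VVZ" example, can be illustrated with a small Python sketch. It assumes a simple `word-TAG` format with a hyphen as separator, as in the example above; the helper names and every tag other than VVZ are illustrative assumptions in the style of a CLAWS tagset, not part of the slides.

```python
def split_token(token, sep="-"):
    """Split 'word-TAG' at the LAST separator; untagged tokens get tag None.

    Limitation of this sketch: an untagged hyphenated word such as
    'well-known' would be wrongly split, so every token is assumed tagged.
    """
    if sep not in token:
        return token, None
    word, _, tag = token.rpartition(sep)
    return word, tag

def strip_annotation(text, sep="-"):
    """Maxim 1: remove the annotation and recover the raw text."""
    return " ".join(split_token(tok, sep)[0] for tok in text.split())

def extract_tags(text, sep="-"):
    """Maxim 2: extract the annotations by themselves."""
    return [split_token(tok, sep)[1] for tok in text.split()]

def words_with_tag(text, tag, sep="-"):
    """Retrieve every word form that carries a given tag."""
    pairs = (split_token(tok, sep) for tok in text.split())
    return [word for word, t in pairs if t == tag]

# Illustrative annotated sentence; only 'gives-VVZ' comes from the text above.
annotated = "He-PPHS1 gives-VVZ Tony-NP1 books-NN2"

raw = strip_annotation(annotated)
tags = extract_tags(annotated)
verbs = words_with_tag(annotated, "VVZ")

print(raw)
print(tags)
print(verbs)
```

Keeping the tag attached to the word with a fixed separator is what makes both operations trivial here; a scheme that mixed annotation into the text irreversibly would violate the first two maxims.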

(a) * He shines Tony books.
(b) He gives Tony books.
(c) He lends Tony books.
(d) He owes Tony books.
How can ungrammatical utterances be distinguished from ones that simply have not occurred? If the corpus does not contain sentence (a), how do we conclude that it is ungrammatical while the other sentences are grammatical?
Sociolinguistics

Although sociolinguistics is an empirical field of research, its data are often not rigorously sampled. Sometimes the data are also elicited rather than naturalistic. A corpus can provide a representative sample of naturalistic data which can be quantified.



2. The only way to account for the grammar of a language is by description of its rules, rather than by enumeration of its sentences. It is the syntactic rules that are finite.
3. Even if language is a finite construct, corpus methodology is not the best method to study language.
Lexical studies

A linguist who has access to a corpus can call up all the examples of a word or phrase from many millions of words of texts in a few seconds. Dictionaries can be produced and revised much more quickly than before, thus providing up-to-date information about language. Also, definitions can be more complete and precise since a large number of natural examples are examined.
Psycholinguistics

In the field of psycholinguistics, sampled corpora can provide psycholinguists with more concrete and reliable information about frequency, including the frequencies of different senses and parts of speech of ambiguous words. Next, corpus data can be used to examine the occurrence of speech errors in natural conversation. A third role for corpora lies in the analysis of language pathologies, where an accurate picture of abnormal data must be constructed before it is possible to hypothesize and test what may be wrong with the human language processing system.
Semantics
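
As a rough illustration of the first point, the frequency of each part of speech for an ambiguous word can be counted from a tagged corpus. The sketch below is an assumption-laden toy: the `tagged` list is invented data, the tags (NN1, JJ, VV0) are illustrative CLAWS-style labels, and `tag_frequencies` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Invented (word, tag) pairs standing in for a tagged corpus.
tagged = [
    ("light", "NN1"), ("light", "JJ"), ("light", "NN1"),
    ("book", "NN1"), ("light", "VV0"), ("book", "VV0"),
    ("light", "NN1"),
]

def tag_frequencies(pairs):
    """Build a table mapping each word form to a Counter of its tags."""
    table = defaultdict(Counter)
    for word, tag in pairs:
        table[word.lower()][tag] += 1
    return table

table = tag_frequencies(tagged)
total = sum(table["light"].values())

# Report how often 'light' occurs with each part of speech.
for tag, count in table["light"].most_common():
    print(f"light as {tag}: {count} of {total} ({count / total:.0%})")
```

Counting different senses of a word would need sense annotation rather than part-of-speech tags, which this toy data does not include.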

Corpus linguistics contributes to semantics by helping to establish an approach which is objective. Semantic distinctions are associated in texts with characteristic observable contexts (syntactic, morphological and prosodic), so by considering the environment of the linguistic entities, an empirical, objective indicator for a particular semantic distinction can be arrived at. Another role of corpora in semantics has been in establishing more firmly the notions of fuzzy categories and gradience. When natural language is examined empirically in corpora, clear-cut boundaries do not exist; instead there are gradients of membership which are connected with frequency of inclusion.

CORPUS: 13th century, from the Latin word corpus, meaning "body"; the plural is usually corpora. (1) A collection of texts, especially one that is complete and self-contained, e.g. the corpus of Anglo-Saxon verse. (2) Plural also corpuses. In linguistics and lexicography, a body of texts, utterances, or other specimens, usually stored as an electronic database. Computer corpora may store many millions of running words, whose features can be analysed by means of tagging (adding identifying and classifying tags to words and other formations) and by the use of concordancing programs.
Corpus linguistics: the study of the data in any such corpus.