Common Corpus Terms Explained (1)
Frank Liang
Common Terms in Corpus Linguistics
Monolingual (单语) corpus: a corpus which contains texts in a single language.
Multilingual (多语) corpus: a corpus which represents small collections of individual monolingual corpora (or subcorpora) in the sense that they use the same or similar sampling procedures and categories for each language but contain completely different texts in those several languages.
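TTR is a common measure of the lexical variety of a text (often loosely called lexical density), and it can help indicate a text's lexical difficulty.
However, function words (such as the, a, of) recur constantly in any text. Each word added to a text adds one token, but does not necessarily add a type. As a result, the longer the text, the more often function words repeat and the lower the TTR. Raw TTR is therefore not a sound measure for comparing texts of different lengths.
The corpus approach is based on how language is actually used: facts speak louder than words.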
A corpus can be analyzed using software tools, much like those used to find key words on the Internet, but with greater sophistication. By evaluating the results of these searches, it is possible to see how language is really used, and to find answers to questions like these:
Concordance: a list of the occurrences of a particular word or sequence of words, each shown in its context. The concordance is at the centre of corpus linguistics, because it gives access to many important language patterns in texts. Concordances of major works such as the Bible and Shakespeare have been available for many years. The computer has made concordances easy to compile. (concordancer, 索引软件; concordance lines, 索引行)
– Positive keywords and negative keywords
Concordance (索引), also called Key Word In Context (KWIC): using concordancing software to search a corpus for instances of a word or phrase, and then listing every matching instance together with its context.
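The KWIC idea can be sketched in a few lines of Python. This is only an illustration, not how any particular concordancer works: the function name `kwic`, the regular-expression tokenizer, and the default window of four words on either side are all assumptions made for the example.

```python
import re

def kwic(text, keyword, window=4):
    """Return one concordance line per occurrence of `keyword` in `text`.

    `window` is the number of context words shown on each side
    (an illustrative default, not a standard value).
    """
    # Naive tokenizer (assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # Right-align the left context so the keyword lines up in a column.
            lines.append(f"{left:>30}  {token}  {right}")
    return lines

if __name__ == "__main__":
    sample = "Rose is a rose is a rose is a rose."
    for line in kwic(sample, "rose", window=2):
        print(line)
```

Aligning the keyword in a central column is what makes recurring patterns to its left and right easy to spot when reading down the list.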
Standardized type/token ratio (STTR, 标准化类符/形符比)
For example, compute the TTR for every 1,000 words of each text, then average these values to obtain the STTR.
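The STTR calculation described here can be sketched as follows. The function names are hypothetical, and dropping a trailing chunk shorter than `chunk_size` is an assumption of this sketch; tools may handle the remainder differently.

```python
def ttr(tokens):
    """Type/token ratio of a token list, expressed as a percentage."""
    return 100 * len(set(tokens)) / len(tokens)

def sttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive chunks of `chunk_size` tokens.

    Assumption: a final chunk shorter than `chunk_size` is ignored.
    """
    chunks = [
        tokens[start:start + chunk_size]
        for start in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not chunks:
        raise ValueError("text is shorter than one chunk")
    return sum(ttr(chunk) for chunk in chunks) / len(chunks)

if __name__ == "__main__":
    # Toy token list with a tiny chunk size, purely for illustration.
    toy = ["a", "b", "c", "a", "a", "a", "b", "b"]
    print(ttr(toy))        # TTR of the whole list
    print(sttr(toy, 4))    # mean of the two 4-token chunk TTRs
```

Because every chunk has the same length, the averaged value does not fall simply because a text is longer, which is the weakness of raw TTR.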
Frequencies/occurrences (频数, raw counts) and frequency (频率, a normalized rate): for example, the number of occurrences of a word per million or per hundred thousand words. The frequencies of a word in two corpora are often compared, with reference to the sizes of the two corpora, using a chi-square test or log-likelihood, to determine whether the word is used differently in the two corpora.
Corpus (语料库; the Latin word literally means "body"):
– (pl. corpora or corpuses): a collection of text, now usually in machine-readable form and compiled to be representative of a particular kind of language and often provided with some kind of annotation (标注).
General corpus (通用语料库):
Token (形符): an individual word.
Type (类符): a distinct word form. The number of types is the number of tokens counted without repetition. "I see a cat and a dog" contains seven tokens but only six types (the type 'a' occurs twice).
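As a rough sketch of this comparison, the code below normalizes a raw count to a per-million rate and computes a log-likelihood statistic for one word in two corpora. It assumes the two-corpus log-likelihood formulation widely used in corpus linguistics (expected counts proportional to corpus size); the function names and the counts in the usage example are hypothetical, not taken from any real corpus.

```python
import math

def per_million(count, corpus_size):
    """Normalized frequency: occurrences per million words."""
    return count / corpus_size * 1_000_000

def log_likelihood(count1, size1, count2, size2):
    """Log-likelihood statistic for one word observed in two corpora.

    Expected counts assume the word is equally likely in both corpora,
    so they are proportional to each corpus's size.
    """
    total = count1 + count2
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    for observed, expected in ((count1, expected1), (count2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

if __name__ == "__main__":
    # Hypothetical counts: 100 hits in corpus A, 10 hits in corpus B,
    # each corpus containing 10,000 words.
    print(per_million(100, 10_000), per_million(10, 10_000))
    print(log_likelihood(100, 10_000, 10, 10_000))
```

A value above 3.84 corresponds to p < 0.05 with one degree of freedom, the same critical value used for a chi-square test on a 2×2 table.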
Keywords (关键词)
– Keywords are words whose normalized frequency in one corpus (observed corpus) is significantly higher or lower than that in another comparable corpus (reference corpus).
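A minimal sketch of keyword extraction under this definition follows. It scores each word with a signed log-likelihood value, so that positive keywords (unusually frequent in the observed corpus) and negative keywords (unusually infrequent) can be separated. The function names, the choice of log-likelihood as the significance test, and the threshold of 3.84 (p < 0.05, one degree of freedom) are assumptions of this example; the toy corpora are invented.

```python
import math
from collections import Counter

def keyness(count_obs, size_obs, count_ref, size_ref):
    """Signed log-likelihood: positive when the word is relatively
    more frequent in the observed corpus, negative otherwise."""
    total = count_obs + count_ref
    ll = 0.0
    for observed, size in ((count_obs, size_obs), (count_ref, size_ref)):
        expected = size * total / (size_obs + size_ref)
        if observed > 0:
            ll += observed * math.log(observed / expected)
    ll *= 2
    more_frequent = count_obs / size_obs >= count_ref / size_ref
    return ll if more_frequent else -ll

def keywords(observed_tokens, reference_tokens, threshold=3.84):
    """Return (positive, negative) keyword lists, strongest first."""
    obs, ref = Counter(observed_tokens), Counter(reference_tokens)
    n_obs, n_ref = len(observed_tokens), len(reference_tokens)
    scores = {
        word: keyness(obs[word], n_obs, ref[word], n_ref)
        for word in set(obs) | set(ref)
    }
    positive = sorted((w for w, s in scores.items() if s >= threshold),
                      key=lambda w: -scores[w])
    negative = sorted((w for w, s in scores.items() if s <= -threshold),
                      key=lambda w: scores[w])
    return positive, negative

if __name__ == "__main__":
    # Invented toy corpora, for illustration only.
    observed = ["corpus"] * 50 + ["the"] * 50
    reference = ["the"] * 90 + ["garden"] * 10
    print(keywords(observed, reference))
```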
Comparable (reference, 参照) corpus: a corpus used for comparison of different (types of) languages. Comparable corpora often follow the same composition pattern. If comparable corpora are annotated, annotation schemes for the corpora are often similar.
Interpreting concordance lines can be a d…
… concordancing software such as WordSmith Tools
Concordance
1 instructed Shirl Winter to compose a note of thanks to be posted on the call board. Bake w
The sentence "Rose is a rose is a rose is a rose." was written by Gertrude Stein as part of the 1913 poem "Sacred Emily".
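Type/token ratio (TTR, 类符/形符比, also 形次比): TTR of the Rose sentence: 4/10 × 100 = 40 (10 tokens and 4 types, counting "Rose" and "rose" as distinct types).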
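The arithmetic can be checked with a short sketch. The function names and the letters-only tokenizer are assumptions of this example. Treating "Rose" and "rose" as different types reproduces the figure of 40; folding case would give three types instead.

```python
import re

def tokenize(text):
    """Naive tokenizer (assumption): each run of letters is one token."""
    return re.findall(r"[A-Za-z]+", text)

def type_token_ratio(tokens, ignore_case=False):
    """Types divided by tokens, expressed as a percentage."""
    if ignore_case:
        tokens = [token.lower() for token in tokens]
    return 100 * len(set(tokens)) / len(tokens)

if __name__ == "__main__":
    rose = tokenize("Rose is a rose is a rose is a rose.")
    print(len(rose), sorted(set(rose)))
    print(type_token_ratio(rose))                    # case-sensitive
    print(type_token_ratio(rose, ignore_case=True))  # case-insensitive
```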
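Why build a corpus? Why use corpus methods to study language and apply them to language learning?
Example: start or begin? Which is more common in spoken language?
Our teachers often say things like "Let's begin!", don't they?
Yet searches of corpora such as the BNC have found that start is more common in spoken language.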
Types of corpora
Annotated (标注) corpus, or tagged corpus: a corpus enhanced with various types of linguistic information. An annotated corpus may be considered to be a repository of linguistic information, because the information which was implicit in the plain text has been made explicit through concrete annotation ("added value", 附加值).
Special corpus (专用语料库): a corpus assembled for a specific purpose. Special corpora vary in size and composition according to that purpose. They are not balanced (except within the scope of their given purpose) and, if used for other purposes, give a distorted view of the language segment. Their main advantage is that the texts can be selected so that the phenomena under study occur much more frequently than in a balanced corpus. A corpus enriched in this way can be much smaller than a balanced corpus providing the same data.
How many words must a learner know in order to participate in everyday conversation?
Materials developed with a corpus can therefore be more authentic and can illustrate language as it is really used.
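– A collection of electronic texts gathered according to explicit sampling criteria and able to represent a language, or a variety or genre of a language.
Corpus linguistics (语料库语言学): grounded in large amounts of authentic language data, it draws conclusions through systematic and exhaustive observation and generalization over a corpus, mainly using probabilistic and statistical methods. It is empirical in nature.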
Parallel (平行), or aligned, corpus: a multilingual corpus where texts in one language and their translations into other languages are aligned, sentence by sentence, preferably phrase by phrase.
What are the most frequent words and phrases in English?
Which tenses do people use most often?
What prepositions follow particular verbs?
How do people use words like can, may, and might?
The computer-generated concordances can be very flexible; the context of a word can be selected on various criteria (for example counting the words on either side).