Corpus Introduction, Section 1: Corpora


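The Development of Corpora
CORPUS LINGUISTICS
0.2 History and current state
The history of corpus linguistics falls roughly into two periods:
the period before computerization, which may be called the era of traditional corpora, and the period after computerization, which may be called the era of modern corpora.
1950s: the influence of Chomsky; first generation (1970s-1980s); second generation (1980s-1990s); third generation (1990s); fourth generation (21st century)?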
0.2.2 Computerized corpora (modern corpora)
Second-generation corpora
Longman Corpus Network
A commercial corpus built in the 1980s, made up of three large corpora:
the Longman/Lancaster English Language Corpus (LLELC), the Longman Spoken Corpus (LSC), and the Longman Corpus of Learners' English (LCLE)
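0.1 Definition of corpus linguistics
A corpus (plural: corpora) is, as the name suggests, a repository (or database) of language material. Corpus linguistics is an approach to language study that is based on corpora, and the term has two senses:
• Using a corpus to study some aspect of language. In this sense "corpus linguistics" is not the name of a new discipline; it simply reflects a new research method.
• Using the linguistic facts revealed by a corpus to critique existing linguistic theories and to propose new views or theories.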
0.2.2 Computerized corpora (modern corpora)
Second-generation corpora
COBUILD corpus (Collins Birmingham University International Language Database); British National Corpus; International Corpus of English

An Introduction to Using the Corpus of Contemporary American English (COCA)

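• 2.3 Searching for frequency within sub-corpora, or comparing between them (usage across registers)
• For example, the nouns that can follow passionate in the Fiction and Newspaper sub-corpora, with their frequencies, are shown in Figures 2.3-1 and 2.3-2 respectively.
Figure 2.3-1
Figure 2.3-2
COCA main functions (3)
• The frequencies in the two sub-corpora can also be compared directly. To do so, select the two sub-corpora under sections 1 and 2, as in Figure 2.3-3:
• Example 1. Enter the word "mysterious" (Figure 2.1.1-1) to obtain the results (Figure 2.1.1-2): its frequency in each sub-corpus and its frequency per million words.
• Clicking one of the bars in Figure 2.1.1-2 displays the KWIC (keyword in context) lines, as in Figure 2.1.1-3 (taking the Fiction bar as an example):
Figure 2.1.1-1
Figure 2.1.1-2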
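The two outputs described above, a per-million frequency for each sub-corpus and a KWIC display, can be sketched in a few lines. This is only an illustration of the idea and not how COCA itself is implemented. The two "sub-corpora" are invented one-sentence texts, and the function names per_million and kwic are made up for this sketch.

```python
import re

def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

def kwic(tokens, node, width=4):
    """Return one keyword-in-context line per occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines

# Toy sub-corpora (invented text, NOT real COCA data).
sections = {
    "FICTION": "a mysterious stranger came to the old house at night",
    "NEWSPAPER": "officials gave no reason for the delay in the report",
}

freq = {}
for name, text in sections.items():
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = sum(1 for t in tokens if t.lower() == "mysterious")
    freq[name] = per_million(hits, len(tokens))

fiction_tokens = re.findall(r"[A-Za-z']+", sections["FICTION"])
fiction_lines = kwic(fiction_tokens, "mysterious")
for line in fiction_lines:
    print(line)
```

Normalizing to a per-million rate is what makes sub-corpora of different sizes comparable; raw counts alone would favour the larger section.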
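Using the CHART display
COCA main functions (2)
• For example: adjectives that occur before "smile" (Figure 2.2-2)
Rule: in WORDS enter smile.[n*], which means smile as a noun; in COLLOCATES enter [aj*], which matches contexts where an adjective occurs before or after it.
Adjectives used before confidence: Figure 2.2-3
COCA main functions (4)
• 2.4 Comparing semantic preferences
• 2.4.1 Comparing near-synonyms
• For example: the difference in the nouns that follow the near-synonymous adjectives hot and warm (Figure 2.4.1):
Figure 2.4.1
Rule: enter hot and warm in the two WORDS boxes, then enter [nn*] in the COLLOCATES box to match any noun that follows. The comparison can also be restricted to a particular sub-corpus.
POS LIST

verb base = base form of the verb
verb.INF = infinitive
verb MODAL = modal verb
verb 3SG = third-person singular
verb ED = past tense
verb EN = past participle
verb ING = present participle
verb.LEX = lexical verb
verb.[BE] = be (copular verb)
verb.[DO] = do
verb.[HAVE] = have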

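Corpora

3 Corpus design
Three aspects of a corpus. A. The language material itself
Attributes:
Size: million-word | ten-million-word | hundred-million-word | …
Domain: politics | economics | sports | psychology | …
Genre: literature | practical writing | news | …
Period: synchronic | diachronic
Register: written | spoken
Language: monolingual | bilingual | multilingual; bilingual parallel corpus | bilingual comparable corpus
Linguistic level: phonetics (syllable, prosody) | grammar (word, sentence, …)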
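Second-generation corpora
Ten-million-word scale; dictionary compilation; application-oriented
COBUILD corpus: built in the 1980s by the University of Birmingham in cooperation with the publisher Collins, with about 20 million word tokens. The Collins Cobuild dictionary (1987) based on this corpus was widely praised.
Longman corpus: built in the 1980s and comprising three corpora: LLELC (Longman/Lancaster English Language Corpus), LSC (Longman Spoken Corpus), and LCLE (Longman Corpus of Learners' English). Its aim was to compile learner's dictionaries of English for foreign learners, on a scale of 50 million word tokens.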
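Some markup symbols of the London-Lund Corpus of Spoken English
Symbol | Meaning
# | end of tone group
^ | onset
/ | rising nuclear tone
\ | falling nuclear tone
^ | rise-fall nuclear tone
retrieval tools | human-machine interface | data interfaces | …
Selecting the language material
Principles: quality; influence; random selection; high circulation; typicality; ease of acquisition; validity as a statistical sample; conformity to language norms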

Corpora

Background Information
The concept of a corpus
A corpus is a body of representative language material collected by random sampling according to definite linguistic principles; it is a sample of language.

The term usually refers to a large electronic text collection of a certain size compiled for linguistic research.

A corpus brings together samples of spoken and written language to represent a particular language or language variety, or it is a collection of texts that have been processed and annotated with linguistic information.

Classification of corpora
By the languages involved, corpora are divided into monolingual corpora, bilingual parallel corpora, and multilingual corpora. By subject matter, they are divided into general corpora and specialized corpora. By the source of the material, they are divided into spoken corpora and written corpora. By whether they are annotated, they are divided into raw corpora and annotated corpora.

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, and for checking occurrences or validating linguistic rules within a specific universe.

A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. One example of annotation is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.

Terminology:
双语或多语语料库: bilingual or multilingual corpus
机器翻译技术: machine translation technology
双语词典编纂技术: bilingual lexicography technique
跟踪研究工作: follow-up study
设计、采集、编码和管理: design, collection, coding and management

Translation Version:
Research on bilingual or multilingual corpora currently falls into three broad categories.
The first is research on alignment techniques for bilingual corpora. Scholars in China and abroad have proposed a variety of strategies and methods, and many programs and tools for aligning bilingual or multilingual corpora now exist.
The second is research on the applications of bilingual corpora. For example, bilingual corpora play a very important role in statistics-based machine translation, example-based machine translation, and bilingual lexicography.
The third is the design, collection, coding and management of bilingual corpora.

Zhejiang University, Xiao Zhonghua: Corpus Linguistics, slides 1-45 (PPT document)
– …but rarely a random collection of text
– Corpora “are generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type.” (Leech 1992)
Reading list
• Set text
– McEnery, A., Xiao, R. and Tono, Y. (2006) Corpus-Based Language Studies: An Advanced Resource Book. London: Routledge.
– Wynne, M. (2005) Developing Linguistic Corpora. Oxford: Oxbow Books. Available online at /creating/guides/linguistic-corpora
language studies.
Contents
1) Introducing corpus linguistics
2) Corpus design and types of corpora
3) Data capture and markup
4) Corpus annotation
5) Making statistical claims
6) Corpus analysis (1): concordance and wordlist
7) Corpus analysis (2): keyword analysis
8) Corpora in lexicographic and lexical studies
9) Corpora in grammatical studies
10) Corpora in diachronic studies
11) Corpora in language variation research
12) Corpora in sociolinguistic studies
13) Corpora in language education
14) Corpora in literary and stylistic studies
15) Corpora in critical discourse analysis
16) Corpora in contrastive and translation studies

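Corpora

Three basic points about corpora: a corpus stores language material that has actually occurred in real language use; a corpus is a basic resource that carries linguistic knowledge, with the computer as its medium; and authentic material must be processed (analyzed and handled) before it becomes a useful resource. In linguistics, a corpus is a large collection of texts. The texts in it (the language material) have usually been organized and carry a set format and markup, and the term refers specifically to a digitized corpus stored on a computer.

Corpora are the basic resource of corpus-linguistic research and the main resource of empirical approaches to language study.

They are applied in lexicography, language teaching, traditional linguistic research, and statistics-based or example-based research in natural language processing.

Classification
There are many types of corpora. The type is determined mainly by the purpose and use of the research, which is usually reflected in the principles and methods used to collect the material.

One proposal divides corpora into four types: (1) heterogeneous: no specific collection principle, with all kinds of material collected widely and stored as is; (2) homogeneous: only material of one kind of content is collected; (3) systematic: material is collected according to predetermined principles and proportions, so that it is balanced and systematic and can represent the linguistic facts within a given scope; (4) specialized: only material for one specific use is collected.

In addition, by the language of the material, corpora can be divided into monolingual, bilingual, and multilingual.

By the unit of collection, corpora can be divided into text-level, sentence-level, and phrase-level.

By how the material is organized, bilingual and multilingual corpora can further be divided into parallel (aligned) corpora and comparable corpora. In the former the texts are translations of one another, and they are used mostly in machine translation and bilingual dictionary compilation. The latter bring together texts in different languages that express the same content, and they are used mostly in contrastive language research.

A large number of corpora of various types have now accumulated, for example: a Portuguese treebank, Chinese and English news classification corpora for text-classification research, the Reuters text-classification training corpus, a Chinese text-classification corpus, the multilingual parallel data of the large open subtitle collection OpenSubtitles (OpenSubtitles Corpus), the "Bible" bilingual corpus, and the Short Messages Service (SMS) corpus.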

Corpus Introduction, Section 1: Corpora
– … pools together the intuitions of a great number of speakers – … makes linguistic analysis more objective
• This module
– …introduces the theoretical and practical issues of using corpora in linguistic studies
CL timetable
Mon  Tues  Weds  Thurs  Fri
25   26    27    28     29
1    2     3     4      5
8    9     10    11     12
7th April (Sun): Friday timetable
CL timetable
• 27/03 (Wed) 18:30-21:30
• 28/03 (Thu) 13:15-16:40
• 29/03 (Fri) 14:05-17:30
• 03/04 (Wed) 18:30-21:30
• 07/04 (Fri) 14:05-17:30
• 10/04 (Wed) 18:30-21:30
• 11/04 (Thu) 13:15-16:40
• 12/04 (Fri) 14:05-17:30
– think critically about the strengths and weaknesses of the corpus methodology and decide when and how to interface it with other methodologies;
– get familiar with major corpus resources and tools and to develop DIY corpora when necessary;

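Basic Knowledge of Corpora

Computational linguistics
◦ "Computational linguistics is the discipline that studies the processing of natural language by machine. It arose at the intersection of information technology and linguistics" (CuS:1). SLP does not directly give an exact definition of computational linguistics. Its authors open by borrowing the character HAL from Stanley Kubrick's science-fiction film; HAL is a robot fluent in English. The authors introduce HAL to show what knowledge and technology are needed to build a robot that can communicate with people in natural language. On the understanding side, these are speech recognition and natural language understanding (including lip-reading); on the production side, natural language generation and speech synthesis. HAL also needs skills in information retrieval, information extraction, and inference. Solving these problems generally involves the following fields: natural language processing, computational linguistics, and speech recognition and synthesis. The authors of SLP group these three together as speech and language processing. Beyond the skills HAL uses, SLP also covers other important areas of language processing, such as spelling correction, grammar checking, and machine translation.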
Language settings
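… (such as commas and periods) included, although there are exceptions, such as the decimal point in the number 3.1415925 and the comma used as a thousands separator in integers (e.g. 100,000).
To make counting easier, tokenizing English usually means adding a space after each "token" as described above, so that it is separated from the other tokens or symbols in the text.
A type, as a statistic, is any distinct word form in the corpus text. In other words, a token that recurs in a text is counted as only one type.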
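The tokenization exception and the token/type distinction described above can be sketched as follows. This is a minimal illustration under stated assumptions and not a standard tokenizer. The regular expression, the decision to split punctuation off as separate symbols, and the choice to lowercase before counting types are all assumptions of this sketch.

```python
import re

# Assumption of this sketch: numbers such as 3.1415925 or 100,000 stay whole,
# words may contain one internal apostrophe, and any other punctuation mark
# is split off as its own symbol.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Split text into word tokens, number tokens and punctuation symbols."""
    return TOKEN_RE.findall(text)

def count_tokens_and_types(text):
    """Count word tokens and distinct word forms, ignoring punctuation and case."""
    words = [t.lower() for t in tokenize(text) if any(ch.isalnum() for ch in t)]
    return len(words), len(set(words))

sample = "The cat saw the dog, and the dog saw 100,000 birds."
tokens = tokenize(sample)
n_tokens, n_types = count_tokens_and_types(sample)

# Joining with spaces gives the "space after every token" form used for counting.
print(" ".join(tokens))
print(n_tokens, n_types)
```

In the sample, "the" occurs three times and therefore contributes three tokens but only one type, which is exactly the distinction the definition draws.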
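◦ Both can be used for statistical study of the phonetic, lexical, syntactic, and semantic levels of language.
Connections:
◦ Statistical linguistics and quantitative linguistics both use statistical methods to count linguistic elements. Quantitative linguistics aims to discover mathematical regularities of or among linguistic elements, whereas statistical linguistics aims to establish whether the counted linguistic features are statistically significant.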

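An Introduction to Using the Corpus of Contemporary American English (COCA): teaching material

COCA user guide
/
Free English corpus resources
• /static/worldcorpora.htm
• /index.html
• /m/micase/
• http://lextutor.ca/conc/eng/
• /
The COCA interface
• Query string area:
• I. WORDS: enter the search string.
• II. COLLOCATES: restrict the context.
• III. POS LIST: list of part-of-speech tags
The COCA interface
• Corpus section area (five major genre corpora comprising 42 sub-corpora in total).
• Function: this area restricts a query to a genre and a period (Year). The query can be narrowed to a single sub-corpus, and the period can be narrowed to any single year to see how a word was used in that year.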
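POS LIST
pron.INDF = indefinite pronoun
pron.PERS = personal pronoun
pron.WH = interrogative pronoun
pron.REFL = reflexive pronoun
adj.CMP = comparative adjective
adj.SPRL = superlative adjective
adv.particle = adverbial particle
adv.WH = interrogative adverb
Figure 2.1.2-1
Figure 2.1.2-2
noun: [n*]; verb: [v*]; adjective: [j*]; adverb: [r*]; pronoun: [p*]; conjunction: [c*] …
POS LIST (list of part-of-speech tags)
noun.ALL = noun
noun.SG = singular noun
noun.PL = plural noun
noun.CMN = common noun
noun.+PROP = proper noun
noun.-PROP = non-proper noun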

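Corpus website addresses

Academia Sinica Tagged Corpus of Early Mandarin Chinese:
Corpus Linguistics Online: (search for LOCNESS to find LOCNESS)
Center for Chinese Linguistics, Peking University (CCL) corpus retrieval system (includes a Modern Chinese corpus, a Classical Chinese corpus, and a Chinese-English bilingual corpus)
Southern Min archive: .tw/

Bilingual corpus of the Institute of Computing Technology, Chinese Academy of Sciences: /corpus/query_process.php
Each email address can register once, and the free period is one month. When the free period ends, register again with another email address.

The Chinese corpus there is an unprocessed raw corpus and is of little practical value.

The key point is that the English corpus there is in fact the BNC, which originally required payment to use, so it is well worth using.

The Lancaster Corpus of Mandarin Chinese: /scripts/download.php?otaid=2474
[Online dictionaries and tools]
iCIBA Chinese dictionary / (it has Flash demonstrations of Chinese character stroke order, which are good. PS: iCIBA's other links are good too.)
Korean 21st Century Sejong Project corpus (21세기세종계획) http://www.sejong.or.kr/
[The most widely used Chinese treebank in computational linguistics]
Chinese PropBank (By U of Colorado) /chinese/cpb/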

Teaching with Corpora

Joybrato Mukherjee. Korpuslinguistik und Englischunterricht: Eine Einführung (Sprache im Kontext series 14). Frankfurt: Peter Lang, 2002. 214 pp. ISBN 3-631-39346-6. Reviewed by Hilde Hasselgård, University of Oslo.

Korpuslinguistik und Englischunterricht is, as the title says, an introduction to corpus linguistics aimed at teachers of English, particularly in German upper secondary schools. At the same time, the author takes the opportunity to advocate the use of corpus-based materials and corpus methods in English language teaching.

The author sets up four aims in the preface to the book: (i) to give a presentation of the development of modern corpus linguistics, with particular regard to implications for language teaching; (ii) to discuss the potential of corpus linguistics in relation to didactic concepts and models; (iii) to describe concrete uses of corpus data and corpus methods in English language teaching; and (iv) to make some suggestions for future English language teaching and the education of English teachers in the light of the corpus revolution.

The stated aims are reflected in the organization of the book into five chapters (of which two are devoted to the third aim). Chapter 1 gives an introduction to some fundamentals of corpus linguistics. The presentation includes a survey of various English language corpora, from the beginnings of corpus linguistics with the Survey of English Usage and the Brown corpus to present-day projects such as the International Corpus of English. Furthermore, some of the principles of corpus-based language description are outlined, and there are examples and case studies. These illustrations come either from the author himself or from the work of others. Particularly Kennedy (1998) is frequently cited in this part of the book. The topics as well as the examples are selected with regard to a readership of present and prospective teachers.

Chapter 2, "From corpus to classroom", discusses some central concepts of language teaching, such as the use of authentic material, communicative competence, learner autonomy, and the choice of the native or the intercultural speaker as a model of proficiency. Some of the assumptions underlying corpus linguistics are also dealt with, such as the emphasis on empirically grounded language description, collocations and other kinds of lexicogrammatical patterns, and genre differences (cf. Biber et al. 1999).

Chapters 3 and 4 show how corpus methods can be applied in the classroom by the teacher (3) or by the students (4). The author suggests ways of exploiting corpus data indirectly, through use of corpus-based teaching materials, or directly in that the teacher and/or the students use corpora in their work with the English language. Many of the ideas for corpus use are explicitly linked to stated aims in the curriculum for the upper secondary school in one of the German Bundesländer (Nordrhein-Westfalen). There are concrete suggestions as to how corpus work can be justified within the existing plan, for example to encourage students to write a corpus-based "Facharbeit", a long essay that is usually written within literature or culture studies, but which might be linguistically oriented. Each chapter is concluded with a section suggesting further reading, which will be useful to all those wanting to take up corpus linguistics after reading this book.

The main impression of Korpuslinguistik und Englischunterricht is that of a carefully thought out volume with a firm basis both in the author's own work and in related, reputable course books and reference works (e.g. Kennedy 1998, Biber et al. 1999, Tognini-Bonelli 2001). In spite of its slender appearance, it is also a rich volume. It is ambitious, in that the author wants to convince his readers and cover a lot of ground, and it is comprehensive in its survey of both corpus linguistics and English language teaching.

In the following few paragraphs I will, however, present some critical points. The first of these has to do with the coverage of various topics. It is clearly impossible to do everything within the confines of one book, and priorities have to be made. Although, by and large, I agree with Mukherjee's selection of topics to be presented, I have certain reservations as to the relative weighting of them. The reason is that the target group consists of English teachers (and students), who presumably have little prior experience with corpora.

Chapters 1 and 2 are survey chapters, of corpus linguistics and concepts of foreign language teaching, respectively. They provide good overviews and interesting discussions, at the same time as arguing forcefully for the relevance of corpus data and corpus methods in the teaching of English. However, it takes rather a while to reach the more practical 'applications' section. Furthermore, it will perhaps be a disappointment to some readers that this section takes up slightly less than a third of the book, even though ideas for corpus work in the classroom can be found elsewhere as well. Thus, the first two chapters might have been reduced in favour of a more comprehensive 'how-to-do-it' section, and some of the theoretical considerations taken care of by means of reference to other works particularly within the didactics of language teaching, but also to some extent course books in corpus linguistics as a research field.

An example of a section that would benefit from more extensive practical advice is found on p. 159, where the author lists some grammatical topics that may usefully be investigated in corpora by students. One of these is the variation between the simple present and the present progressive, a phenomenon that is notoriously difficult to search for, because it goes beyond a mere lexical search. If given this task, students (and possibly their teacher as well) would need concrete tips on how to find useful material in the corpus.

As indicated by the author in his brief discussion of "corpus literacy" towards the end of the book, it is not a trivial matter to teach students how to perform corpus searches to get at the material they want/need or how to judge their results in relation to the type of material they have, to mention but a few areas. So if the book is going to be a "manual" for teachers wanting to use corpora in their teaching, I think an extended practical section might have been welcome.

Another point of criticism concerns the treatment of the issue of copyright, with the consequences it has for the limitations on the use and distribution of corpora. In an innocent-looking footnote (p. 155) the author remarks that the ICAME corpora, for reasons of copyright, can only be used by licence holders and only for research. As many of the preceding case studies are based on material from these corpora, this footnote may seem to undermine much of the argumentation by severely restricting the possibilities of using real corpora in class. It would have been a good idea to make this reservation earlier on, and emphasize other corpora/text archives that actually can be used by teachers and students in a secondary school setting. To be fair, newspaper archives are given some attention (p. 36), as well as the freely available "Simple search of BNC World" and the Cobuild corpus concordance sampler. Besides, it seems that the BNC as well as the ICE corpora can be used in teaching. These points should, however, have been stated more clearly and at an earlier stage of the book. It is important that readers should be confident that there are corpora that they can use for their own purposes, and also where to find them, if corpus linguistics is going to find its way into the classrooms.

The copyright issue might have been given some attention also in another connection, namely where the author suggests that the teachers may collect corpus materials themselves in connection with English for special purposes or the study of particular genres (p. 134). Although a lot of material may be easily downloaded from the Internet or scanned from books, it is not necessarily unproblematic to store and distribute it. Teachers who want to do this should at least be made aware of the most common limitations imposed by copyright restrictions.

In teaching English as a foreign language, parallel corpora have obvious advantages in that they can illustrate similarities and differences between English and the learners' L1. Parallel corpora are briefly presented on p. 60. I was surprised to see the English-Swedish Parallel Corpus (ESPC) cited as the only example, as the ESPC relies heavily on its sister project in Norway (the English-Norwegian Parallel Corpus), where much of the methodology and all the software were developed (cf. Johansson et al. 1999/2001). Furthermore, the Chemnitz Internet Grammar and Translation Corpus might have been given more attention. As mentioned briefly by the author (p. 151), the Internet Grammar is based on a parallel corpus of German and English. From its website, the Chemnitz Internet Grammar does not seem to be limited to academic users, and should thus be relevant to German learners of English. The learner will get access to corpus examples, although the whole corpus is open only to linguists.

Mukherjee makes a good case for using corpus methods on "non-corpus" texts in a combination of content analysis and linguistic investigation. One example is a comparison between a speech by George W. Bush and an essay by Arundhati Roy (p. 101 ff.). With word frequency lists for the two texts as a starting point, the author proceeds to identifying patterns which some of the most frequent content words enter into. One may wonder, however, if all the conclusions drawn by the author are based on the concordance work alone, or whether they presuppose knowledge of the whole text. The suspicion that the latter may be the case is even stronger in the case of the analysis of The Cure's texts (p. 137), where the concordance in fact tells the uninitiated reader very little about content.

The idea of using corpora in cultural studies is not new (cf. e.g. Stubbs 1996: 157), but certainly interesting in a TEFL setting. Thus, the case study of the collocational patterns of Kashmir in ICE-GB and ICE-IND is a good example of a corpus-linguistic basis for comparing British and Indian politics. It is, however, more doubtful if differences in patterns of preposition and article use (p. 99) can be classified as intercultural phenomena; rather, I would say they belong to the field of regional variation (indeed another fruitful area of corpus investigation suggested by the author).

In recommending corpus-based teaching materials (Chapter 3), Mukherjee devotes a lot of well-deserved space to the Cobuild range. As his bibliography, like his examples, is generally impressively up-to-date, I was surprised to see that the 3rd edition of the Cobuild dictionary (2001) is not included, while there is frequent reference to the second edition (1995). The author's point about continually updating corpus-based dictionaries so as to include a new word such as road rage (absent from the 1995 edition) could have been better made by reference to the 2001 edition, where the word indeed has an entry.

In spite of the above reservations, I think Korpuslinguistik und Englischunterricht gives a nice introduction to the field of corpus linguistics and pays due attention to its stated readership (teachers in secondary schools). It should also be able to inspire reflection on what kind of English students should learn and how they should learn it. One can only hope that it will reach its target group and that corpora will find their natural place in the EFL classroom.

References
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad and Edward Finegan. 1999. Longman grammar of spoken and written English. London: Longman.
Chemnitz Internet Grammar and Translation Corpus: http://www.tu-chemnitz.de/phil/InternetGrammar/ [accessed 21.11.02]
Johansson, Stig, Jarle Ebeling and Signe Oksefjell. 1999/2001. English-Norwegian Parallel Corpus manual. http://www.hf.uio.no/iba/prosjekt/ENPCmanual.html [accessed 20.11.02]
Kennedy, Graeme. 1998. An introduction to corpus linguistics. London: Longman.
Schmied, Joseph et al.: The Chemnitz Internet Grammar. http://www.tu-chemnitz.de/phil/InternetGrammar/publications/index.html [accessed 21.11.02]
Sinclair, John (ed.). 2001. Collins English dictionary for advanced learners (3rd edition).
Stubbs, Michael. 1996. Text and corpus analysis. Oxford: Blackwell.
Tognini-Bonelli, Elena. 2001. Corpus linguistics at work. Amsterdam: John Benjamins Publishing Company.

Chinese-English Bilingual Parallel Corpus

A Parallel Corpus of Chinese and English Texts: Human Perspective
Title: A Joyful Reunion

Introduction:
The anticipation of a reunion is always accompanied by excitement and a sense of warmth deep within the heart. Whether it is a long-lost friend or a beloved family member, the joy of being reunited with someone dear to us is indescribable. In this parallel corpus, we explore the emotions and experiences of individuals as they reunite with their loved ones, capturing the essence of these heartfelt moments through the power of language.

Section 1: Longing for Home
In this section, we delve into the stories of individuals who have been away from home for extended periods. Through their narratives, we understand the deep longing they felt for the familiar sights, sounds, and smells of their homeland. The use of vivid descriptions and evocative language enables readers to empathize with their yearning and appreciate the significance of reuniting with their roots.

Section 2: Friends Reunited
Friendships are an integral part of our lives, and the joy of reconnecting with long-lost friends is immeasurable. Through real-life anecdotes, we explore the emotions of individuals who unexpectedly crossed paths with childhood companions or college buddies. The conversations, shared memories, and laughter that ensue paint a picture of genuine happiness and the renewal of cherished bonds.

Section 3: Reunited with Family
Family is the cornerstone of our existence, and being separated from loved ones can be emotionally challenging. In this section, we follow the journeys of individuals who were separated from their families due to various circumstances. The moments of reunion, filled with tears, hugs, and expressions of love, showcase the essence of familial connections and the overwhelming joy of being together once again.

Section 4: Rekindling Romance
Love knows no boundaries, and the rekindling of romance after a period of separation can be a deeply moving experience. Through intimate stories shared by couples who endured distance and time apart, we explore the emotions of longing, anticipation, and ultimately, the euphoria of being reunited with one's soulmate. The power of love and the strength of relationships are beautifully portrayed through their heartfelt narratives.

Conclusion:
The parallel corpus presented here captures the essence of joyful reunions through the eyes of individuals who experienced them firsthand. By using descriptive language, evoking emotions, and presenting relatable stories, the aim is to transport readers into these moments of reunion and make them feel as if they were there themselves. The human perspective shines through in every word, ensuring a natural and immersive reading experience that transcends the boundaries of language.

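COCA corpus operation demo (.ppt tutorial)

Figure 2.4.2
Rule: enter woman and man in the two WORDS boxes, then enter [j*] in the COLLOCATES box and select 3 on the left, which matches all adjectives within a span of 3 words before the search word. The comparison can also be restricted to a particular sub-corpus.
• 2.4.3 Searching for synonyms
• For example: search for all synonyms of beautiful (Figure 2.4.3-1)
Figure 2.3-1
Figure 2.3-2
• The frequencies in the two sub-corpora can also be compared directly. To do so, select the two sub-corpora under sections 1 and 2, as in Figure 2.3-3:
Figure 2.3-3
• 2.4 Comparing semantic preferences
• 2.4.1 Comparing near-synonyms
• For example: the difference in the nouns that follow the near-synonymous adjectives hot and warm (Figure 2.4.1):
Figure 2.4.1
Rule: first select the COMPARE display. Then enter hot and warm in the two WORDS boxes, and enter [n*] in the COLLOCATES box to match any noun that follows. The comparison can also be restricted to a particular sub-corpus.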
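The idea behind the COMPARE rule above is to count the collocates of each word within a span and then see which collocates belong to only one of the two words. The following is a rough sketch of that idea and not COCA's own algorithm. The function collocates, the hand-tagged toy data, and the simplified tag names are all hypothetical.

```python
from collections import Counter

def collocates(tagged, node, tag_prefix, left=0, right=0):
    """Count words near `node` whose tag starts with `tag_prefix`.

    `left` and `right` give the span (in words) searched on each side.
    """
    counts = Counter()
    for i, (word, _tag) in enumerate(tagged):
        if word.lower() != node:
            continue
        window = tagged[max(0, i - left):i] + tagged[i + 1:i + 1 + right]
        for w, t in window:
            if t.startswith(tag_prefix):
                counts[w.lower()] += 1
    return counts

# Hand-tagged toy data (invented). The tags only imitate the n*/j* style.
tagged = [
    ("hot", "jj"), ("water", "nn1"), ("and", "cc"), ("warm", "jj"),
    ("water", "nn1"), ("a", "at1"), ("warm", "jj"), ("welcome", "nn1"),
    ("a", "at1"), ("hot", "jj"), ("topic", "nn1"), ("hot", "jj"),
    ("water", "nn1"),
]

hot = collocates(tagged, "hot", "n", right=1)    # nouns right after "hot"
warm = collocates(tagged, "warm", "n", right=1)  # nouns right after "warm"
only_hot = sorted(set(hot) - set(warm))
only_warm = sorted(set(warm) - set(hot))
print(hot, warm, only_hot, only_warm)
```

Calling the same function with left=3 and the adjective tag prefix would correspond to the "3 on the left" setting used for woman and man.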
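• 2.4.2 Comparing antonyms
• For example: the difference in the adjectives that precede woman and man (Figure 2.4.2)
Figure 2.1.4-1
Rule: to retrieve all the singular, plural, and tense forms of a word, enclose the word in [ ] when entering it.
Figure 2.1.4-2
Retrieving the base, comparative, and superlative forms of the adjective early in a single search
• 2.1.5 Queries that combine a part of speech with a partial spelling, for example all forms of all adjectives that begin with un- and end in -ed (Figure 2.1.5-1), or all phrases of the form verb + any word + ground (Figure 2.1.5-2):
• Rule: to retrieve forms of a given part of speech that contain certain letters, such as all adjectives that begin with un- and end in -ed, enter un*ed.[aj*]; to retrieve all phrases of the form verb + any word + ground, enter [vv*]*[ground]. The former is used to study vocabulary, the latter to look up collocations involving a particular part of speech.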
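The wildcard-plus-tag rule above can be imitated offline on any tagged word list. The sketch below uses Python's standard fnmatch module for the * wildcard. It is an analogy for how such a query filters word forms, not the COCA query engine. The function name match_wildcard, the toy word list, and its tags are assumptions.

```python
from fnmatch import fnmatchcase

def match_wildcard(pattern, tagged, tag_prefix=""):
    """Return the distinct word forms that match `pattern` and the tag prefix."""
    return sorted({
        word.lower()
        for word, tag in tagged
        if fnmatchcase(word.lower(), pattern) and tag.startswith(tag_prefix)
    })

# Toy tagged word list (invented; tags are illustrative only).
tagged = [
    ("unexpected", "jj"), ("united", "vvn"), ("unused", "jj"),
    ("under", "ii"), ("Unfinished", "jj"), ("ended", "vvd"),
]

# Analogue of un*ed restricted to adjectives.
result = match_wildcard("un*ed", tagged, tag_prefix="j")
print(result)
```

Here united matches the spelling pattern but is excluded by its verb tag, which shows why the part-of-speech restriction matters for this kind of search.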

BYU Corpus: English Corpora

Introducing the BYU Corpus: A Rich Resource for English Language Research
The Brigham Young University (BYU) Corpus is a comprehensive and diverse collection of English language materials, providing a rich resource for linguistic research and analysis. Spanning a wide range of genres and topics, the corpus offers a unique window into the structure, usage, and evolution of the English language.

The Scope and Breadth of the Corpus
The BYU Corpus boasts a diverse array of materials, including books, newspapers, magazines, academic articles, blogs, social media posts, and even transcribed spoken conversations. This breadth ensures that researchers can access a wide range of linguistic variations and styles, from formal academic writing to informal social media chatter.

The Value of the Corpus for Linguistic Research
The corpus is invaluable for various linguistic research projects. It can be used to study language change over time, explore the distribution and frequency of words and phrases, and analyze syntactic and semantic patterns. By analyzing the corpus, linguists can gain insights into the structure of the English language and how it is used in real-world contexts.

The Technical Aspects of the Corpus
The BYU Corpus is meticulously annotated, allowing researchers to easily search and retrieve specific linguistic features. It is also compatible with various software tools for corpus analysis, making it easy for researchers to explore and analyze the data.

Applications Beyond Linguistics
While primarily used for linguistic research, the BYU Corpus can also be applied in other fields such as literature, history, and cultural studies. By analyzing the corpus, scholars can gain insights into the cultural and historical context of English language usage.

Challenges and Future Directions
Despite its rich resources and wide range of applications, the BYU Corpus faces some challenges. One such challenge is the ever-evolving nature of the English language, which requires regular updates to the corpus to reflect new linguistic trends and patterns. Future directions for the corpus could include expanded annotation schemas, more sophisticated search and retrieval tools, and integration with other linguistic resources.

Conclusion
The Brigham Young University Corpus is a comprehensive and invaluable resource for English language research. Its diverse collection of materials, meticulous annotation, and compatibility with various analysis tools make it a must-have for linguists and scholars alike. As the English language continues to evolve, the BYU Corpus will remain a critical tool for understanding its structure, usage, and cultural context.

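Common Corpus Terms Explained (1)
Frank Liang
Common terms in corpus linguistics
Monolingual corpus: a corpus which contains texts in a single language.
Multilingual corpus: a corpus which represents small collections of individual monolingual corpora (or subcorpora) in the sense that they use the same or similar sampling procedures and categories for each language but contain completely different texts in those several languages.
TTR (type/token ratio) is a common measure of lexical density in a text. It can help indicate the lexical difficulty of a text.
However, a text contains many function words (such as the, a, of) that recur again and again. Each word added to the text increases the token count by one, but the type count does not necessarily increase with it. So the longer the text, the more often function words repeat, and the lower the TTR becomes. Using raw TTR to measure lexical density is therefore not reasonable.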
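The length effect described above is easy to reproduce. The sketch below computes the raw TTR and, as one commonly used remedy, a standardized TTR that averages the TTR of consecutive equal-sized chunks. The standardized measure is not named in the text above and is an addition of this sketch, as are the function names and the chunk size.

```python
def ttr(tokens):
    """Type/token ratio: distinct word forms divided by running words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def sttr(tokens, chunk_size=1000):
    """Standardized TTR: mean TTR over consecutive full chunks of `chunk_size`.

    A trailing partial chunk is ignored. If the text is shorter than one
    chunk, the plain TTR is returned instead.
    """
    chunks = [
        tokens[i:i + chunk_size]
        for i in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not chunks:
        return ttr(tokens)
    return sum(ttr(chunk) for chunk in chunks) / len(chunks)

short = "the cat sat on the mat".split()
long_text = short * 50  # same vocabulary, fifty times the length

print(ttr(short))       # 5 types / 6 tokens
print(ttr(long_text))   # 5 types / 300 tokens: much lower
print(sttr(long_text, chunk_size=6))
```

The raw TTR collapses as the text grows even though the vocabulary is unchanged, while the chunk-averaged value stays at the level of the short text, which is why a length-controlled measure is preferred when comparing texts of different sizes.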
The corpus approach is based on real language use; facts speak louder than words.
A corpus can be analyzed using software tools, much like those used to find key words on the Internet, but with greater sophistication. By evaluating the results of these searches, it is possible to see how language is really used, and to find answers to questions like these:

Introducing Corpus Linguistics
Corpus Linguistics Richard Xiao lancsxiaoz@
Module description
• Since the 1990s, the corpus methodology has revolutionized nearly all branches of linguistics
Contents
1) Introducing corpus linguistics
2) Corpus design and types of corpora
3) Data capture and markup
4) Corpus annotation
5) Making statistical claims
6) Corpus analysis (1): concordance and wordlist
7) Corpus analysis (2): keyword analysis
8) Corpora in lexicographic and lexical studies
9) Corpora in grammatical studies
10) Corpora in diachronic studies
11) Corpora in language variation research
12) Corpora in sociolinguistic studies
13) Corpora in language education
14) Corpora in literary and stylistic studies
15) Corpora in critical discourse analysis
16) Corpora in contrastive and translation studies
Outline of this session
• Lecture: introducing key concepts and debates in corpus linguistics
– What is and is not a corpus?
– Why use corpora?
– Corpora vs. intuitions
– The corpus methodology
– A brief history of Corpus Linguistics
– Nature and applications of corpus-based studies
What is not a corpus?
• A list of words is not a corpus
– Building blocks of language
• A text archive is not a corpus
– A random collection of texts
• A collection of citations is not a corpus
• A text is not a corpus
– Intending to be read in different ways
CL timetable
Mon  Tues  Weds  Thurs  Fri
25   26    27    28     29
1    2     3     4      5
8    9     10    11     12
7th April (Sun): Friday timetable
CL timetable
• 27/03 (Wed) 18:30-21:30 E6-224
• 28/03 (Thu) 13:15-16:40 E6-219
• 29/03 (Fri) 14:05-17:30 E6-219
• 03/04 (Wed) 18:30-21:30 E6-224
• 07/04 (Fri) 14:05-17:30 E6-219
• 10/04 (Wed) 18:30-21:30 E6-224
• 11/04 (Thu) 13:15-16:40 E6-219
• 12/04 (Fri) 14:05-17:30 E6-219
Reading list
• Set text
– McEnery, A., Xiao, R. and Tono, Y. (2006) Corpus-Based Language Studies: An Advanced Resource Book. London & New York: Routledge.
– Wynne, M. (2005) Developing Linguistic Corpora. Oxford: Oxbow Books. Available online at /creating/guides/linguistic-corpora


Teaching/learning strategies
• With a dual focus on 'why' and 'how to' in corpus-based language studies, this practical module will be delivered through a series of lectures and hands-on lab sessions
• The module also engages students in extensive reading and interaction with corpus data outside of class
– … pools together the intuitions of a great number of speakers
– … makes linguistic analysis more objective
• This module
– … introduces the theoretical and practical issues of using corpora in linguistic studies
– … explores how the corpus-based approach and other methodologies can be combined in linguistic studies
Assessment
• Option A
– A 1,000-word essay that critically reviews a corpus exploration tool or a corpus-based study (40%) – A 2,500-word project report (60%)
• Recommended reading
– See the module syllabus at the course website
– /fass/projects/corpus/ZJU/CL_syllabus.htm (pass for unzipping ebooks: lancs)
Aims of the module
• The module aims to
– provide an introduction to corpus linguistics;
– familiarise students with major corpus resources and tools;
– pass on essential knowledge and skills for building DIY corpora;
– keep students up to date with the latest developments in corpus research;
– develop students' ability in corpus-based language studies.
– Corpus analysis can be illuminating in “virtually all branches of linguistics or language learning.” (Leech 1997)
• One of the strengths of corpus data lies in its empirical and attested nature
– A short quotation which contains a word or phrase that is the reason for its selection
• A collection of quotations is not a corpus
– A short selection from a text chosen on internal criteria by human beings
• Option B
– One 3,500-word essay based on a research project of your own choice (100%)
• Deadline: Friday 31 May 2013 • Submission
– A Word copy as email attachment
• “A corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety.” (MXT 2006: 5)
– …but rarely a random collection of text – Corpora “are generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type.” (Leech 1992)