多语种在线语料库检索平台使用简明手册.pdf
浅谈如何快速搭建英汉双语平行语料库与平行语料库检索平台
浅谈如何快速搭建英汉双语平行语料库与平行语料库检索平台语料库不仅在商业领域有着重要的作用,在翻译学研究、语用学研究以及实践教学等领域都有重要的作用。
语料库研究与应用是以语料库建设为前提,语料库建设是所有环节中最为重要的一个环节。
语料库在商业领域与科研教学领域的应用与研究的快速发展,得益于语料库建设的技术手段日益成熟,同时语料库也呈现出多样化的应用与实践。
本文通过深度探索语料库建设与应用的前沿技术发展与应用情况,重点介绍建立英汉语料库以及平行语料库应用平台所需技术支持以及详细的语料库建设与应用操作细则。
标签:语料库建设;语料库应用;双语平行语料库语料库分为单语语料库、双语语料库以及多语语料库,语料库是语言实际应用过程中产生的语言数据,例如图书的翻译、商业文件的翻译以及新闻报告的翻译等语言数据都是形成语料库的基本语料材料。
目前的研究主要是基于双语语料库的制作与应用,双语语料库也是最为广泛使用以及数量最多的语料库种类之一,语料库的存放是以数据库的形式存在为主,形成真正的语料库需要经过收集、转化、降噪、对齐、审校等诸多步骤,形成最终可用的语料库。
语料库的建设目的是多样化的,语料库的来源也是极其广泛,其中尤为重要的环节就是语料的对齐,语料对齐的速度直接决定了语料库制作的效率。
高质量的语料库是进行语料库制作与应用的基础,语料库的质量会直接影响最终的应用效果。
一、研究意义语料库的研究与应用目前在商业领域已经有了突飞猛进的发展,特别是近两年神经网络的发展,语料库对于机器翻译的发展奠定了基础,极大提高了目前谷歌、百度、搜狗、有道以及必应等机器翻译引擎的质量。
不仅如此,商业领域的巨头包括强生、中石化、微软、阿里巴巴以及腾讯等诸多公司都在不同程度的基于语料库提升在各自特定領域的机器翻译引擎质量,其中阿里巴巴的机器翻译引擎已经为中国众多企业将成千上万的商品推向全球市场提供了翻译支持。
不仅如此,语料库在学术、科研以及教学实践等应用方面都有着举足轻重的作用,利用语料库可以进行语用学、翻译学、译者行为、语言风格等多方面学术科研;同时语料库在教学中也广泛应用,通过语料库进行教学应用,教师可以将学生的翻译作业整理成语料库,利用语料库检索功能,学生可以进行自查自纠,教师也可以通过制作学生翻译作业的语料库寻找共性问题进行讲解,帮助学生解决翻译实践中产生的问题。
语料库的分类、创建和检索简述
语料库的分类
语料库的分类
根据不同的标准,语料库可以分为多种类型。常见的语料库类型包括: 1、通用语料库:包含来自不同领域、不同语言的语料,适用于广泛的研究和 应用领域。
语料库的分类
2、专业语料库:针对特定领域或专业构建的语料库,例如医学、法律、金融 等。
3、口语语料库:包含口头语言 材料,如录音、口语表达等。
二、图像分类技术
另外,降维技术也可以用于图像分类。降维技术可以将高维的图像特征降维 到低维的空间,从而使得分类更加简单和高效。常用的降维技术有PCA、t-SNE和 autoencoder等。
三、图像语义检索与分类技术的 研究现状
三、图像语义检索与分类技术的研究现状
近年来,图像语义检索和分类技术的研究取得了显著的进展。在图像语义检 索方面,研究者们提出了多种基于内容、语义相似度和向量空间模型等方法。在 图像分类方面,SVM、神经网络和降维技术等算法的应用取得了重要突破。
一、图像语义检索技术
图像语义检索是指通过自然语言描述或者用户提交的查询关键词,从图像库 中检索出与查询相关的图像。近年来,研究者们提出了多种图像语义检索的方法。
一、图像语义检索技术
基于内容的图像语义检索是通过分析图像的内容,提取出图像的特征,然后 根据这些特征进行检索。例如,可以通过提取图像的颜色、纹理、形状等特征进 行检索。另外,还可以利用深度学习技术,如卷积神经网络(CNN)来提取图像 的特征,提高检索的准确性。
语料库的创建
此外,为了便于语料库的管理和检索,需要构建语料库的索引和词典。索引 可以记录每个单词在语料库中出现的位置和频率,而词典则包含了单词的语义信 息和语法信息等。最后,语料库的创建还需要注意保证数据的安全性和隐私保护。
Trados术语库教程
上次说到可以随意下载的正版Trados相信你已经安装成功了。
让我们一起看看有些什么值得欣赏的内容。
在桌面上点击“开始”-“所有程序”-“Trados 6.5 Freelance”之后你会看到有一组项目其中包括1“文件documentation”里面是pdf格式的各种用户手册对Trados的主要模块进行了十分详尽的说明可惜都是英文版。
但既然大家都是干翻译的直接看原版手册应当是个良好的习惯没有什么疑难长句而且多有重复。
2“过滤模块Filters”这些模块可以理解为是某种专用的转换工具把一些特殊软件的字体或格式转换为翻译平台可以接受的文件以便进行后续工作。
这些软件在国内不常遇到辽倌壳澳憧梢圆挥霉芩 ?3“教材Tutorial”采用小电影的形式介绍了翻译平台和对齐模块如果你没有耐心去看原文手册也应当看完这些小电影便于从整体上快速了解Trados当然还是英文的但是高度概括而且直观。
4“专用窗口T-Windows”这些模块针对各种格式文件提供了定制化的编辑环境以便进行翻译和本地化工作你可以在这里处理诸如ExcelPowerPoint可执行文件剪贴板素材等各种含有可译文字的内容。
如果非常熟练你会发现在这里干活有时要比翻译平台还方便因为平台是个正规餐厅去那里就餐有时要讲究着装不是什么素材都能直接拿来处理而在这里则相当随便只要工具顺手拿来就用比如只管翻译幻灯片上的文字不用搭理图片是不是愿意。
5注册和版本说明。
6翻译平台本身这当然是Trados的核心也只有这部分是有加密保护的其它模块的注册都在这里体现。
或者换句话说只要在这里注册成功其它所有的部件也全都可以使用了。
7“标识符编辑器TagEditor”对于各种需要保护其内在格式但又要翻译其文字的文件需要借助这个模块进行处理。
与T-Windows不同标识符编辑器主要处理与互联网有关的文件格式HTML XML 和SGML这些格式看起来面熟吧还有用于桌面出版DTP的某些文件。
多语种在线语料库检索平台使用简明手册
)
大学
教授创建的 系列语料库检索界面(
)。类似的
在线语料库检索系统还有
、、
、
等。而当前主
流的语料库工具属于第三代,其中以
、
和
等为代表。
第四代语料库工具,将语料库与分析工具合二为一,越来越受到普通用户的青睐。在线
语料库工具通常将语料库文本按特定格式建成索引( ),存储在服务器上。用户检索响
应速度要远高于三代软件在本地电脑上的检索速度。其操作也较三代语料库软件简便得多。
之间的距离)
出现次数
检索词、中心词、节点词
查询结果每页显示的行数
查询、检索
限定条件查询
直译:在 个不同文本中返回
个匹
配项
意译:在 个文本中查到
例子
查询结果按中心词排序
简单查询(不区分大小写)
词语相关查询
附录 :复杂检索举例(查询时,选择
)
单词检索:
、
、
词码混合检索:
、
、
、
、
近义词批量检索:
、
北外语料库语言学团队网站:
表 :查询结果后续操作分项功能表
新查询,返回语料库检索首页
查询结果随询结果排序设定
搭配计算
下载保存查询结果
键,即可
(随机取样),比如,可从 万行结果中,随机抽取 行。
(频数分解)表示在进行复杂查询时,对命中的不同词项分别计
算频数。比如,查询
时,会按这 个词
结果;
( )计算特定词语在语料库中的典型搭配(
);
( )计算语料库中的核心关键词(
),等。
1
、
使用实例
标准查询模式
在简单查询模式(
双语语料库收集整理加工任务工作手册
由于收集和预处理的问题, 语料中一些段落被非法割断, 一个明显的标志就是段尾没有 合法的段落结束符号,具体情况如: (1) 文字间被截断 (2) 标点符号处被截断 (3) 单词被截断 工作人员应利用工具提供的“合并段落”功能对这类问题进行处理。 工具界面下方的段落计数提示工作人员原文文件和译文文件的段落对应情况。 若原文文 件和译文文件的段落数不同, 工作人员应检查语料中是否存在被非法割断的段落, 并进 行相应的处理( “段落切分”与“合并段落” ) 。 (注:原则上,允许原文文件和译文文件 的段落数不相同,但必须保证此差异不是由段落被非法割断所造成的。 ) 由于收集和预处理的问题, 语料中仍存在一些非法空格 (即多余的空格, 包括段首空格、
973“面向新闻领域的汉英机器翻译课题组”文档
保密级别:内部
共 1 页
4/19/2003
双语语料库收集整理加工任务 工作手册(1)— 语料的手工整理
[作 者:]柏晓静 [参与者:]常宝宝 詹卫东 吴云芳 [项目名称:] 973MT_ParaCorpus [最近修订时间:] 4/19/2003 [最近修订者:] 柏晓静 [版本号:] V1.0 [文档历史记录:] V0.5,V0.6,V0.7,V0.71,V0.72,V0.8,V0.9 [提交:] MT 组例会 [目 录 ] 1 引言........................................................................................................................................1 2 语料手工整理的具体工作内容与要求 ................................................................................1 2.1 文件层次的工作内容和要求细节 .............................................................................2 2.2 内容与格式层次的工作内容和要求细节 ................................................................2 2.3 标记层次的工作内容和要求细节 ............................................................................3 2.3.1 文件中需要标记的具体内容 ..........................................................................3 2.3.2 文件中需要标注的篇章信息 ..........................................................................4 2.3.3 文件中需要标记的其他内容 ..........................................................................5 4 样例........................................................................................................................................6 5 结束语..................................................................................................................................27
最新常用在线语料库使用简介PPT课件
字串查询区
图5-1
COCA
5. COCA界面简介(图5-1)
语料库分类区
图5-1
COCA
5. COCA界面简介(图5-1)
CCL使用说明书
一关于CCL语料库及其检索系统1.1 CCL语料库及其检索系统为纯学术非盈利性的。
不得将本系统及其产生的检索结果用于任何商业目的。
CCL不承担由此产生一切后果。
1.2 本语料库仅供语言研究参考之用。
语料本身的正确性需要您自己加以核实。
1.3 语料库中所含语料的基本内容信息可以在“高级搜索”页面上,点击相应的链接查看。
比如:“作者列表”:列出语料库中所包含的文件的作者“篇名列表”:列出语料库中所包含的篇目名“类型列表”:列出语料库中文章的分类信息“路径列表”:列出语料库中各文件在计算机中存放的目录“模式列表”:列出语料库中可以查询的模式1.4 语料库中的中文文本未经分词处理。
1.5 检索系统以汉字为基本单位。
1.6 主要功能特色:∙支持复杂检索表达式(比如不相邻关键字查询,指定距离查询,等等);∙支持对标点符号的查询(比如查询“?”可以检索语料库中所有疑问句);∙支持在“结果集”中继续检索;∙用户可定制查询结果的显示方式(如左右长度,排序等);∙用户可从网页上下载查询结果(text文件);二关于查询表达式本节对CCL语料库检索系统目前支持的查询表达式加以说明。
2.1 特殊符号查询表达式中可以使用的特殊符号包括8个:| $ # + -~ ! :这些符号分为四组:Operator1: |Operator2: $ # + - ~Operaotr3: !Delimiter: :符号的含义如下:(一)Operator1: Operator1是二元操作符,它的两边可以出现“基本项”(关于“基本项”的定义见2.2)(1)| 相当于逻辑中的“或”关系。
(二)Operator2:Operator2是二元操作符,它的两边可以出现“简单项”(关于“简单项”的定义见2.3)(2)$ 表示它两边的“简单项”按照左边在前、右边在后的次序出现于同一句中。
两个“简单项”之间相隔字数小于或等于Number(3)# 表示它两边的“简单项”出现于同一句中,不考虑前后次序。
常用在线语料库使用简介
COCA
6.1.1 检索某一词形
在显示方式区选择KWIC 并再次点击search, 可得含有“feature”的词语 索引(图6.1.1-4)
图6.1.1-4
COCA
6.1.2 检索某一词性的单词
输入“feature.[v*]”,可得到“feature”做动词时的使 用情况 (图6.1.2-1)
Ⅰ 显示及查询条件界定区,包括:显示方式区, 字串查询区,语料库分类区,查询结果排列方 式区。
Ⅱ 查询结果数据显示区
Ⅲ 例句显示区
COCA
5. COCA界面简介(图5-1)
/coca/ 显示方式区
图5-1
COCA
5. COCA界面简介(图5-1)
或者“制度,观点”的词搭配
图6.2-2
COCA
6.2 检索搭配词
点击conditions可进一步观察prevail的语境 (图6.2-3)
通过观察例句,我们发现与prevail共现的conditions常 有消极意义的词修饰,例如harsh, precarious, daunting, severe, colder and drier, dangerous等
——以BNC、COCA和Sketch Engine 为例
One-word Introduction
英国国家语料库(British National Corpus/BNC): 库 容1亿词的现代英式英语样本集合,文本来源广泛,其 中书面语占90%,口语占10%。
美国当代英语语料库((Corpus of Contemporary American English/COCA): 库容为4.5亿词的大型平 衡语料库,含有多个字库,具有多种检索功能,可免 费在线使用。
语料库 入门
语料库入门
OUTLINE
1.
基本概念 2. 著名网络语料库 3. 常用软件
Corpus(语料库,尸体): (pl. corpora or corpuses): a collection of text, now usually in machine-readable form and compiled to be representative of a particular kind of language and often provided with some kind of annotation(标注). 按照一定的采样标准采集而来的、能 代表一种语言或者某语言的一种变体 或文类的电子文本集。
在口语中,start更常用。
语料库的方法基于真实的语言
使用情况,事实胜于雄辩
我们通过对语料库的检索结果进行分析,可以找到很多问 题的答案,例如: “学知识”在英语中是“study knowledge”吗? “快速导航”翻译成“fast guide”对不对? “只为点滴幸福”这句广告语,对应的英文翻译是“Little happiness matters.”吗? 为何“The bad weather set in on Monday.”是正确的, 但“The good weather set in on Monday.”却是错误的?
熟 语 语 料 库
语料库语言学常用术语
Types
of corpora
General corpus通用语料库 Annotated标注 corpus: a corpus enhanced with various types of linguistic information (or tagged corpus). An annotated corpus may be considered to be a repository of linguistic information, because the information which was implicit in the plain text has been made explicit through concrete annotation (“added value附加值”).
语料库检索使用指南
Homework for Introduction parthttp://211.69.132.28/ 检索的库为:introduction 子语料库语料库使用练习目标一:熟悉语步与词汇的对应关系;目标二:学习以扩展意义单位为基础的新语义观(核心词、搭配、类联接、语义倾向、语义韵);目标三:掌握有语言问题后如何查找相应答案的技能。
提交的作业文件名为:姓名+introduction提交的内容: 1. 在三个introductions,标注:1)M1, M2, M3;2)每个move的内容要点(用汉语);3)每个语步的经典句型划线,4)红颜色标注:语步1中的评价性形容词、语步2的转折连词(引出现有研究的问题),语步3中代表弥补现有研究不足的表达(如研究目的等)2. 回答表格中基于语料库检索的8个问题。
提交时间:周二上课的班级提交时间为周一晚9:30:提交给刘琴同学的QQ邮箱周三上课的班级提交时间为周二晚9:30 ,提交到周颖同学的QQ邮箱Direction :1.Download 3 introduction parts from 3 journal articles in your own professional fields. Identify the 3 moves of the introduction part and mark them respectively by M1, M2, M3.and point out the main point of each move inChinese in barckets. Mark evaluative adj.in M1(评价性形容词), disjunctive conj.(转折连词)in M2, and the expressions implying filling gaps, such as research purpose in M3 in red.Move1 : statements about the subjects. (M1), (main points :problems, background information, definition, importance ,etc) , Move2 : review of relevant studies(M2) (description & comments , point out the weakness of existing researches)Move3: introduction of the present study(M3)(purposes to fill the gaps, research focuses, questions, hypothesis,etc.)2.Underline the representative sentence patterns in each move and summarize it in the bracket such as [importance]3.Answer the questions in the right column of the form based on the corpus data.(注意:如果你不会调节表格,请把答案写在表格外)Sample :The separation of mixtures of alkanes is an important activity in the petroleum and petrochemical industries. For example, the products from a catalytic催化isomerization reactor consist of a mixture of linear, mono-methyl and di-methyl alkanes. Of these, the di-branched molecules are the most desired ingredients in petrol because they have the highest octane number. It is therefore required to separate the di-methyl alkanes and recycle the linear and mono-methyl alkanes back to the isomerization reactor. In the detergent industry, the linear alkanes are the desired components and need to be separated from the alkanes mixture[M1: 通过现实需要突出研究的重要性与意义].Selective sorption on zeolites is often used for separation of alkanes mixture(1-7文献被省略). The choice of the zeolite depends on the specific separation task in hand. For example, small-pore Zeolite A are used for separation of linear alkanes using the molecular sieving principle. However, the branched molecules cannot enter the zeolite structure[M2:指出现有研究方法及方法中存在的问题]. This study aims to overcome this limitation. Both linear and branched molecules are allowed inside the medium-pore MFI matrix and the sorption hierarchy in MFI will be dictated both by the alkanes chain length and degree of branching.[M3:本研究目的和采用新方法的优势]Introduction的写作方法:说明论文特定主题与较为广泛的研究领域之间的关系,同时提供足够的背景资料。
使用COCA等在线语料库相关说明
1. Who created these corpora?The corpora were created by Mark Davies, Professor of Linguistics at Brigham Young University in Provo, Utah, USA. In most cases (though see #2 below) this involved designing the corpora, collecting the texts, editing and annotating them, creating the corpus architecture, and designing and programming the web interfaces. Even though I use the terms "we" and "us" on this and other pages, most activities related to the development of most of these corpora were actually carried out by just one person.2. Who else contributed?3. Could you use additional funding or support?As noted above, we have received support from the US National Endowm ent for the Humanities and Brigham Young University for the developm ent of several corpora. However, we are always in need of ongoing support for new hardware and software, to add new features, and especially to create new corpora. Because we do not charge for the use of the corpora (which are used by 80,000+ researchers, teachers, and language learners each month) and since the creation and maintenance of these corpora is essentially a "one person enterprise", any additional support would be very welcom e. There might be graduate programs in linguistics, or ESL or linguistics publishers, who might want to make a contribution, and we would then "spotlight" them on the front page of the corpora. Also, if you have contacts at a funding source like the Mellon Foundation or the MacArthur grants, please let them know about us (and no, we're not kidding).4. What's the history of these corpora?The first large online corpus was the Corpus del Español in 2002, followed by the BYU-BNC in 2004, the Corpus do Português in 2006, TIME Corpus in 2007, the Corpus of Contemporary American English (COCA) in 2008, and the Corpus of Historical American English (COHA) in 2010. (More details...)5. What is the advantage of these corpora over other ones that are available?For some languages and time periods, these are really the only corpora available. For example, in spite of earlier corpora like the American National Corpus and the Bank of English, our Corpus of Contemporary American English is the only large, balanced corpus of contemporary American English. In spite of the Brown family of corpora and the ARCHER corpus, the Corpus of Historical American English is the only large and balanced corpus of historical American English. And the Corpus del Español and the Corpus do Português are the only large, annotated corpora of these two languages. Beyond the "textual" corpora, however, the corpus architecture and interface that we have developed allows for speed, size, annotation, and a range of queries that we believe is unmatched with other architectures, and which makes it useful for corpora such as the British National Corpus, which does have other interfaces. Also, they're free -- a nice feature.6. What software is used to index, search, and retrieve data from these corpora?We have created our own corpus architecture, using Microsoft SQL Server as the backbone of the relational database approach. Our proprietary architecture allows for size, speed, and very good scalability that we believe are not available with any other architecture. Even complex queries of the more than 425 million word COCA corpus or the 400 million word COHA corpus typically only take one or two seconds. In addition, be cause of the relational database design, we can keep adding on more annotation "modules" with little or no performance hit. Finally, the relational database design allows for a range of queries that we believe is unmatched by any other architecture for large corpora.7. How many people use the corpora?As measured by Google Analytics, as of March 2011 the corpora are used by more than 80,000 unique people each month. (In other words, if the same person uses three different corpora a total of ten times that month, it counts as just one of the 80,000 unique users). The most widely-used corpus is the Corpus of Contemporary American English -- with more than 40,000 unique users each month. And people don't just come in, look for one word, and move on -- average time at the site each visit is between 10-15 minutes.8. What do they use the corpora for?For lots of things. Linguists use the corpora to analyze variation and change in the different languages. Some are materials developers, who use the data to create teaching materials. A high number of users are language teachers and learners, who use the corpus data to model native speaker performance and intuition. Translators use the corpora to get precise data on the target languages. Some businesses purchase data from the corpora to use in natural language processing projects. And lots of people are just curious about language, and (believe it or not) just use the corpora for fun, to see what's going on with the languages currently. If you are a registered user, you can look at the profiles of other users (by country or by interest) after you log in.9. Are there any published materials that are based on these corpora?As of mid-2011, researchers have submitted entries for more than 260 books, articles and conference presentations that are based on the corpora, and this is probably only a sm all fraction of all of the publications that have actually been done. In addition, we ourselves have published three frequency dictionaries that are based on data from the corpora -- Spanish (2005), Portuguese (2007), and American English (2010).10. How can I collaborate with other users?You can search users' profiles to find researchers from your country, or to find researchers who have similar interests. In the near future, we may start a Google Group for those who want more interaction.11. What about copyright?Our corpora contain hundreds of millions of words of copyrighted material. The only way that their use is legal (under US Fair Use Law) is because of the limited "Keyword in Context" (KWIC) displays. It's kind of like the "snippet defense" used by Google. They retrieve and index billions of words of copyright material, but they only allow end users to access"snippets" (片段,少许)of this data from their servers. Click here for an extended discussion of US Fair Use Law and how it applies to our COCA texts.12. Can I get access to the full text of these corpora?Unfortunately, no, for reasons of copyright discussed above. We would love to allow end users to have access to full-text, but we simply cannot. Even when "no one else will ever use it" and even when "it's only one article or one page" of text, we can't. We have to be 100% compliant with US Fair Use Law, and that means no full text for anyone under any circumstances -- ever. Sorry about that.13. I want more data than what's available via the standard interface. What can I do?Users can purchase derived data -- such as frequency lists, collocates lists, n-grams lists (e.g. all two or three word strings of words), or even blocks of sentences from the corpus. Basically anything, as long as it does not involve full-text access (e.g. paragraphs or pages of text), which would violate copyright restrictions. Click here for much more detailed information on this data, as well as downloadable samples.14. Can my class have additional access to a corpus on a given day?Yes. Sometimes your school will be blocked after an hour or so of heavy use from a classroom full of students. (This is a security mechanism, to prevent "bots" from running thousands of queries in a short time.) To avoid this, sign up ahead of time for "group access".15. Can you create a corpus for us, based on our own materials?Well, I probably could, but I'm not overly inclined to at this point. Creating and maintaining corpora is extremely time intensive, even when you give me the data "all ready" to import into the database. The one exception, I guess, would be if you get a large grant to create and maintain the corpus. Feel free to contact me with questions.16. How do I cite the corpora in my published articles?Please use the following information when you cite the corpus in academic publications or conference papers. And please remember to add an entry to the publication database (it takes only 30-40 seconds!). Thanks.In the first reference to the corpus in your paper, please use the full name. For example, for COCA: "the Corpus of Contemporary American English" with the appropriate citation to the references section of the paper, e.g. (Davies 2008-). After that reference, feel free touse something shorter, like "COCA" (for example: "...and as seen in COCA, there are..."). Also, please do not refer to the corpus in the body of your paper as "Mark Davies' COCA corpus", "a corpus created by Mark Davies", etc. The bibliographic entry itself is enough to indicate who created the corpus.。
多语种在线语料库检索平台 BFSU CQPweb 使用简明手册
多语种在线语料库检索平台BFSU CQPweb使用简明手册许家金中国外语教育研究中心1、访问及登录访问124.193.83.252/cqp/(用户名:test和密码:test),可点击使用相应的语料库。
目前BFSU CQPweb平台上已安装英语、汉语、德语、日语、俄语、阿拉伯语、冰岛语等7个语种35个语料库。
图1:BFSU CQPweb主界面2、CQPweb功能概要按McEnery & Hardie(2012)对语料库分析工具的时代划分,CQPweb属于第四代语料库工具,即在线语料库分析工具。
四代工具的突出代表是美国杨百翰(Brigham Young)大学Mark Davies教授创建的BYU系列语料库检索界面(/)。
类似的在线语料库检索系统还有SketchEngine、CWB、BNCweb、Phrase in English等。
而当前主流的语料库工具属于第三代,其中以WordSmith、AntConc和PowerConc等为代表。
第四代语料库工具,将语料库与分析工具合二为一,越来越受到普通用户的青睐。
在线语料库工具通常将语料库文本按特定格式建成索引(index),存储在服务器上。
用户检索响应速度要远高于三代软件在本地电脑上的检索速度。
其操作也较三代语料库软件简便得多。
四代语料库工具可完成三代语料库几乎所有的功能,其中又以CQPweb所能实现的功能最多最全。
更重的是,CQPweb是开源软件。
概括说来,CQPweb可以实现以下功能。
(1)在线生成语料库的词频表(frequency list);(2)查询(query)字词、语言结构等,以获取大量语言实例或相应结构的出现频次(frequency),并可以按语体、年代、章节、学生语言水平级别、写作题材等分别呈现查询结果;(3)计算特定词语在语料库中的典型搭配(collocation);(4)计算语料库中的核心关键词(keywords),等。
3、CQPweb使用实例3.1 标准查询模式在简单查询模式(Simple query mode)下,可输入单词、短语等进行检索。
语言技术计划公共实例语料库查看器基本使用手册说明书
BASIC MANUAL OF USE OF THE PUBLIC INSTANCE OF CORPUS VIEWER PLAN FOR THE ADVANCEMENT OF LANGUAGE TECHNOLOGYJuly/2019INDEXDOCUMENT PURPOSE 3 AVAILABLE DOCUMENTARY CORPUS 3 ACCESS TO CORPUS VIEWER 4 NAVIGATION BY CORPUS VIEWER TOOLS 5 A BRIEF INTRODUCTION TO THE MODELING OF TOPICS 5 VISUALIZING THE TOPICS THAT CHARACTERIZE A DOCUMENTARY CORPUS 6 TÓPICOS: GENERAL VISION TAB 6 TÓPICOS: TOPICS TAB 8 TÓPICOS: DOC-TÓPICOS TAB 10 TÓPICOS: CORRELACIÓN TAB 11 STUDY OF RELATIONS BETWEEN DOCUMENTS BASED ON THEIR TOPICS 12 CORRELACIÓN: DOCUMENTS TAB 12 CORRELACIÓN: ALARMAS TAB 13 DOCUMENTS SIMILAR TO AN ARBITRARY TEXT 15 SEARCH TOOL 161.DOCUMENT PURPOSEThis document provides a basic user guide of the Corpus Viewer platform for analyzing documentary collections, developed within the Language Technology Plan. It allows, through the use of natural language technologies and other artificial intelligence techniques, to analyze large volumes of unstructured textual information and infer relationships between these texts.This application serves as support for those responsible for public policies, both for the design and monitoring of policies, as well as for the management of projects calls exploiting the large collections of unstructured data available.Corpus Viewer is a tool that is in production in different entities of the Public Sector in Spain (SEAD, SEUIDI, FECYT), and users usually receive training several hours prior to their access to the tool. For access to the instance (online demonstrator)It is not practical to propose such training, and the tool itself is not designed to be self-explanatory in all its functionality, which suggests that users have a minimum of documentation to better interpret the information provided by the tool. This guide has been written for that purpose.2.AVAILABLE DOCUMENTARY CORPUSWe understand by corpus, a collection of documents whose content is expressed in natural language.As of January 18, 2020 the following documentary corpus are available in the public instance of Corpus Viewer:●ACL: It is a corpus of scientific publications in the field of computational linguistics(Association of Computational Linguistics).●CORDIS720: Research Projects funded by the European Union within the SeventhFramework and Horizon 2020 Program.●CORDIS720_AI: Contains a selection of previous corpus projects in which ArtificialIntelligence is present, either because the project develops Artificial Intelligence techniques, or because they are used in some scope of application. The selection of the projects included in the subcorpus has been carried out automatically using machine learning techniques. Theuse of these techniques makes it possible to address the labeling of a large number of projects, avoiding the high cost in time that manual labeling would entail, but inevitably implies the introduction of a certain margin of error regarding the selected projects.The following documentary corpus will soon be published on the platform:●Aid from the National Science Foundation (NSF)●American aid in the field of health sciences (NiH)● A corpus of larger scientific publications (based on Semantic Scholar)The publication of these and other corpus will be notified to active users, unless they have expressed their desire not to receive any communication.3.ACCESS TO CORPUS VIEWERAccess to the online demonstrator must be requested by sending an email to ********************************* with subject “Corpus Viewer Access”.Once your application has been processed, you will receive an email with your username and password, allowing access to the demonstrator through the following web address:https://cvdemo.plantl.gob.es/CorpusViewer/#/loginAfter identifying yourself In the system it is convenient that you change the access password initially provided, for which you must access the “Editar Perfil” option located in the drop-down menu in the upper right part of the window.Figure 1: User Profile Edition.To log out of Corpus Viewer, access this menu again, to the option "Cerrar Sesión".4.NAVIGATION BY CORPUS VIEWER TOOLSTo use the tool itself, you must access the "Menu" option in the upper tab. Once you have selected any of the available options, the following information appears on that top tab:● A list of available tabs, each of which provides a different view of the selected documentarycorpus.● A drop-down menu in which you can select the corpus with which you want to work.● A drop-down menu that allows you to select a model from those associated with theselected corpus.Figure 2: Navigation through the General Menu. Display selection based on topics.5.A BRIEF INTRODUCTION TO THE MODELING OF TOPICSThe construction of topic models is based on a machine learning technique called Latent Dirichlet Allocation (LDA). There are multiple sources on the Internet that provide information about this technique, some merely intuitive, and others addressing in greater mathematical detail the generation of topics and documents. This Quora entry contains several explanations with different levels of complexity. For reasons of academic recognition we also want to include the original paper by David Blei in which the original algorithm is proposed.For the purposes at hand, it is possibly enough to explain the following two basic concepts in a very simplistic way:●In LDA a topic can be characterized as a set of words that usually appear together in manydocuments. For example: the words gene, cellular, membrane usually co-occur frequently.LDA is able to locate these co-occurrences on the complete collection of documents, and define the topics from them. You could say that each set of words represents a possible thematic area that is what we call a topic.●In LDA a document can be characterized by a single topic, although often it is really a mixtureof topics. Again, LDA provides a vector for each document that indicates the extent to which the document belongs to each of the identified topics.The tools used in Corpus Viewer are based on Latent Dirichlet Allocation, but include some modifications made within the various contracts executed in the Language Technology Plan. The interested reader can refer to the plan's website for more information on some of these developments (currently information is only published in Spanish):https://www.plantl.gob.es/inteligencia-competitiva/resultados/desarrollos-SW/Paginas/desarrollos.apx6.VISUALIZING THE TOPICS THAT CHARACTERIZE A DOCUMENTARYCORPUSSelecting "Menú -> Tópicos estáticos: Tópicos", we have access to the following tabs:●Visión General: Allows you to study the main themes of the corpus.●Tópicos: It allows studying the main themes of the corpus.●Doc-Tópicos: It allows analyzing the themes of specific documents.●Correlación: It allows studying the relationships between themes.6.1TÓPICOS:GENERAL VISION TABThe first of the available visualizations takes us to a window in which we are shown general information about the selected documentary corpus, and about each of the topics identified for saidcorpus. It also includes an interactive graphic display of the model. As the cursor passes through the sets, a label is shown with the words that characterize each topic. If you click on one you will access the detail of that topic. Clicking again returns to the overview.In the list “Tópicos del modelo”, the following information is offered for each of the topics:●Relative profile size (estimated by the LDA model; it is related to the importance of the topicin the corpus, but a direct relationship cannot be inferred with the number of documents associated with the topic, since we have seen that the documents can belong to several topics to a different extent).● A title proposed by an expert annotator of the SEAD (bold text)●The list of words identified as most relevant to each topic (below the title of each topic).The list of topics is of the sliding type, so we must move with the cursor over it to visualize all the topics.Figure 3: General View of Corpus Viewer Topics.If we click on any of the topics (both in the graphic display and in the list of topics), the view changes to emphasize the selected topic and also shows:● A graphic display of the most relevant words of the topic (both on the interactive ballchart, as in the histogram version)● A list of the documents that best represent the selected profile. By clicking on theavailable link, we can access the text associated with the document.By clicking on the ball chart again we can move to another profile, or return to the general model display.Figure 4: Detailed visualization of topic including its description based on words, and the most characteristicdocuments of the selected topic.6.2TÓPICOS:TOPICS TABThis second tab allows a visualization of the model similar to that described in the previous case, although the selection of topics is done through a drop-down menu in which the title of the topics and their relative importance in the corpus are shown.Figure 5: Display of topics in the "tópicos" tab.Again, for the selected topic, the most representative documents are shown, and the list of the most relevant words, both in histogram and word bag format.This window also offers the possibility of emphasizing the most discriminative words of the topic (keywords) by selecting the option “Con penalización por TF/IDF”.The use of TF-IDF is common in the representation of documents using bags of words. In this case, we use an extension of this concept to represent the value of the words in each topic. Being:●TF: Term Frequency: Measures the probability of a word in a given topic.●IDF: Inverse Document Frequency: In this context, it is an inverse factor to the importanceof the term in the set of topics of the model.In this way, if we activate the option “Con penalización por TF/IDF”, the system will reweigh the weight assigned to each word within the topic, and weight will be subtracted from those words that are common to a larger number of topics (common words with little semantic relevance). In other words, we will emphasize the most discriminative words, in the sense that words that are mostly present just in the selected topic are emphasized.Finally, it is worth mentioning that the tab offers information on the standardized entropy of the topic, which gives an idea of the mainstreaming of the topic throughout the collection of documents.However, the calculation of standardized entropies currently implemented offers a low dynamic range, and the SEAD technical team is developing new indicators to better characterize horizontal and vertical topics.6.3TÓPICOS:DOC-TÓPICOS TABThe "Doc-Tópicos" tab allows you to search for documents by keywords. This search engine has the ability to "autocomplete", so that by entering some words, suggestions of documents containing them will be provided.Once the document to be analyzed has been selected, a graphic visualization of its thematic content is offered. Remember that in Latent Dirichlet Allocation each document is characterized by its level of belonging to the topics of the model.Figure 6: Detailed analysis of documents based on the most relevant topics that characterize it.As an example, the included figure shows that the document:“206298 - Deep learning and Bayesian inference for medical imaging”belongs in 56% to the topic characterized by the words “method, datum, simulation,…” (Algorithms and Modeling), in a 38% to the topic characterized by the words "patient, cancer, treatment, ..." (Cancer and Biomedical Applications), and to a lesser extent to other profiles.The graphic is interactive, which allows to expand to visualize the topics of minor importance for the document by clicking on them. To return to the more general previous view, just click on the center of the circular crown.6.4TÓPICOS:CORRELACIÓN TABLastly, the tool allows you to measure the level of correlation between topics. For this, it is estimated that the relationship between two topics is greater when these topics tend to occur together in the same documents.Navigating on the graph on the left we can select each of the topics of the model, and the links with other topics show their level of concurrence with other topics of the model. Since the figure does not have enough space to show the full title of the profiles, this information is included in textual format on the right side of the tab. When positioning in the figure on the name of a subject, the complete title will be shown in the textual information on the right side of the page. Selecting a topic on the figure shows only the relationships with it, hiding the rest of the flows.Figure 7: Visualization of the correlation between model topics. For each topic other topics that frequentlyco-occur are highlighted.Additionally, you can select the option “Con penalización por TF/IDF” that has already been explained in the previous section, as well as, choose a higher or lower threshold for correlation, so that only those relationships that exceed the threshold will be displayed.7.STUDY OF RELATIONS BETWEEN DOCUMENTS BASED ON THEIR TOPICSAs already mentioned, the topic modeling algorithm used allows each document to be represented based on its level of belonging to the different topics. This representation allows to measure “semantic distances” between documents. According to this distance, two documents are more similar to each other if their topic vectors are similar as well, that is, if they belong to the same topics to similar extents.Corpus Viewer incorporates tools that allow to exploit this semantic relationship between documents. Selecting the option "Menú -> Tópicos estáticos: Correlación" we access two tabs that exploit this information:●Documentos: Document search tool by semantic similarity.●Alarmas: Search tool for pairs of documents with very high semantic similarity.7.1CORRELACIÓN:DOCUMENTS TABThe first of the available tabs offers a document search engine that allows you to select a specific document. Once selected, a list of up to 20 documents that have a high semantic relationship with the selected document is offered.For each of the documents listed, by clicking on the different icons that appear on your right, we can:●Check their metadata, including the title and the full text of each document.●export the complete list of documents to excel.Figure 8: List of documents semantically similar to the document selected by the user.Finally, it is worth mentioning that the list allows iterative document browsing: if we click on the title of the documents in the list of similar documents, we will select that document and the tool will update the list of similar documents with those corresponding to the new document selected.To return to the complete list, just click on the "Listado inicial" button.7.2CORRELACIÓN:ALARMAS TABThis tool allows you to search for pairs of documents with very high semantic similarity. This similarity can be used to search for duplicates, or documents that have been submitted multiple times for evaluation.It should be stressed that the tool provided is not based on a search for textual similarity (as turnitin tools, etc.), but semantic similarity. Two documents can be very similar to each other as long as they combine the same topics in similar proportions. For this reason, this search tool is very robust against the presence of synonyms, rewrites of texts, etc., because the representation of the document in the topic model remains relatively stable when the text goes through revision or minor changes.Figure 9: Options for searching for “Alarms” based on semantic similarity between documents.The tool allows to determine the level of similarity required for the detection of alarms (lower and upper percentile), or to require that one of the two selected documents belong to a specific year (field “centered on year”)1. Once we have established the desired settings, we have to press the “cargar” button and the tool will load the pairs of similar documents in the drop-down menu “Alarmas encontradas”.As an example, if we select the CORDIS-IA corpus and use the default parameters, the first alarm found (with a similarity of 94%) provides the view of the following figure. We can verify that these are two projects requested in years 2009 and 2013, and that they are basically a continuation of each other.Figure 10: “Alarms” found by the application, and parallel view of two documents identified as(semantically) very similar.1 Sometimes it is interesting to decrease the upper percentile to a value less than 100% or to focus the analysis on a specific year. This can be important especially in those cases in which the documents have been subject to an OCR process (this is the case of the ACL corpus), since in certain cases there may be thematically identical documents because they are associated with the presence of noisy characters that come from a malfunction of character recognition.If we click on the “comparar paneles” option, we can see how the textual similarity of both projects is relatively low, although a high semantic similarity has been detected. Regarding textual similarity, the sentences marked in red (green) appear only in the text of the document in the left (right) panel, while the white text is the one that appears simultaneously in both documents. This example clearly illustrates the difference between this tool based on semantic similarity versus other tools based on textual similarity.Figure 11: Textual comparison panel between pairs of documents with large semantic similarity.8.DOCUMENTS SIMILAR TO AN ARBITRARY TEXTAll the functionalities described in the previous section allow to exploit semantic similarities, but their use is restricted to those documents that belong to the collections of documents already loaded in Corpus Viewer. Sometimes it can be interesting to look for similarities with other new texts provided by the user. This is possible by selecting the option "Menú -> Tópicos estáticos: Inferencia" in the main menu of Corpus Viewer.Figure 12: Tab for thematic inference about free text provided by the user, and search for documents with asimilar theme indexed in Corpus Viewer.The Inference tool is based on the following steps:1.The text provided is preprocessed using the same tools that were used for preprocessingthe documents of the active corpus.2.The text provided is “projected” on the topic model associated with the active corpus. Inthis way, we obtain a representation based on topics similar to that available for all the corpus documents loaded in Corpus Viewer.3.The semantic similarity between the text provided and each of the documents of theselected corpus is calculated, and the most similar documents are shown to the user.It is worth mentioning that this tool requires the execution of certain calculations on Corpus Viewer servers, so the response time may be a few seconds (larger when the number of documents in the selected corpus is also very large).It is also necessary to highlight that the semantic representation of the text will have better quality the longer the length of the text provided. Therefore we can expect higher quality results the longer the query text.9.SEARCH TOOLselecting "Menú -> Buscador" you can access the last of the options currently active in Corpus Viewer, which consists of a tool based on Solr and Banana. This tool offers the functionality of a BI type tool, although it integrates the available metadata with the document-based representation of topics.Currently, the search engine is in the development phase, so all the information that will be available in the final version is not incorporated, and changes in the panels that are finally incorporated in each corpus are expected.Although the development is not finished, it has been decided to leave this tab active in the open instance of Corpus Viewer, so that users can get a first impression of the type of functionality that will be provided once the development is completed.You can check the demo version with data from CORDIS for Artificial Intelligence developed on Javascript (takes a while to load).The operation of the search engine that will be incorporated into Corpus Viewer will be similar to that of the demonstrator provided, and will include all the search and grouping power provided by Solr indexing technology.。
国家语委现代汉语通用平衡语料库 标注语料库数据及使用说明
国家语委现代汉语通用平衡语料库标注语料库数据及使用说明肖航教育部语言文字应用研究所1. 国家语委现代汉语通用平衡语料库1.1 语料库全库国家语委现代汉语通用平衡语料库全库约为1亿字符,其中1997年以前的语料约7000万字符,均为手工录入印刷版语料;1997之后的语料约为3000万字符,手工录入和取自电子文本各半。
语料库的通用性和平衡性通过语料样本的广泛分布和比例控制实现。
语料库类别分布如下所示:1.2 标注语料库标注语料库为国家语委现代汉语通用平衡语料库全库的子集,约5000万字符。
标注是指分词和词类标注,已经经过3次人工校对,准确率大于>98%。
语料库全库按照预先设计的选材原则进行平衡抽样,以期达到更好的代表性。
标注语料库在样本分布方面近似于全库,不破坏语料选材的平衡原则。
标注语料库类别分布如下所示:标注语料库与全库的样本分布比较如下所示:(蓝色曲线为语料库全库;红色曲线为标注语料库)2. 国家语委现代汉语通用平衡语料库语料选材与样本分布2.1 选材原则依据材料内容,选材大体作如下分类:(下文字数为建库时数据)2.1.1 教材大中小学教材单作一类,约2000万字。
2.1.2 人文与社会科学的语言材料约占全库的60%,共3000万字,包括:·政法(含哲学、政治、宗教、法律等);·历史(含民族等)·社会(含社会学、心理、语言、教育、文艺理论、新闻学、民俗学等);·经济;·艺术(含音乐、美术、舞蹈、戏剧等);·文学(含口语);·军体;·生活(含衣食住行等方面的普及读物)。
2.1.3 自然科学(含农业、医学、工程与技术)的语言材料,应涉及其发展的各个领域。
拟从大、中、小学教材和科普读物中选取。
其中,科普读物约占6%,共300万字。
教材字数另计。
2.1.4 报刊。
以1949年以后正式出版的由国家、省、市及各个部委主办的报纸和综合性刊物为主,兼顾1949年以前的报纸和综合性刊物。
常用在线语料库使用简介分析
频数
图2.1-2
BNC
2.2 其他可下载的BNC产品
➢ BNC XML edition:BNC全库 ➢ BNC Baby:BNC子库,包含小说、新闻、科技、口语四
类文本各100万词 ➢ BNC sampler:BNC 子库,包含书面语、口语两类文本
各100万词 ➢ 注:以上语料库采用XML格式,需使用XAIRA软件检索
➢ 输入“feature”(图6.1.1-1)
图6.1.1-1
COCA
6.1.1 检索某一词形
➢ 在搜索结果区可得到“feature”的频数 (图6.1.1-2 ) ➢ 点击该词,可在例句显示区看到含有“feature”的词
条(图6.1.1-3)
图6.1.1-2
图6.1.1-3
COCA
6.1.1 检索某一词形
图5.3-1
COCA
5.3 语料库分类区
➢ 42个子语料库 (图5.3-2)
图5.3-2
COCA
5.3 语料库分类区
➢ 42个子语料库 (图5.3-3)
图5.3-3
COCA
5.4 查询结果排列方式区
➢ Sort by:检索结果的排列方式,可按频率、关联度、 或字母顺序排列,一般默认按频率排列 (图5.4-1)
➢ 在显示方式区选择KWIC 并再次点击search, 可得含有“feature”的词语 索引(图6.1.1-4)
Sketch Engine: 在线语料库管理及检索工具,可有 效总结词汇的语法及搭配行为。
BNC
1. BNC官网主界面(图1-1)
基本信息及 功能介绍区
简单搜索区 进入简单搜索功能介绍
图1-1
BNC
2. BNC在线检索功能介绍 2.1 BNC simple search
美国当代英语语料库(COCA)使用介绍
• 2.3 搜索在子语料库内(或之间)出现的频率 (或比较)(不同语域中的用法)
• 如在Fiction和Newspaper子语料库中passionate 后面可以跟任何名词的词及频率,分别如两图 (2.3-1和2.3-2)。
图2.3-1
图2.3-2
COCA主要功能(三)
• 但是也可以之间对两者子语料库中它们出现频率 的对比,操作:分别选择section 1&2,如下图(图 2.3-3):
POS LIST
verb base=动词原形 verb.INF=动词不定式 verb MODAL=情态动词 verb 3SG=动词第三人称单数 verb ED=过去式 verb EN=过去分词 verb ING=现在分词 verb.LEX=lexical verb实意动词 verb.[BE]=系动词 verb.[DO]=do verb.[HAVE]=have
美国当代英语语料库(COCA)使用说明
/coca
一、COCA语料库简介
• COCA简介
– COCA:美国当代英语语料库(Corpus of Contemporary American English)是由美国杨 伯翰大学(Brigham Young University)的 Mark Davies 教授开发的美国最新当代英语语 料库,是当今世界上最大的英语平衡语料库。
图2.1.2-1
图2.1.2-2
White+名词的短语
规则:输入名词的话用正表达式: [nn*];动词: [v*]; 形容词: [j*]; 副词: [r*];代词:[p*];连词:[c*]……
POS LIST 词性列表
POS LIST
HZAU CQPweb 简明使用手册.doc
农科英语语料库网络检索平台HZAU CQPweb 使用手册登录网址:http://211.69.132.28用户名:test 密码:test本手册分为如下几个部分:本手册分为如下几个部分:1.平台登陆界面:介绍平台登陆方式和基本界面;2.检索方式:介绍简单检索和复杂检索模式下输入检索词的格式,及得到检索结果后的后续操作,3.功能介绍:介绍HZAU CQPweb可实现的基本功能,包括:标准查询、限定条件查询、词形匹配查询、生成词频表、生成关键词表;重点介绍了随机抽样、频率分解、分布展示、排序、搭配查询等功能;4.功能拓展举例:以问答的方式举出操作实例,帮助使用者了解如何利用语料库解决实际问题;5.术语中英文对照表;6.附录-Claws7.词性赋码集和基本通配符1.平台登陆页面介绍CQPweb(Corpus Query Processor)是语料库在线检索平台,HZAU CQPweb是华农师生共建的农科英语论文语料库(总计:738.2 万词),属第四代网络语料库工具。
语料库的三层架构(见下图):第一层是农科专业期刊论文语料库Journal article,收录英语母语者发表的期刊论文838篇,共553.7万词。
第二层是汉语母语学习者语料库Learner article,收录农科专业的硕士生、博士生撰写的、完整的农科SCI论文手稿379篇,共184.5万词。
这两个语料库构架一样,均由按照章节部分和学科分类的两个子库构成:章节子库按照英语名称缩写命名,包含摘要(ABS)、引言(INT)、方法(MET)、结果(RET)、讨论(DIS)、结论(CON) 6个库,章节部分子语料库主要用于写作教学研究。
学科子库按照汉语拼音首字母缩写命名,包含植物科学(ZWKX)、动物科学(DWKX)、生命科学(SMKE)、园艺林学(YYLX)、农业经济(NYJJ)、农业工程(NYGC)、水产科学(SCKX)、食品科学(SPKX)、资源环境(ZYHJ)9个学科子库。
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
多语种在线语料库检索平台使用简明手册
许家金
中国外语与教育研究中心
、访问及登录
访问(用户名:和密码:),可点击使用相应的语料库。
目前平台上已安装英语、汉语、德语、日语、俄语、阿拉伯语、冰岛语等数十个语料库。
图:主界面
、功能概要
按()对语料库分析工具的时代划分,属于第四代语料库工具,即在线语料库分析工具。
四代工具的突出代表是美国杨百翰()大学教授创建的系列语料库检索界面()。
类似的在线语料库检索系统还有、、、等。
而当前主流的语料库工具属于第三代,其中以、和等为代表。
第四代语料库工具,将语料库与分析工具合二为一,越来越受到普通用户的青睐。
在线语料库工具通常将语料库文本按特定格式建成索引(),存储在服务器上。
用户检索响应速度要远高于三代软件在本地电脑上的检索速度。
其操作也较三代语料库软件简便得多。
四代语料库工具可完成三代语料库几乎所有的功能,其中又以所能实现的功能最多最全。
更重的是,是开源软件。
概括说来,可以实现以下功能。
()在线生成语料库的词频表();
()查询()字词、语言结构等,以获取大量语言实例或相应结构的出现频次(),并可以按语体、年代、章节、学生语言水平级别、写作题材等分别呈现查询结果;
()计算特定词语在语料库中的典型搭配();
()计算语料库中的核心关键词(),等。
、使用实例
标准查询模式
在简单查询模式()下,可输入单词、短语等进行检索。
图:语料库查询界面
图:查询结果界面
点击查询结果页面右上角下拉菜单,显示(新查询)时,按键,即可重新回到语料库检索界面。
相当于返回按钮。
新查询,返回语料库检索首页
查询结果随机抽样
频数分解、分解频数
查询结果的分布展示
查询结果排序设定
搭配计算
下载保存查询结果
(随机取样),比如,可从万行结果中,随机抽取行。
(频数分解)表示在进行复杂查询时,对命中的不同词项分别计算频数。
比如,查询时,会按这个词项分别报告命中频数和频率。
图:动词查询(频数分解)结果示例
:按语体、年代、章节、学生语言水平、写作题材等分别呈现查询结果
图:语料库中"lov.*"的分布情况()
图:语料库中"lov.*"的分布情况(Bar chart)
:计算特定词语在语料库中的典型搭配
图:语料库中"lov.*"的典型搭配词(以log likelihood value排序)
限定条件查询
限定条件查询,指在任务一开始,即选定一个或多个限制条件(如,语体、年代、章节、写作题材等)进行查询。
限定条件的有无、多寡,源自语料库文本的元信息()。
因此,在创建语料库时,应尽可能详细记录语料文本产生的社会语言学信息。
丰富的社会语言学信息,可以大大丰富研究的层面和深度。
这样的元信息可以存储在文本的头部,也可以在文本之外单独存储。
图:限定在语料库的学术语体中查询情态动词生成词频表
图:语料库的词频表
生成主题词表
比如以《红楼梦》与语料库进行对比,可能得到《红楼梦》的主题性词汇。
、多语种语料库建设思路
本族语平衡语料库:百万词次以上
特定语体语域专门用途语料库:比如文学作品、新闻报导、法律文本、网络文本等学习者语料库:学习者作文、翻译练习
翻译文本及平行语料库
附录:平台中英文术语对照表(表)
词次
词种
词语搭配
语料库说明文档
语料库元信息
复杂检索语法
分布(按语体等分类条件分别呈现结果)
频数、频率
频数分解、分解频数
词频表、词表
词频表
主题词
对数似然率(典型词语搭配的统计方法)
最大跨距(计算搭配时中心词和左右语境词
之间的距离)
出现次数
检索词、中心词、节点词
查询结果每页显示的行数
查询、检索
限定条件查询
直译:在个不同文本中返回个匹
配项
意译:在个文本中查到例子
查询结果按中心词排序
简单查询(不区分大小写)
词语相关查询
附录:复杂检索举例(查询时,选择)
单词检索:、、
词码混合检索:、、、
、
近义词批量检索:、
北外语料库语言学团队网站:
使用北外平台,可引用:
许家金、吴良平,,基于网络的第四代语料库分析工具及应用实例,《外语电化教学》():,。