使用COCA等在线语料库相关说明

合集下载
  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。

1. Who created these corpora?

The corpora were created by Mark Davies, Professor of Linguistics at Brigham Young University in Provo, Utah, USA. In most cases (though see #2 below) this involved designing the corpora, collecting the texts, editing and annotating them, creating the corpus architecture, and designing and programming the web interfaces. Even though I use the terms "we" and "us" on this and other pages, most activities related to the development of most of these corpora were actually carried out by just one person.

2. Who else contributed?

3. Could you use additional funding or support?

As noted above, we have received support from the US National Endowm ent for the Humanities and Brigham Young University for the developm ent of several corpora. However, we are always in need of ongoing support for new hardware and software, to add new features, and especially to create new corpora. Because we do not charge for the use of the corpora (which are used by 80,000+ researchers, teachers, and language learners each month) and since the creation and maintenance of these corpora is essentially a "one person enterprise", any additional support would be very welcom e. There might be graduate programs in linguistics, or ESL or linguistics publishers, who might want to make a contribution, and we would then "spotlight" them on the front page of the corpora. Also, if you have contacts at a funding source like the Mellon Foundation or the MacArthur grants, please let them know about us (and no, we're not kidding).

4. What's the history of these corpora?

The first large online corpus was the Corpus del Español in 2002, followed by the BYU-BNC in 2004, the Corpus do Português in 2006, TIME Corpus in 2007, the Corpus of Contemporary American English (COCA) in 2008, and the Corpus of Historical American English (COHA) in 2010. (More details...)

5. What is the advantage of these corpora over other ones that are available?

For some languages and time periods, these are really the only corpora available. For example, in spite of earlier corpora like the American National Corpus and the Bank of English, our Corpus of Contemporary American English is the only large, balanced corpus of contemporary American English. In spite of the Brown family of corpora and the ARCHER corpus, the Corpus of Historical American English is the only large and balanced corpus of historical American English. And the Corpus del Español and the Corpus do Português are the only large, annotated corpora of these two languages. Beyond the "textual" corpora, however, the corpus architecture and interface that we have developed allows for speed, size, annotation, and a range of queries that we believe is unmatched with other architectures, and which makes it useful for corpora such as the British National Corpus, which does have other interfaces. Also, they're free -- a nice feature.

6. What software is used to index, search, and retrieve data from these corpora?

We have created our own corpus architecture, using Microsoft SQL Server as the backbone of the relational database approach. Our proprietary architecture allows for size, speed, and very good scalability that we believe are not available with any other architecture. Even complex queries of the more than 425 million word COCA corpus or the 400 million word COHA corpus typically only take one or two seconds. In addition, be cause of the relational database design, we can keep adding on more annotation "modules" with little or no performance hit. Finally, the relational database design allows for a range of queries that we believe is unmatched by any other architecture for large corpora.

7. How many people use the corpora?

As measured by Google Analytics, as of March 2011 the corpora are used by more than 80,000 unique people each month. (In other words, if the same person uses three different corpora a total of ten times that month, it counts as just one of the 80,000 unique users). The most widely-used corpus is the Corpus of Contemporary American English -- with more than 40,000 unique users each month. And people don't just come in, look for one word, and move on -- average time at the site each visit is between 10-15 minutes.

相关文档
最新文档