Modern Chinese Corpus Word Frequency List (CorpusWordlist)


CQPweb_practical_guide


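A Concise Guide to BFSU CQPweb, a Multilingual Online Corpus Query Platform
Xu Jiajin, National Research Centre for Foreign Language Education (2012-11-07)

1. Access and login

Visit 124.193.83.252/cqp/ (username: test, password: test) and click a corpus to start using it.

BFSU CQPweb currently hosts 23 corpora in 7 languages: English, Chinese, German, Japanese, Russian, Arabic, and Icelandic.

Figure 1: BFSU CQPweb main interface

2. Overview of CQPweb functions

According to McEnery & Hardie's (2012) periodization of corpus analysis tools, CQPweb is a fourth-generation corpus tool, that is, an online corpus analysis tool.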

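The most prominent fourth-generation tools are the BYU corpus query interfaces (/) created by Professor Mark Davies of Brigham Young University in the United States.

Similar online corpus query systems include SketchEngine, CWB, BNCweb, and Phrases in English.

The mainstream corpus tools today, however, belong to the third generation, represented by WordSmith, AntConc, and PowerConc.

Fourth-generation tools, which combine the corpus and the analysis tool in one, are increasingly popular with ordinary users.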

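Online corpus tools usually build an index of the corpus texts in a specific format and store it on a server.

Queries therefore return much faster than searches run by third-generation software on a local computer.

They are also much simpler to operate than third-generation corpus software.

Fourth-generation tools can perform almost all the functions of third-generation tools, and CQPweb offers the largest and most complete set of functions among them.

More importantly, CQPweb is open-source software.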

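In summary, CQPweb provides the following functions.

(1) Generate a frequency list for a corpus online; (2) query characters, words, and linguistic structures to retrieve large numbers of examples or the frequency of a structure, with results broken down by register, period, chapter, learner proficiency level, essay topic, and so on; (3) compute the typical collocations of a given word in the corpus; (4) compute the keywords of a corpus; and more.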
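As a rough illustration of functions (1) and (3), the sketch below builds a frequency list and counts window-based collocates in plain Python. It is not CQPweb's implementation, which works on server-side indexes and uses statistical association measures. The function names `frequency_list` and `collocates`, the toy token list, and the fixed window size are all assumptions made for this example.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word, count) pairs sorted by descending count, then alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# Hypothetical toy corpus, already tokenized and lower-cased.
TOKENS = "the cat sat on the mat and the dog sat on the rug".split()
FREQ = frequency_list(TOKENS)
SAT_COLLOCATES = collocates(TOKENS, "sat", window=1)

if __name__ == "__main__":
    for word, count in FREQ[:3]:
        print(word, count)
    print(SAT_COLLOCATES.most_common())
```

Raw co-occurrence counts like these favor very frequent words, so a real collocation analysis would score each pair with an association measure rather than rank by count alone.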

Contemporary English Corpus 20,000-Word Frequency List: Detailed Explanation


Three sample texts follow for the reader's reference.

Sample 1

Contemporary English Corpus 20,000-word frequency list: Detailed Explanation

Introduction

The Contemporary English Corpus (CEC) is a vast collection of text data gathered from various sources, including published works, websites, and other media. This corpus serves as a valuable resource for analyzing language use and trends, providing insights into vocabulary usage, syntax patterns, and other linguistic features. The 20,000-word frequency list from the CEC offers a glimpse into the most commonly used words in contemporary English, shedding light on the language's current state and dynamics.

The 20,000-word frequency list

The 20,000-word frequency list is a valuable tool for linguists, language learners, and researchers interested in understanding the most prevalent words in English. This list ranks words based on their frequency of occurrence in the CEC data, with the most common words appearing at the top of the list. By studying this list, we can gain a better understanding of which words are most commonly used in everyday communication and written text.

Understanding the frequency list

The frequency list provides valuable insights into the distribution of words in English text. At the top of the list, we find common words such as "the," "and," "to," and "of," which are essential components of most sentences. These words serve as building blocks for constructing meaning and conveying information. As we move down the list, we encounter words that are more specific or less frequently used but still play important roles in language use.

Implications for language learning

For language learners, the frequency list can be a helpful resource for prioritizing vocabulary acquisition. By focusing on learning the most common words first, learners can improve their ability to understand and produce English text effectively.
The frequency list can also aid in identifying patterns of word usage and collocations, helping learners develop a more nuanced understanding of English vocabulary and syntax.

Research applications

Researchers in the fields of linguistics and language studies can benefit from the 20,000-word frequency list by using it to analyze language use and patterns in the CEC data. By comparing word frequencies across different genres, time periods, or populations, researchers can gain insights into changes in language use, emerging linguistic trends, and cultural influences on language. The frequency list can also serve as a benchmark for evaluating language proficiency and language processing abilities in experimental studies.

Limitations of the frequency list

While the 20,000-word frequency list offers valuable insights into language use, it is essential to recognize its limitations. The list is based on a specific corpus of data and may not be representative of all English language use. Certain words or expressions may be more prevalent in spoken language but less common in written text, leading to discrepancies in frequency rankings. Additionally, the list may not capture nuances in meaning or usage that stem from contextual factors or stylistic preferences.

Conclusion

The 20,000-word frequency list from the Contemporary English Corpus provides a valuable resource for understanding language use and trends in contemporary English. By analyzing the most commonly used words in English text, we can gain insights into language dynamics, vocabulary usage, and linguistic patterns.
Whether for language learning, research purposes, or practical applications, the frequency list offers a wealth of information for anyone interested in exploring the richness and complexity of the English language.

Sample 2

Title: Detailed Explanation of the Contemporary English Language Corpus 20,000 Word Frequency Table

Introduction

The Contemporary English Language Corpus, commonly known as the CELEX, is a corpus of the English language that contains more than 20,000 words. This corpus is widely used by linguists and researchers to study the English language and its usage patterns. One of the key components of the CELEX is the 20,000 word frequency table, which provides a detailed breakdown of the most commonly used words in the English language. In this article, we will provide a comprehensive explanation of the 20,000 word frequency table and its significance in linguistic research.

The 20,000 Word Frequency Table

The 20,000 word frequency table is a list of the 20,000 most frequently used words in the English language. These words are ranked in order of their frequency, with the most commonly used word at the top of the list and the least commonly used word at the bottom. The table provides detailed information about each word, including its frequency of use, part of speech, and contextual information.

Frequency of Use

The frequency of use of a word is determined by the number of times it appears in a given corpus of text. In the case of the CELEX, the 20,000 word frequency table is compiled from a large collection of written and spoken texts, including books, newspapers, and transcripts of conversations. By analyzing the frequency of words in these texts, linguists can gain insights into the most common words used in the English language and how they are used in different contexts.

Part of Speech

Each word in the 20,000 word frequency table is also categorized by its part of speech, such as noun, verb, adjective, adverb, or preposition.
This information is useful for researchers who are interested in studying the grammatical structure of the English language and how different parts of speech are used in combination to form sentences and convey meaning.

Contextual Information

In addition to frequency of use and part of speech, the 20,000 word frequency table also provides contextual information about each word. This includes details such as synonyms, antonyms, collocations, and examples of how the word is used in sentences. By examining this information, researchers can gain a deeper understanding of the nuances of meaning and usage that are associated with each word.

Significance in Linguistic Research

The 20,000 word frequency table is a valuable resource for linguistic research in several ways. Firstly, it provides a comprehensive overview of the most commonly used words in the English language, which is essential for understanding the basic vocabulary and structure of the language. Secondly, by analyzing the frequency and distribution of words in different contexts, researchers can uncover patterns of language use and identify key features of English language usage.

Furthermore, the 20,000 word frequency table can be used to compare and contrast the usage of words in different genres of text, such as fiction, non-fiction, academic writing, and conversation. This can help researchers to understand how language varies in different contexts and how meanings and interpretations of words can change depending on the genre of text in which they are used.

Conclusion

In conclusion, the 20,000 word frequency table of the Contemporary English Language Corpus is a valuable resource for linguistic research. By providing detailed information about the most commonly used words in the English language, including their frequency of use, part of speech, and contextual information, the table offers insights into the structure and usage of the English language.
Researchers can use this information to study language patterns, examine language variation, and gain a deeper understanding of how words are used in different contexts.

Sample 3

Contemporary English Corpus 20,000-Word Frequency List: Detailed Explanation

The Contemporary English Corpus 20000 Word Frequency List is a valuable resource for linguists, language learners, and researchers interested in the frequency of words in the English language. This list provides information on the frequency of the most commonly used words in contemporary English, based on a corpus of 20,000 words.

The word frequency list is a useful tool for language learners who want to focus on learning the most common words in English. By studying the words that appear most frequently in the corpus, learners can build a solid foundation of vocabulary that will help them communicate effectively in both spoken and written English. This list can also be useful for teachers who want to design language learning materials that target the most commonly used words in English.

In addition to its value for language learners, the word frequency list can also be a valuable resource for researchers and linguists interested in the structure and usage of the English language. By analyzing the frequency of words in the corpus, researchers can gain insights into the patterns of language use in contemporary English. This information can be used to inform studies on language acquisition, language variation, and the evolution of the English language over time.

The word frequency list is organized in descending order, with the most common words at the top of the list and less common words towards the bottom. Each word is accompanied by its frequency and percentage of occurrence in the corpus, providing users with a clear picture of the relative importance of each word in contemporary English.

One of the key advantages of the word frequency list is its ability to provide a snapshot of the language at a specific point in time.
By analyzing the frequency of words in the corpus, researchers can gain insights into the changing trends in language use and the emergence of new words and expressions in contemporary English.

Overall, the Contemporary English Corpus 20000 Word Frequency List is a valuable resource for anyone interested in studying the English language. Whether you are a language learner looking to expand your vocabulary, a teacher designing language learning materials, or a researcher studying language use, this list provides a detailed and comprehensive overview of the most commonly used words in contemporary English.
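The samples above mention comparing word frequencies across genres and reporting each word's share of the corpus. Raw counts are not comparable when sub-corpora differ in size, so a common step is to normalize to occurrences per million words. The sketch below shows that arithmetic. The genre names, corpus sizes, and counts are invented for illustration and do not come from any real corpus.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


def percentage(count, corpus_size):
    """Express a raw count as a percentage of all tokens in the corpus."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 100 / corpus_size


# Hypothetical sub-corpora: size in tokens, and raw count of one target word.
GENRES = {
    "fiction": {"size": 2_000_000, "count": 150},
    "academic": {"size": 500_000, "count": 90},
}

NORMALIZED = {
    name: per_million(info["count"], info["size"]) for name, info in GENRES.items()
}

if __name__ == "__main__":
    for name, value in NORMALIZED.items():
        print(f"{name}: {value:.1f} per million")
```

In this invented example the word has the higher raw count in fiction (150 against 90) but is more than twice as frequent in the academic sub-corpus once size is taken into account.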

Modern Chinese Character Frequency List


Notes:

∙ Column 1: Serial number
∙ Column 2: Character
∙ Column 3: Individual raw frequency
∙ Column 4: Cumulative frequency (%)
∙ Column 5: Pinyin. Note: the pinyin is taken from CEDICT: Chinese-English Dictionary (/cedict.html), the online HSK word list (/vocabulary/), and the GBK character list of the Hanyu Pinyin and Input Method forum (/bbs/1951/). Digits indicate tones; a syllable without a digit is in the neutral tone. This information is provided for reference only, and its accuracy has not been verified.
∙ Column 6: English translation. Note: the English translations come from CEDICT: Chinese-English Dictionary (/cedict.html). The data currently used is the version released on 21 December 2005. This information is provided for reference only, and its accuracy has not been verified.
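The column layout described in the notes can be read programmatically. The sketch below parses a few rows of the list, splits a pinyin syllable into its base and tone digit, and checks how column 4 relates to column 3. The list does not state the total character count. The code estimates it from two adjacent rows, on the assumption that each cumulative value is the running sum of raw frequencies divided by one fixed corpus total. The names `parse_row`, `split_tone`, and `estimate_total` are hypothetical, and the regular expression is an assumption about the row format, not a documented specification.

```python
import re

# Serial number, character, raw frequency, cumulative %, pinyin, English gloss.
# The pattern tolerates the character and the frequency being fused ("人1867999").
ROW_PATTERN = re.compile(r"^(\d+)\s+(\S)\s*(\d+)\s+([\d.]+)\s+(\S+)\s+(.*)$")


def parse_row(line):
    """Parse one row of the character frequency list into a dict."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    serial, char, freq, cumulative, pinyin, gloss = match.groups()
    return {
        "serial": int(serial),
        "char": char,
        "freq": int(freq),
        "cumulative": float(cumulative),
        "pinyin": pinyin.split("/"),
        "gloss": gloss,
    }


def split_tone(syllable):
    """Split 'zai4' into ('zai', 4); a syllable with no digit is neutral tone (None)."""
    if syllable and syllable[-1].isdigit():
        return syllable[:-1], int(syllable[-1])
    return syllable, None


def estimate_total(previous, current):
    """Estimate the corpus size from the cumulative-percentage step between two rows."""
    step = (current["cumulative"] - previous["cumulative"]) / 100
    return current["freq"] / step


# Rows 6 to 9 copied from the list; rows with masked digits are avoided.
SAMPLE = [
    "6 在2009181 10.3173671567 zai4 (located) at/in/exist",
    "7 人1867999 11.2827212715 ren2 man/person/people",
    "8 有1782004 12.2036344486 you3 to have/there is/there are/to exist/to be",
    "9 我1690048 13.0770261318 wo3 I/me/myself",
]
ROWS = [parse_row(line) for line in SAMPLE]
TOTAL = estimate_total(ROWS[0], ROWS[1])
# Predict row 8's cumulative value from row 7 and the estimated total.
PREDICTED_ROW_8 = ROWS[1]["cumulative"] + ROWS[2]["freq"] / TOTAL * 100

if __name__ == "__main__":
    print(f"estimated corpus size: {TOTAL:,.0f} characters")
    print(f"predicted cumulative % for row 8: {PREDICTED_ROW_8:.4f}")
```

If the prediction for row 8 agrees with the listed value, that supports reading column 4 as a running percentage of one corpus total. The estimate remains an inference from the data, not a figure given in the source.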

1 的 7922684 4.0943******* de/di2/di4 (possessive particle)/of, really and truly, aim/clear
2 一 3050722 5.67089309742 yi1 one/1/single/a(n)
3 是 2615490 7.022******** shi4 is/are/am/yes/to be
4 不 2237915 8.17906065392 bu4/bu2 (negative prefix)/not/no
5 了 2128528 9.27905228304 le/liao3/liao4 (modal particle intensifying preceding clause)/(completed action marker), to know/to understand/to know, clear, look afar from a high place
6 在 2009181 10.3173671567 zai4 (located) at/in/exist
7 人 1867999 11.2827212715 ren2 man/person/people
8 有 1782004 12.2036344486 you3 to have/there is/there are/to exist/to be
9 我 1690048 13.0770261318 wo3 I/me/myself
10 他 1595761 13.9016916951 ta1 he/him
11 这 1552042 14.7037639291 zhe4/zhei4 this/these, this/these/(sometimes used before a measure word, especially in Beijing)
12 个 1199580 15.3236890409 ge4 (a measure word)/individual
13 们 1169853 15.9282516811 men (plural marker for pronouns and a few animate nouns)
14 中 1104541 16.4990620505 zhong1/zhong4 within/among/in/middle/center/while (doing sth)/during/China/Chinese, hit (the mark)
15 来 1079469 17.056915583 lai2 to come
16 上 1069575 17.6096560434 shang4 on/on top/upon/first (of two parts)/previous or last (week, etc.)/upper/higher/above/previous/to climb/to go into/above/to go up
17 大 1054064 18.1543806496 da4/dai4 big/huge/large/major/great/wide/deep/oldest/eldest, doctor
18 为 1039036 18.6913390088 wei2/wei4 act as/take...to be/to be/to do/to serve as/to become, because of/for/to
19 和 1010465 19.2135322999 he2/he4/huo2/huo4 and/together with/with/peace/harmony/union, cap (a poem)/respond in singing, soft/warm, mix together/to blend
20 国 985350 19.7227465323 guo2 country/state/nation
21 地 969349 20.2236916858 de/di4 (subor. part.
adverbial)/-ly, earth/ground/field/place/land22 到965035 20.7224074283 dao4 to (a place)/until (a time)/up to/to go/to arrive23 以910627 21.1930059251 yi3 to use/according to/so as to/in order to/by/with/because/Israel (abbrev.)24 说874977 21.6451810318 shui4/shuo1 persuade (politically), to speak/to say25 时833532 22.0759379787 shi2o'clock/time/when/hour/season/period26 要811011 22.4950564076 yao1/yao4demand/ask/request/coerce, important/vital/to want/to be going to/must27 就771108 22.8935535592 jiu4 at once/then/right away/only/(emphasis)/to approach/to move towards/to undertake28 出755256 23.2838586328 chu1 to go out/to come out/to occur/to produce/to go beyond/to rise/to put forth/to occur/to happen/(a measure word for dramas, plays, or operas)29 会734888 23.6636378269 hui4/kuai4 can/bepossible/be able to/to assemble/to meet/to gather/tosee/union/group/association, to balance an account/accounting30 可723108 24.037329292 ke3 can/may/ableto/certain(ly)/to suit/(particle used for emphasis)31 也710259 24.404380585 ye3 also/too32 你705205 24.7688200459 ni3 you33 对703632 25.1324466038 dui4 couple/pair/to be opposite/to oppose/to face/for/to/correct (answer)/to answer/toreply/to direct (towards sth)/right34 生682031 25.4849100859 sheng1 to be born/to givebirth/life/to grow35 能665358 25.8287572096 neng2can/may/capable/energy/able36 而649239 26.1642742736 er2 and/as well as/but (not)/yet (not)/(shows causal relation)/(shows change of state)/(shows contrast)37 子640640 26.4953475023 zi3/zi 11 p.m.-1 a.m./1st earthly branch/child/midnight/son/child/seed/egg/small thing, (noun suff.)38 那638538 26.8253344486 na3/na4/nei4 how/which,that/those, that/those/(sometimes used before a measure word, especially in Beijing)39 得630688 27.1512646316 de2/de/dei3obtain/get/gain/proper/suitable/proud/contented/allow/permit/ ready/finished, a sentence particle used after a verb to showeffect/degree or possibility, to have to/must/ought to/to need to40 于630524 27.4771100619 yu2 
(surname),in/at/to/from/by/than/out of41 着626326 27.8007860281 zhao1/zhao2/zhe/zhu4/zhuo2catch/receive/suffer, part. indicates the successful result of a verb/to touch/to come in contact with/to feel/to be affected by/to catch fire/to fall asleep/to burn, -ing part. (indicates an action in progress)/part. coverb-forming after some verbs, to make known/to show/to prove/to write/book/outstanding, to wear (clothes)/tocontact/to use/to apply42 下621185 28.121805202 xia4 under/second (of two parts)/next (week, etc.)/lower/below/underneath/down(wards)/to decline/to go down/latter43 自611687 28.4379159507 zi4 from/self/oneself/since44 之609003 28.752639648 zhi1 (literary equivalent of 的)/(subor. part.)/him/her/it45 年601887 29.0636859024 nian2 year46 过589925 29.3685503729 guo4 (experienced action marker)/to cross/to go over/to pass (time)/to celebrate (a holiday)/to live/to get along/(surname)/excessively/too-47 发572904 29.6646186437 fa1/fa4 to send out/to show (one's feeling)/to issue/to develop, hair48 后570764 29.9595809943 hou4 empress/queen/surname, back/behind/rear/afterwards/after/later49 作542791 30.2400873144 zuo4 to regard as/to take (somebody) for/to do/to make50 里537795 30.5180117759 li3inside/internal/interior, village/within/inside, Chinesemile/neighborhood/li, a Chinese unit of length = one-halfkilometer/hometown51 用535480 30.7947398798 yong4 to use52 道534695 31.0710623073 dao4direction/way/method/road/path/principle/truth/reason/skill/m ethod/Tao (of Taoism)/a measure word/to say/to speak/to talk53 行531848 31.3459134476 hang2/xing2/xing4 arow/profession/professional, all right/capable/competent/OK/okay/to go/to do/to travel/temporary/to walk/to go/will do, behavior/conduct54 所523028 31.6162065431 suo3 actually/place55 然511026 31.8802971833 ran2correct/right/so/thus/like this/-ly56 家509790 32.1437490771 jia1 furniture/tool,-ist/-er/-ian/home/family/a person engaged in a certain art or profession57 种503344 32.4038697739 
zhong3/zhong4kind/type/race/breed/seed/species (taxonomy), to plant/to cultivate58 事499172 32.6618344431 shi4matter/thing/item/work/affair59 成499007 32.9197138428 cheng2/cheng4finish/complete/accomplish/become/turn into/win/succeed/one tenth, finish/complete/accomplish/become/turn into/win/succeed/one tenth60 方492763 33.1743664362 fang1square/quadrilateral/direction/just61 多481689 33.4232961509 duo1 many/much/a lotof/numerous/multi-62 经481338 33.672044474 jing1 classics/sacredbook/pass through/to undergo/scripture63 么477969 33.9190517481 ma/me/yao1 (interrog. part.), (interrog. suff.), one on dice/small64 去476270 34.1651810041 qu4 to go/to leave/to remove65 法466816 34.4064245736 fa3 law/method/way/Buddhist teaching/Legalist/France (abbrev.)66 学464261 34.646347757 xue2learn/study/science/-ology67 如449036 34.8784028867 ru2 as (if)/such as68 都439568 35.1055650948 dou1/du1 all/both (if two things are involved)/entirely (due to)each/even/already,(surname)/metropolis/capital city69 同437611 35.3317159543 tong2like/same/similar/together/alike/with70 现433960 35.5559800314 xian4appear/present/now/existing/current71 当429274 35.7778224533 dang1/dang4 to be/to actas/manage/withstand/when/during/ought/should/matchequally/equal/same/obstruct/just at (a time or place)/on thespot/right/just at, at or in the very same.../topawn/suitable/adequate/fitting/proper/replace/represent72 没428146 35.9990819415 mei2/mo4 (negative prefix for verbs)/have not/not, drowned/to end/to die/to inundate73 动426839 36.2196659916 dong4 to use/to act/to move/to change74 面425180 36.4393926952 mian4fade/side/surface/aspect/top/face/flour/noodles,flour/noodles75 起424933 36.6589917528 qi3 to rise/to raise/to get up76 看424616 36.8784269896 kan1/kan4 to look after/to take care of/to watch/to guard, it depends/think/to see/to look at77 定422538 37.0967883468 ding4 to set/to fix/to determine/to decide/to order78 天419884 37.3137781563 tian1 day/sky/heaven79 分419382 37.5305085396 fen1/fen4 todivide/minute/(a measure 
word)/(a unit of length = 0.33 centimeter), part 80 还415855 37.7454162218 hai2/huan2/huan4 also/in addition/more/still/else/still/yet/(not) yet, (surname)/payback/return81 进412166 37.9584174836 jin4 advance/enter/to come in82 好411866 38.1712637099 hao3/hao4 good/well, be fond of83 小410987 38.383655682 xiao3 small/tiny/few/young84 部403066 38.5919541991 bu4ministry/department/section/part/division/troops/board/(a measure word)/(a measure word for works of literature, films, machines, etc.)85 其403028 38.8002330784 qi2his/her/its/theirs/that/such/it (refers to sth preceding it) 86 些400528 39.0072199948 xie1 some/few/several/(a measure word)87 主399693 39.2137753956 zhu3 to own/tohost/master/lord/primary88 样398149 39.4195328802 yang4manner/pattern/way/appearance/shape89 理398087 39.6252583241 li3reason/logic/science/inner principle or structure90 心392228 39.8279559239 xin1 heart/mind91 她388612 40.028******* ta1 she92 本388118 40.2293584415 ben3 roots or stems ofplants/origin/source/this/the current/root/foundation/basis/(a measure word)93 前381431 40.4264763122 qian2 before/infront/ago/former/previous/earlier/front94 开377012 40.6213105094 kai1 open/operate(vehicle)/start95 但374459 40.8148253542 dan4but/yet/however/only/merely/still96 因371612 41.0068689116 yin1 cause/reason/because97 只370059 41.1981099018 qi2/zhi1/zhi3earth-spirit/peace, (a measure word, for birds and some animals, etc.)/single/only, M for one of a pair, only/merely/just/but, but/only 98 从369958 41.3892986966 cong1/cong2/zong4lax/yielding/unhurried, from/obey/observe/follow, second cousin99 想368819 41.5798988732 xiang3 to think/to believe/to suppose/to wish/to want/to miss100 实368494 41.7703310946 shi2real/true/honest/really/solid101 日363763 41.9583184056 ri4 Japan/day/sun/date/day of the month102 军362526 42.1456664533 jun1 army/military/arms103 者360974 42.3322124505 zhe3 -ist, -er(person)/person (who does sth)104 意360232 42.5183749931 yi4idea/meaning/wish/desire/(abbr.) 
Italy105 无359265 42.7040378045 wu2 -less/not tohave/no/none/not/to lack/un-106 力359136 42.8896339507 li4 power/force/strength 107 它346173 43.0685310111 ta1 it108 与345532 43.2470968122 yu2/yu3/yu4 (interrog. part.), and/to give/together with, take part in109 长342148 43.4239138125 chang2/zhang3length/long/forever/always/constantly, chief/head/elder/to grow/to develop110 把340730 43.5999980114 ba3/ba4 (a measure word)/(marker for direct-object)/to hold/to contain/to grasp/to take hold of, handle 111 机339823 43.7756134862 ji1machine/opportunity/secret112 十338954 43.9507798748 shi2 ten/10113 民335796 44.1243142558 min2 thepeople/nationality/citizen114 第325780 44.292672517 di4 (prefix before a number, for ordering numbers, e.g. "first", "number two", etc)115 公318626 44.4573336973 gong1 just/honorable (designation)/public/common116 此318534 44.6219473334 ci3 this/these117 已318143 44.7863589065 yi3 already/tostop/then/afterwards118 工317532 44.9504547239 gong1work/worker/skill/profession/trade/craft/labor119 使316273 45.1138999088 shi3 to make/to cause/to enable/to use/to employ/messenger120 情312900 45.2756019774 qing2feeling/emotion/passion/situation121 明309873 45.4357397375 ming2 clear/bright/tounderstand/next/the Ming dynasty122 性309844 45.5958625107 xing4sex/nature/surname/suffix corresponding to -ness or -ity123 知306384 45.7541972074 zhi1 to know/to be aware124 全305248 45.9119448362 quan2all/whole/entire/every/complete125 三304272 46.0691880827 san1 three/3126 又302933 46.2257393539 you4 (once)again/also/both... 
and.../again127 关302473 46.3820529039 guan1 (surname)/mountainpass/to close/to shut/to turn off/to concern/to involve128 点300800 46.5375018724 dian3 (downwards-right convex character stroke)/o'clock/(a measure word)/point/dot/(decimal) point) 129 正296810 46.6908888683 zheng1/zheng4 Chinese 1st month of year, just (right)/main/upright/straight/correct/principle 130 业296169 46.8439446048 ye4business/occupation/study131 外295645 46.9967295459 wai4 outside/inaddition/foreign/external132 将295586 47.1494839968 jiang1/jiang4 (will, shall, "future tense")/ready/prepared/to get/to use, a general133 两294600 47.3017288974 liang3 both/two/ounce/some/a few/tael134 高293542 47.4534270394 gao1 high/tall135 间292227 47.604445609 jian1/jian4between/among/space/(measure word), interstice/separate136 由292199 47.7554497085 you2 follow/from/it is for...to/reason/cause/because of/due to/by/to/to leave it (to sb) 137 问288237 47.9044063054 wen4 to ask138 很284252 48.0513035135 hen3 very/extremely139 最283689 48.1979097716 zui4 (the) most/-est140 重282843 48.3440788294 chong2/zhong4 to double/to repeat/repetition/iteration/again/a layer, heavy/serious141 并281616 48.4896137919 bing4 and/furthermore/(not)at all/simultaneously/also/together with/to combine/to join/to merge, amalgamate/combine, and/also/together with142 物281146 48.6349058654 wu4 thing/object/matter143 手280442 48.7798341221 shou3 hand/convenient144 应280384 48.9247324053 ying1/ying4 ought, (surname)/to answer/to respond145 战278776 49.068799698 zhan4 tofight/fight/war/battle146 向276986 49.2119419453 xiang4direction/part/side/towards/to/guide/opposite to,guide/opposite to147 头274911 49.3540118635 tou2/tou head, suff. 
for nouns148 文274222 49.4957257167 wen2language/culture/writing/formal/literary149 体273792 49.6372173523 ti3 body/form/style/system 150 政269878 49.7766862908 zheng4political/politics/government151 美269452 49.9159350789 mei3 America/beautiful152 相269125 50.0550148783 xiang1/xiang4 each other/one another/mutually, appearance/portrait/picture153 见269080 50.1940714223 jian4/xian4 to see/tomeet/to appear (to be sth)/to interview, appear154 被268905 50.333037529 bei4 by (marker forpassive-voice sentences or clauses)/quilt/blanket/to cover/to wear 155 利268813 50.4719560914 li4advantage/benefit/profit/sharp156 什267869 50.6103868086 shen2/shi2 what, tenth (used in fractions)157 二267506 50.7486299328 er4 two/2158 等266296 50.8862477471 deng3 class/rank/grade/equal to/same as/wait for/await/et cetera/and so on159 产265164 51.023******* chan3 to reproduce/to produce/give birth/products/produce/resources/estate/property160 或260786 51.1580508886 huo4maybe/perhaps/might/possibly/or161 新253921 51.2892734868 xin1 meso- (chem.)/new/newly 162 己250894 51.4189317764 ji3 6th heavenly stem/self 163 制250754 51.5485177161 zhi4 system/to make/to manufacture/to control/to regulate, manufacture164 身249725 51.6775718838 shen1body/torso/person/life/status/pregnancy/(a measure word used for clothes) suit165 果246831 51.8051304754 guo3 fruit/result166 加243648 51.9310441399 jia1 to add/plus167 西243619 52.0569428176 xi1 west168 斯240900 52.1814363565 si1 (phonetic)/this169 月240566 52.3057572892 yue4 moon/month170 话240067 52.4298203462 hua4 dialect/language/spoken words/speech/talk/words/conversation/what someone said171 合239277 52.5534751428 ge3/he2 one-tenth of a peck, Chinese musical note/fit/to join172 回239243 52.6771123688 hui2 (a measure word formatters or actions) a time/to circle/to go back/to turn around/to answer/to return/to revolve/Islam173 特239091 52.8006710434 te2/te4special/unusual/extraordinary, male animal/special (-ly)174 代231734 52.9204277298 
dai4substitute/replace/generation/dynasty/geologicalera/era/age/period175 内231331 53.0399761518 nei4inside/inner/internal/within/interior176 信230248 53.1589648955 xin4 letter/true/tobelieve/sign/evidence177 表226768 53.2761552269 biao3 surface/exterior/to watch/to show/express/an example/a list or table/a meter/awatch/chart/external178 化224729 53.3922918334 hua4 to make into/to change into/-ization/to ... -ize/to transform179 老223050 53.5075607577 lao3 (a prefix used before the surname of a person or a numeral indicating the order of birth of the children in a family to indicate affection or familiarity)/old (of people) 180 给217815 53.6201243118 gei3/ji3 to/for/for the benefit of/to give/to allow/to do sth (for sb)/(passive particle), to supply/provide181 世214990 53.7312279479 shi4life/age/generation/era/world/lifetime182 位214983 53.8423279665 wei4position/location/(measure word for persons)/place/seat183 次214857 53.9533628702 ci4 nth/number (oftimes)/order/sequence/next/second(ary)/(measure word)184 度214376 54.0641492003 du4capacity/degree/standard185 门212769 54.1741050566 men2opening/door/gate/doorway/gateway/valve/switch/way to do something/knack/family/house/(religious) sect/school (ofthought)/class/category/phylum or division (taxonomy)186 任212485 54.2839141459 ren4 to assign/toappoint/office/responsibility187 常212309 54.3936322811 chang2always/ever/often/frequently/common/general/constant188 先210628 54.5024817004 xian1 early/prior/former/in advance/first189 海209302 54.6106458627 hai3 ocean/sea190 通209046 54.7186777279 tong1 go through/know well/to connect/to communicate/open191 教208875 54.8266212229 jiao1/jiao4 teach,religion/teaching192 儿207827 54.9340231271 er2/er son, non-syllabic dimi. 
suff.193 原207579 55.0412968686 yuan2former/original/primary/raw/level/cause/source194 东206238 55.1478776012 dong1 east195 声206083 55.2543782321 sheng1 sound/voice/(a measure word, used for sounds)/tone/noise196 提205159 55.3604013535 di1/ti2 carry (suspended), to carry/to lift/to put forward/(upwards character stroke)/lifting (brush stroke in painting)/to mention197 立204985 55.4663345544 li4 set up/to stand198 及202671 55.5710719144 ji2 to reach/and199 比200645 55.6747622677 bi3/bi4 (particle used forcomparison and "-er than")/to compare/to contrast/to gesture (with hands)/ratio, associate with/be near200 员200217 55.778231437 yuan2 person/employee/member 201 解200090 55.8816349746 jie3/jie4/xie4 to separate/to divide/to break up/to loosen/to explain/to untie/to emancipate, transport under guard, (surname)202 水198933 55.9844405918 shui3 water/river203 名198481 56.0870126221 ming2 name/(measure word for persons)/place (e.g. among winners)204 真198416 56.1895510614 zhen1 real/true/genuine205 论197165 56.2914430025 lun2/lun4 the Analects (of Confucius), by the/per/discuss/theory/to talk (about)/to discuss206 处194237 56.3918217967 chu3/chu4 to reside/tolive/to dwell/to be in/to stay/get along with/to be in a position of/deal with, a place/location/spot/point/office/department/bureau/respect 207 走193619 56.4918812177 zou3 to walk/to go/to move 208 义193507 56.5918827587 yi4justice/righteousness/meaning209 各193435 56.6918470913 ge4 each/every210 入192433 56.7912936051 ru4 to enter211 几192355 56.8906998097 ji1/ji3 small table, almost, a few/how many, how much/how many/several/a few212 口191936 56.9898894813 kou3 mouth/(a measure word) 213 认191866 57.0890429779 ren4 to recognize/to know/to admit214 条191280 57.1878936385 tiao2 measure word for long, thin things (i.e. 
ribbon, river, etc.)/a strip/item/article
215 平 191267 57.2867375808 ping2 flat/level/equal/to make the same score/to tie/to draw/calm/peaceful
216 系 190769 57.3853241642 xi4 be/connection/relation/tie up/bind, be/system/to tie/department/faculty, connect/to tie
217 气 190687 57.4838683711 qi4 air/anger/gas, gas/air/smell/weather/vital breath/to make sb. angry/to get angry/to be enraged
218 题 189921 57.5820167207 ti2 topic/subject/to inscribe/to superscribe
219 活 189876 57.6801418149 huo2 to live/alive/living/work/workmanship
220 尔 189785 57.7782198817 er3 thus/so/like that/you/thou
221 更 187378 57.8750540467 geng1/geng4 to change, more/even more/further/still/still more
222 别 186634 57.9715037235 bie2/bie4 leave/depart/separate/distinguish/classify/other/another/do not/must not/to pin, contrary/difficult/awkward
223 打 186146 58.0677012092 da2/da3 dozen, beat/strike/break/mix up/build/fight/fetch/make/tie up/issue/shoot/calculate/since/from
224 女 185188 58.1634036147 nu:3/nv3 female/woman
225 变 185121 58.2590713956 bian4 to change/to become different/to transform/to vary/rebellion
226 四 184874 58.3546115306 si4 four/4
227 神 184554 58.4499862943 shen2 God/unusual/mysterious/soul/spirit/divine essence/lively/spiritual being
228 总 184470 58.5453176481 zong3 always/to assemble/gather/total/overall/head/chief/general/in every case
229 何 184335 58.6405792359 he2 carry/what/how/why/which
230 电 183834 58.7355819144 dian4 electric/electricity/electrical
231 数 183312 58.830314831 shu3/shu4/shuo4 to count, number/figure/to count/to calculate/several, frequently/repeatedly
232 安 183210 58.9249950355 an1 content/calm/still/quiet/to pacify/peace
233 少 183018 59.0195760173 shao3/shao4 few/little/lack, young
234 报 182411 59.1138433105 bao4 to announce/to inform/report/newspaper/recompense/revenge
235 才 181725 59.2077560891 cai2 ability/talent/endowment/gift/an expert/only (then)/only if/just, just/not until
236 结 181674 59.3016425116 jie1/jie2 knot/sturdy/to bear (fruit)/bond/to tie/to bind
237 反 181385 59.3953795833 fan3 wrong side out or up/anti-
238 受 180928 59.4888804841 shou4 to bear/to stand/to endure/(passive marker)/to receive
239 目 180827 59.5823291897 mu4 eye/item/section/list/catalogue/table of contents/order (taxonomy)/goal/name/title
240 太 180490 59.6756037386 tai4 highest/greatest/too (much)/very/extremely
241 量 180008 59.7686291971 liang2/liang4 to measure, capacity/quantity/amount/to estimate
242 再 179607 59.8614474248 zai4 again/once more/re-/second/another
243 感 178383 59.9536331075 gan3 to feel/to move/to touch/to affect
244 建 178289 60.0457702124 jian4 to establish/to found/to set up/to build/to construct
245 务 176085 60.1367683228 wu4 affair/business/matter
246 做 175580 60.2275054568 zuo4 to do/to make/to produce
247 接 175473 60.3181872947 jie1 to extend/to connect/to receive/to join
248 必 174963 60.4086055722 bi4 certainly/must/will/necessarily
249 场 173632 60.4983360087 chang3 a courtyard/open space/place/field/a measure word/(a measure word, used for sport or recreation)
250 件 172895 60.5876855746 jian4 a measure word for thing, clothes, item
251 计 171949 60.6765462617 ji4 to calculate/to compute/to count/reckon/ruse/to plan
252 管 170746 60.7647852563 guan3 to take care (of)/to control/to manage/to be in charge of/to look after/to run/tube/pipe
253 期 169806 60.852******* qi1 a period of time/phase/stage/(used for issue of a periodical, courses of study)/time/term/period/to hope
254 市 169522 60.9401449225 shi4 market/city
255 直 169439 61.027******* zhi2 straight/vertical/frank/directly/straightly/upright
256 德 168917 61.1150022735 de2 Germany/virtue/goodness/morality/ethics/kindness/favor/character/kind
257 资 168890 61.2022821149 zi1 resources/capital/to provide/to supply/to support/money/expense
258 命 168358 61.2892870266 ming4 life/fate
259 山 168142 61.3761803127 shan1 mountain/hill
260 金 167912 61.4629547382 jin1 metal/money/gold
261 指 166865 61.5491880897 zhi3 finger/to point/to direct/to indicate
262 克 166044 61.6349971606 ke4 gram/subdue/to restrain/to overcome, subdue
263 许 166034 61.7208010637 xu3 to allow/to permit/to praise/(surname)
264 统 165820 61.8064943747 tong3 to gather/to unite/to unify/whole
265 区 164775 61.8916476453 ou1/qu1 Ou (surname), area/region/district/small/distinguish
266 保 164424 61.9766195243 bao3 to defend/to protect/to insure or guarantee/to maintain/hold or keep/to guard
267 至 163582 62.0611562702 zhi4 arrive/most/to/until
268 队 162938 62.1453602064 dui4 squadron/team/group
269 形 162049 62.2291047207 xing2 to appear/to look/form/shape
270 社 159527 62.3115459029 she4 society/group
271 便 159479 62.3939622794 bian4/pian2 ordinary/plain/convenient/handy/easy/then/so/thus/to relieve oneself, advantageous/cheap
272 空 158539 62.4758928778 kong1/kong4 air/sky/empty/in vain, emptied/leisure
273 决 157959 62.5575237409 jue2 breach (a dyke)/to decide/to determine
274 治 156904 62.6386093957 zhi4 to rule/to govern/to manage/to control/to harness (a river)/cure/treatment/to heal
275 展 156759 62.7196201166 zhan3 to use/to spread out/to postpone/to unfold
276 马 155772 62.8001207706 ma3 horse/horse chess piece/Surname
277 科 155750 62.8806100553 ke1 branch of study/administrative section/division/field/branch/stage directions/family (taxonomy)/rules/laws/to mete out (punishment)/to levy (taxes, etc.)/to fine somebody
278 司 155229 62.960830095 si1 company/control
279 五 154316 63.0405783099 wu3 five/5
280 基 153462 63.1198851902 ji1 base/foundation/basic/radical (chem.)
281 眼 152079 63.1984773567 yan3 eye
282 书 151357 63.2766964043 shu1 book/letter
283 非 149915 63.3541702478 fei1 non-/not-/un-
284 则 149042 63.4311929378 ze2 (expresses contrast with a previous sentence or clause)/standard/norm/rule/to imitate/to follow/then/principle
285 听 148988 63.5081877215 ting1/ting4 listen/hear/obey, let/allow
286 白 148728 63.585048141 bai2 white/snowy/empty/blank/bright/clear/plain/pure/gratuitous
287 却 148679 63.661883238 que4 but/yet/however/while/to go back/to decline/to retreat/nevertheless
288 界 148438 63.7385937898 jie4 boundary/scope/extent/circles/group/kingdom (taxonomy)
289 达 147907 63.8150299287 da2 attain/pass through/achieve/reach/realize/clear/inform/notify/dignity
290 光 147639 63.8913275692 guang1 light/ray/bright
291 放 147600 63.9676050551 fang4 to release/to free/to let go/to put/to place/to let out
292 强 146665 64.0433993469 qiang2 strength/force/power/powerful/better
293 即 146459 64.1190871809 ji2 namely/right away/to approach/to draw near
294 像 145291 64.1941714099 xiang4 (look) like/similar (to)/appearance/to appear/to seem/image/portrait/resemble/seem
295 难 144597 64.26889699 nan2/nan4 difficult (to...)/problem/difficulty/difficult/not good, disaster/distress/to scold
296 且 144597 64.3436225702 qie3 further/moreover
297 权 144506 64.4183011228 quan2 authority/power/right
298 思 144503 64.4929781251 si1 to think/to consider
299 王 143948 64.5673683117 wang2 king/Wang (proper name)
300 象 143453 64.6415026896 xiang4 shape/form/appearance/elephant/image under a map (math.)
301 完 143120 64.7154649781 wan2 to finish/to be over/whole/complete/entire
302 设 142742 64.7892319218 she4 to set up/to arrange/to establish/to found/to display
303 式 142292 64.8627663122 shi4 type/form/pattern/style
304 色 142231 64.9362691787 se4/shai3 color/look/appearance, color/dice
305 路 141812 65.0095555122 lu4 (surname)/road/path/way
306 记 139501 65.0816475552 ji4 to remember/to note/mark/sign/to record
307 南 139078 65.1535209982 nan2 south
308 品 138528 65.2251102093 pin3 conduct/grade/thing/product/good
309 住 138503 65.2966865008 zhu4 to live/to dwell/to reside/to stop
310 告 138285 65.3681501332 gao4 to tell/to inform/to say
311 类 138280 65.4396111816 lei4 kind/type/class/category/similar/like/to resemble
312 求 138216 65.5110391558 qiu2 to seek/to look for/to request/to demand/to beseech
313 据 137990 65.5823503365 ju1/ju4 sickness of hand, act in accordance with/seize, according to/to act in accordance with/to depend on/to seize/to occupy
314 程 136962 65.6531302621 cheng2 rule/order/regulations/formula/journey/procedure/sequence/a surname
315 北 136939 65.7238983017 bei3 north
316 边 136277 65.7943242295 bian1 side/edge/margin/border/boundary
317 死 136194 65.8647072641 si3 to die/impassable/uncrossable/inflexible/rigid
318 张 136087 65.9350350027 zhang1 (a measure word)/(a surname)/open up
319 该 136053 66.0053451707 gai1 that/the above-mentioned/most likely/to deserve/should/ought to/owe
320 交 135727 66.0754868666 jiao1 to deliver/to turn over/to make friends/to intersect (lines)/to pay (money)
321 规 134528 66.1450089372 gui1 compass/rule

现代汉语连续口语语音语料库-现代汉语自然语音语料库


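2008/12/05
Documentary Corpora: Workshop on Collecting and Processing Natural Speech Data (文獻語料庫─自然語音語料收集與處理工作坊)

Outline
Data collection; data processing and annotation; data analysis and application

Vowel chart (read speech)

Vowel chart ("到" /tau/)

Annotation system
1. Annotation of speech
Special phonological phenomena; speech that cannot be identified or is hard to identify; disfluent speech flow; influence from other dialects or languages
2. Annotation of non-speech events
Human sounds: non-speech sounds that are clearly produced by a person, for example laughter, coughing and breathing.
Non-human sounds: indoor noise.

Speech annotation software: Praat
Functions: 1. recording, analysing and annotating speech; 2. speech synthesis; 3. extracting acoustic parameters; 4. user-written scripts can add further functions.

Mandarin Natural Speech Corpus (現代漢語自然語音語料庫)

Introduction to the corpora
Table columns: corpus content, collection period, corpus length
Mandarin Conversational Dialogue Corpus (MCDC; 現代漢語連續口語對話語音語料庫)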

附录一现代汉语语料库词频统计资料库说明

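3. Basic character-frequency table: this table lists the basic frequency data for every Chinese character that occurs in a given corpus unit. The fields are the character itself (a single character), serial number, radical, stroke count, frequency count, frequency rate, cumulative count, cumulative rate, file count and file rate. Of these, the character, serial number and frequency count come from the original project (under slightly different names); the rest were added for the web version.
4. File rate: the "frequency rate" is the number of times a character occurs in a corpus unit divided by the total number of characters in that unit, expressed as a percentage. The "file rate" is the character's "file count" (the number of files that contain the character) divided by the total number of files in the corpus unit, also expressed as a percentage.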
Character  No.  Strokes  Count  Rate  Cumulative count  Cumulative rate  Files  File rate
是  3  9  9755  1.483%  50634  7.697%  319  99.69%
不  4  4  8359  1.271%  58993  8.968%  317  99.06%
人  5  2  7107  1.080%  66100  10.05%  319  99.69%
在  6  6  6931  1.054%  73031  11.10%  319  99.69%
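The two percentages defined in item 4 and the running totals in the sample rows can be sketched as follows. This is a minimal illustration, not the project's own code: the function names are hypothetical, and `TOTAL_FILES` and `CUMULATIVE_BEFORE` are assumptions back-computed from the sample rows (the source does not state either value).

```python
from itertools import accumulate

def frequency_rate(count, total_chars):
    """Occurrences of a character as a percentage of all characters in the unit."""
    return 100.0 * count / total_chars

def file_rate(files_with_char, total_files):
    """Files containing the character as a percentage of all files in the unit."""
    return 100.0 * files_with_char / total_files

# (character, occurrences, files containing it), copied from the sample rows.
rows = [("是", 9755, 319), ("不", 8359, 317), ("人", 7107, 319), ("在", 6931, 319)]

# ASSUMPTIONS, not stated in the source:
TOTAL_FILES = 320          # chosen so that 319 files gives 99.69%
CUMULATIVE_BEFORE = 40879  # 50634 - 9755, the running count before the first sample row

# Running total of occurrences, starting from the assumed earlier total.
running = list(accumulate((n for _, n, _ in rows), initial=CUMULATIVE_BEFORE))[1:]

for (char, n, files), cum in zip(rows, running):
    print(char, n, cum, f"{file_rate(files, TOTAL_FILES):.2f}%")
```

`frequency_rate` is not called in the demo because the unit's total character count is not given; it would be used the same way as `file_rate` once that total is known.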
Appendix 3: Error sentences from beginning learners who confuse 的 and 地 (附錄三 初級學習者「的地不分」之偏誤例句)
1) 嚴格的說,我也沒有很用力的大他,只是為了管教起見,輕輕的打他一下。
2) 在那個時候,小黃很明顯的不高興。
3) 在電影裏,小慧的夫婚夫,也就是開小巴的阿文在一場車禍中意外的喪生了。
4) 請你們要仔細的聽,免得你們等一下不懂。
5) 電腦對人類有很大的貢獻,所以我們絕對要徹底的瞭解電腦的結構。
6) 繼續的讀下去了
7) 我們應該好好的保護它
8) 從那天起我才能正式的由學校的老師慢慢的教導
9) 都很積極的學中文
10) 就算是再困難再難懂也要努力的去突破
11) 雖然沒有像台灣的學生那麼的吃香
12) 有一回,母親心血來潮的問我
13) 這時母親疑惑的看著我:
14) 加上老师有声有色的讲述后
15) 深深的烙印在他們的心中
16) 辛苦的照顧她的小孩時
17) 媽媽是那麼細心、辛苦的照顧我
18) 辛勤的在田裡工作時
19) 小女孩卻高高興興的吃飯
20) 媽媽會毫不留情的拿起橡皮擦「嚄嚄」兩下
21) 儘管我們力竭聲嘶的吶喊
22) 就這樣含淚吞苦的寫了一年半多
23) 而我也可以毫無困難的寫出來時
24) 現在我終於可以很大聲很驕傲的說「我是個十足十會中文的台灣人了!」

语料库语言学术语汇编Aglossaryofcorpuslinguistics.docx


语料库语言学术语汇编 ( V2.0 )Last updated 2012-10-08 by许家金Aboutness所言之事Absolute frequency绝对频数Alignment (of parallel texts)(平行或对应)语料的对齐Alphanumeric字母数字构成的Annotate标注(动词)Annotated text/corpus标注文本 /语料库、赋码文本/语料库Annotation标注(名词)Annotation scheme标注方案ANSI/American National Standards Institute美国国家标准学会ASCII/American Standard Code for Information美国信息交换标准码ExchangeAssociates (of keywords)(主题词的)联想词AWL/academic word list学术词表Balanced corpus平衡语料库Base list/baselist底表、基础词表Bigram二元组、二元序列、二元结构Bi-text/bitext双语合并文本、双语分行对齐文本(一句源语一句目标语对齐后的文本)Bi-hapax两次词Bilingual corpus双语语料库Bootcamp debate/discourse/discussion(新手)训练营大辩论 /话语 /大探讨CA/Contrastive Analysis对比分析Case-sensitive/case sensitivity大小写敏感、区分大小写Category-based approach基于类(范畴)的方法Chi-square test/ 2χ卡方检验Chunk词块CIA/Contrastive Interlanguage Analysis中介语对比分析CLAWS/Constituent Likelihood Automatic Word-CLAWS 词性赋码系统tagging SystemClean text policy干净文本原则Cluster词簇、词丛Colligation类联接、类连接、类联结Collocate n./v.搭配词;搭配Collocability搭配强度、搭配力Collocation搭配、词语搭配Collocational strength搭配强度Collocational framework/frame搭配框架Collocational profile搭配概貌Collocational network搭配网络Comparable corpora类比语料库、可比语料库Computational Linguistics计算语言学ConcGram/concgram同现词列、框合结构Concord索引(行)(简略形式)Concordance (line)索引(行)Concordance plot(索引)词图Concordancer索引工具Concordancing索引分析Context语境、上下文Context word语境词Contextual prosody语境韵律Contingency table连列表、联列表、列连表、列联表Co-occurrence/Co-occurring共现、同现Corpus Linguistics语料库语言学Corpus, pl. 
corpora语料库Corpus-based基于语料库的Corpus-based translation studies基于语料库的翻译研究、语料库翻译学、基于语料库的译学研究Corpus-driven语料库驱动的Corpus-informed语料库指导下的、参考了语料库的Corpus size库容Corpus stylistics语料库文体学Co-select/co-selection/co-selectiveness共选(机制)Co-text共文Data mining数据挖掘DDL/Data Driven Learning数据驱动学习Dependency(句法)依存关系Dice coefficient Dice 系数Disambiguation消歧Diachronic corpus历时语料库Discourse话语、语篇Discourse prosody话语韵律Documentation文检报告、备检文件、说明文档EAGLES/Expert Advisory Groups on Language EAGLES 文本规格Engineering StandardsEmpirical linguistics实证语言学Empiricism经验主义Encoding字符编码Error-tagging错误标注、错误赋码Explicitation显化Extended unit of meaning扩展意义单位File-based search/concordancing批量检索Firthian (linguistics)弗斯(语言学)、弗斯学派的(语言学)Formulaic sequence程式化序列、套语Frequency频数、频率Frequency list词频表General (purpose) corpus通用语料库Genre语体、体裁Grammatical patterning语法型式Granularity颗粒度Hapax legomenon/hapax一次词Header/corpus head文本头、头标、头文件Hidden Markov model (HMM)隐马尔科夫模型、隐马模型Idiom principle习语原则、成语原则Idiomaticity习语性、地道程度Implicitation隐化Index/indexing(建)索引In-line annotation文内标注、行内标注Interlanguage中介语、过渡语Inter-coder agreement/reliability标注者间一致性/信度Introspection/introspective内省(式)(的)Intuition直觉Key keywords关键主题词Keyness主体性、关键性Keywords主题词KWIC/Key Word in Context语境中的关键词、语境共现(方式)KWIC sort语境共现排序、索引行排序Learner corpus学习者语料库Lemma, pl. 
lemmata/lemmas词目、原形词、词元Lemmatization词形还原、词元化Lemmatizer词形还原工具、词元化工具Lexical bundle词束Lexical density词汇密度Lexical frequency profile词频概貌Lexical grammar词汇语法Lexical item词项、词语项目Lexical patterning词语型式、词汇型式Lexical priming词汇触发理论、词汇启动理论Lexical profile词汇分布概貌Lexical richness词汇丰富度Lexico-grammar词汇语法Lexis词语、词项、词语学Log-likelihood ratio对数似然比、对数似然率Longitudinal/developmental corpus跟踪语料库、发展语料库、历时语料库Machine-readable机读的Machine translation机器翻译Manual annotation手工标注Markup/mark-up标记、置标MDA (Multi-dimensional analysis/approach)多维度分析法Metadata元信息Meta-metadata元元信息MF/MD approach/multi-feature/multi-dimensional多特征/多维度分析法analysisMisuse误用Monitor corpus(动态)监察语料库Monolingual corpus单语语料库Multilingual corpus多语语料库Multimodal corpus多模态语料库MWU/multiword unit多词单位MWE/multiword expression多词表达MI/mutual information互信息、互现信息N-gram N 元组、 N 元序列、 N 元结构、 N 元词、多词序列Neo-Firth (school)新弗斯学派Neo-Firthian新弗斯学派的NLP/Natural Language Processing自然语言处理Node (word)节点(词)Normalization标准化、(翻译)规范化、泛化Normalized frequency标准化频率、标称频率、归一频率Observed corpus观察语料库Ontology知识本体、本体Open choice principle开放选择原则OrthographicOrthography正字法Overuse过多使用、超用、使用过度、过度使用Paradigmatic纵聚合(关系)的Parallel corpus平行语料库、对应语料库Parole linguistics言语语言学Parsed corpus句法标注的语料库、树库Parser句法分析器Parsing句法标注、句法分析Pattern/patterning型式、模式Pattern grammar型式语法Pattern matching模式匹配Pedagogic corpus教学语料库Phraseology短语、短语学Phraseological unit/sequence短语单位 /序列Plain text纯文本POSgram赋码序列、码串POS sequence赋码序列、码串POS tagging/Part-of-Speech tagging词性赋码、词性标注、词性附码POS tagger词性赋码器、词性赋码工具Prefab预制语块Probabilistic(基于)概率的、概率性的、盖然的Probabilistic grammar概率语法、概率性语法、盖然语法Probability概率Query查询、检索Range分布(范围)、跨度Rationalism理性主义Raw frequency原始频数、生频数Raw text/corpus生文本 /生语料Reference corpus参照语料库Regex/RE/RegExp/regular expressions正则表达式、正则式Register variation语域变异Relative frequency相对频率Representative/representativeness代表性(的)Rule-based基于规则的S-universals源语型共性(特征)Sample n./v.样本;取样、采样、抽样Sampling取样、采样、抽样Sanitization净化Search term检索项Search word检索词Segmentation切分、分词Semantic association语义联想Semantic preference语义倾向、语义趋向Semantic prosody语义韵Sentence 
alignment句对齐、句级对齐SGML/Standard Generalized Markup Language标准通用标记语言Simplification简化Skipgram跨词序列、跨词结构Span跨距Specialized corpus专用语料库、专门用途语料库、专题语料库Standardized type/token ratio标准化类符 /形符比、标准化类/形比、标准化型次比Standardized TTR/STTR标准化类符 /形符比、标准化类/形比、标准化型次比Stand-off annotation分离式标注Stochastic随机的Stop list停用词表、过滤词表Stop word停用词、过滤词Synchronic corpus共时语料库Syntagmatic横组合(关系)的T score T 值T-universals目标语型共性(特征)Tag赋码、标记、附码Tagger赋码器、赋码工具、标注工具Tagging赋码、标注、附码Tag sequence赋码序列、码串Tagset赋码集、码集Tertium comparationis对比中立项、对比基础Text文本Text type文体、文类Text category文体、文类Text mining文本挖掘TEI/Text Encoding Initiative TEI 文本编码计划The Lexical Approach词汇中心教学法The Lexical Syllabus词汇大纲Token形符、词次Token definition/word definition形符界定、单词界定Tokenization分词Tokenizer分词工具Transcription转写Translation memory翻译记忆(库)Translation norms翻译规范Translationuniversals/Universal features of 翻译共性、翻译普遍特征translationTranslational corpus翻译语料库Translationese翻译体、翻译腔Treebank树库Trigram三元组、三元序列、三元结构T-score T 值Type类符、词种、词型TTR类符 /形符比、类 /形比、型次比Type/token ratio类符 /形符比、类 /形比、型次比Underuse少用、使用不足Unicode通用码Unicodify按通用码编码、转换为通用码Unit of meaning意义单位WaC/Web as Corpus网络语料库、网库Wildcard通配符Word alignment词对齐、词级对齐Word form词形Word family词族Word list词表Word sketch词语素描WSD/Word-sense disambiguation词义消歧XML/Extensible Markup Language可扩展标记语言Zipf ’ s Law/Zipfian Law齐夫定律Z score Z 值常用语料库ACE Australian Corpus of EnglishANC American National CorpusARCHER A Representative Corpus of Historical English Registers BASE British Academic Spoken English CorpusBAWE British Academic Written English CorpusBNC British National CorpusBoE Bank of EnglishBrown Brown CorpusCANCODE Cambridge and Nottingham Corpus of Discourse in English CEC China English CorpusCEM Corpus for English MajorsCHILDES Child Language Data Exchange SystemCIC Cambridge International CorpusCLEC Chinese Learners English CorpusCLOB2009 Brown family corpus of British EnglishCOBUILD Collins Birmingham University International Language Database COCA The Corpus of Contemporary American EnglishCOLSEC College Learners Spoken English 
CorpusCOLT Bergen Corpus of London Teenage LanguageCrown2009 Brown family corpus of American EnglishFLOB Freiburg-LOB Corpus of British EnglishFROWN Freiburg-Brown Corpus of American EnglishHelsinki Diachronic part of the Helsinki Corpus of English Texts DiachroniccorpusHKCSE Hong Kong Corpus of Spoken EnglishICE International Corpus of EnglishICE-GB International Corpus of English: Great BritainICLE International Corpus of Learner EnglishJEFLL Japanese EFL Learner CorpusLCMC Lancaster Corpus Mandarin ChineseLINDSEI Louvain International Database of Spoken English Interlanguage LIVAC Linguistic Variations in Chinese Speech CommunitiesLLC London Lund CorpusLOB Lancaster-Oslo/Bergen CorpusLOCNESS Louvain Corpus of Native English EssaysLONGDALE LONGitudinal DAtabase of Learner EnglishMICASE Michigan Corpus of Academic Spoken EnglishMICUSP Michigan Corpus of Upper-level Student PapersNESSIE Native English Speakers ’Similarly and Identically-prompted EssaysPACCEL Parallel Corpus of Chinese EFL LearnersSBCSAE Santa Barbara Corpus of Spoken American EnglishSCCSD The Spoken Chinese Corpus of Situated DiscourseSCORE Singapore Corpus of Research in EducationSEC Spoken English CorpusSECCL Spoken English Corpus of Chinese LearnersSECOPETS Spoken English Corpus of Public English Test SystemSEU Survey of English UsageSWECCL Spoken and Written English Corpus of Chinese Learners WECCL Written English Corpus of Chinese LearnersLast updated 2012-08-08 by许家金。

古代汉语语料库字频表ACCorpusCharacterlist


0.1407 0.1394 0.1384 0.1375 0.1374 0.1368 0.1351 0.1345 0.1341 0.1335 0.1332 0.1332 0.1327 0.132 0.1313 0.1296 0.1295 0.1295 0.1294 0.1294 0.1294 0.1285 0.1272 0.1267 0.1257 0.1253 0.1251 0.1249 0.1246 0.1244 0.1243 0.1239 0.1231 0.1224 0.1223 0.1222 0.122 0.1216 0.1216 0.1201 0.1187 0.1184 0.1179 0.1179 0.1176 0.1173 0.116
24.1426 24.4664 24.7825 25.0982 25.4113 25.7161 26.0146 26.3107 26.6018 26.8859 27.1671 27.4433 27.7183 27.99 28.2611 28.5305 28.7992 29.0664 29.333 29.5955 29.8577 30.1158 30.3647 30.6124 30.8574 31.1018 31.3407 31.5778 31.8143 32.0505 32.2829 32.5138 32.7447 32.9746 33.204 33.4314 33.6584 33.8832 34.1057 34.3275 34.5477 34.7528 34.9574 35.1618 35.3595 35.554 35.748
88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134

5000词频表word list


258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
Part of speech词性 a v c i a i t v i p p c i p p i i v v d p i c p a i d x x i p c c d v a v p v c v a
Frequency频率 22038615 12545825 10741073 10343885 10144200 6996437 6332195 4303955 3856916 3872477 3978265 3430996 3281454 3081151 2909254 2683014 2485306 2573587 1915138 1885366 1865580 1767638 1776767 1820935 1801708 1635914 1712406 1638830 1619007 1490548 1484869 1379320 1296879 1181023 1151045 1083029 1022775 1018283 992596 933542 925515 969591
172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214

[转载]语料库工具箱用户指南(ACWT)


[Repost] Corpus Toolbox User Guide (ACWT)
Original article: Corpus Toolbox User Guide (ACWT)
Author: gjxyxkgy
Home page: /alc/chinese/ACWT/ACWT.htm
Download: /alc/chinese/ACWT/ACWT.zip

1. What is the "corpus toolbox" ACWT?
The corpus toolbox (ACWT) is a set of modules (clips) embedded in the text editor NoteTab, together with Perl code and some other tools for processing Chinese and English text.

These tools help with the "dirty and tiring" corpus and discourse analysis and processing tasks that would normally require expensive and complex commercial software.

The toolbox currently contains the following components:

Text Utilities
Merge Files
HTML<-->Text Conversion
Tagged Text-->Plain Text Conversion (removes the tags from annotated text)
File comparison/sizes/counts (text comparison, file sizes, character counts, splitting and merging)
Chinese Character Spacing/Word Segmentation/POS Tagging

Search & Analysis
Basic Chinese Concordance
Basic English Concordance
Word List/Frequency
Mutual Info/T-Score/Z-Score/Log-likelihood
Normed Freq/Ratio/Lexical Density (normalized frequency, type/token ratio, lexical density)

Interactive Text Tagging
L2 Errors–The CLEC Tags (the CLEC error tagset for second-language learners)
Discourse Structure–Samples
Semantics & Pragmatics–Samples
Sociolinguistics–Samples
Syntax–Samples

Discourse Transcription
The Du Bois System (the latest Du Bois system for transcribing spoken language, 2005-08)
Header Info
Voice Quality
Turn Taking
Conversation Structure
Metalinguistic (metalinguistic features)

2. Installation
Running these components requires NoteTab version 4.5 or later, a Perl interpreter, and the related tools mentioned below.

(完整word版)现代汉语常用词表


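Common Words of Contemporary Chinese (Draft) (现代汉语常用词表(草案))

1. Scope
This standard (draft) presents 56,008 common words of Putonghua that are relatively stable and frequently used in contemporary social life. Together they form the list Common Words of Contemporary Chinese, which gives the written form of each word.

The standard (draft) may be consulted and adopted for Chinese teaching in primary and secondary schools, literacy education, Chinese language education, Chinese information processing, dictionary compilation and similar purposes.

2. Terms and definitions
2.1 Common words: words in contemporary Putonghua that are used frequently and across a wide range of contexts.

2.2 Word form: in this standard (draft), the written form of a word.

2.3 Word frequency: how often the same word occurs in a given amount of corpus text, generally expressed as the number of occurrences or as coverage.

In this standard (draft) it means the number of occurrences.

2.4 Frequency rank (频级): within one corpus survey, words with the same frequency count form one frequency rank.

Frequency ranks for this list are computed in two steps. Step 1: compute the frequency ranks within each type of corpus; these are the raw ranks.

Step 2: compute the rank for the whole corpus by summing a word's raw ranks and dividing the sum by the number of corpus types.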
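The two-step rank calculation can be sketched as below. This is a minimal sketch under stated assumptions, not the compilers' procedure: the source does not say how ranks are ordered or how a word missing from one corpus is handled, so the code assumes rank 1 is the highest frequency and only scores words that occur in every corpus. The function names and the toy counts are hypothetical.

```python
def raw_ranks(freqs):
    """Step 1: words with the same count share one rank (rank 1 = highest count, assumed)."""
    distinct_counts = sorted(set(freqs.values()), reverse=True)
    rank_of_count = {count: i + 1 for i, count in enumerate(distinct_counts)}
    return {word: rank_of_count[count] for word, count in freqs.items()}

def overall_ranks(corpora):
    """Step 2: sum each word's raw ranks and divide by the number of corpora.

    Assumption: only words present in every corpus receive an overall rank.
    """
    per_corpus = [raw_ranks(freqs) for freqs in corpora]
    shared = set.intersection(*(set(ranks) for ranks in per_corpus))
    return {w: sum(ranks[w] for ranks in per_corpus) / len(per_corpus) for w in shared}

# Toy counts for two hypothetical corpus types.
news = {"的": 100, "了": 50, "在": 50}
fiction = {"的": 80, "了": 30, "在": 60}
combined = overall_ranks([news, fiction])
print(sorted(combined.items(), key=lambda kv: kv[1]))
```

In the toy data 了 and 在 tie in the first corpus and therefore share raw rank 2 there, which is the behaviour the definition in 2.4 describes.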

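3. Compilation principles
3.1 Covering both words and phrases: the entries are mainly monosyllabic and disyllabic words.

In line with actual usage, the list also includes some common abbreviations, idioms (成语), habitual expressions (惯用语) and other set phrases, as well as other fixed phrases that name a single concept.

3.2 Balancing systematicity and practicality: the selection pays attention both to the systematic relations among words and to their practical use.

For example, among words built on the names of the seasons, all forms with 初 are included (初春, 初冬, 初秋, 初夏); of the "晚 + season" words only 晚春 and 晚秋 are included, not 晚冬 or 晚夏; and of the "残 + season" words only 残冬 is included, not 残春, 残秋 or 残夏.

4. Notes on Common Words of Contemporary Chinese (Draft)
4.1 During compilation, the collected words were compared with the entries in the core portion of the State Language Commission's Modern Chinese General Corpus (现代汉语通用语料库), Xiamen University's neologism corpus, 《现代汉语规范词典》, 《现代汉语词典》, 《新华词典》 and other sources. Each word's usage was also checked on the People's Daily family of pages at 人民网 and on widely used web sources such as Google's Simplified Chinese pages and Baidu.

4.2 The corpora used to measure word frequency were: 45 million characters of segmented and tagged text from the State Language Commission's Modern Chinese General Corpus; about 135 million characters of segmented and tagged People's Daily text from 2001 to 2005; and about 70 million characters from Xiamen University's corpus of modern and contemporary literary works.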

第一、二批异形词整理表


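First List of Variant Word Forms (《第一批异形词整理表》)
Issued by the Ministry of Education of the People's Republic of China and the State Language Commission (trial implementation from 31 March 2002)

1 Scope
This is a recommended standard issued for trial use.

Following the working policy of "proceeding actively and prudently, step by step, treating cases differently, and processing in batches", 338 groups of variant word forms (words and fixed phrases; not counting the 44 groups in the appendix) were selected as the first batch. They are in frequent use in written Putonghua and public preference among the forms is fairly clear. A recommended form is given for each group.

This standard applies to written Putonghua, including language teaching, news and publishing, dictionary compilation and information processing.

2 Normative references
First List of Variant Characters (第一批异体字整理表; issued 22 December 1955 by the Ministry of Culture of the People's Republic of China and the Chinese Script Reform Committee)
Scheme for the Chinese Phonetic Alphabet (汉语拼音方案; approved 11 February 1958 by the Fifth Session of the First National People's Congress of the People's Republic of China)
List of Putonghua Words with Variant Pronunciations (普通话异读词审音表; issued 27 December 1985 by the State Language Commission, the State Education Commission and the Ministry of Radio and Television)
General List of Simplified Characters (简化字总表; republished 10 October 1986 by the State Language Commission with the approval of the State Council)
List of Frequently Used Characters in Modern Chinese (现代汉语常用字表; issued 26 January 1988 by the State Language Commission and the State Education Commission)
List of Generally Used Characters in Modern Chinese (现代汉语通用字表; issued 25 March 1988 by the State Language Commission and the Press and Publication Administration of the People's Republic of China)
GB/T 16159-1996 Basic Rules for Hanyu Pinyin Orthography (汉语拼音正词法基本规则)

3 Terms
3.1 Variant forms of the same word (异形词): words in written Putonghua that coexist and are used side by side, are homophonous (in this standard: identical in initial, final and tone) and synonymous (in this standard: identical in conceptual, connotative and grammatical meaning), but differ in written form.

3.2 Variant forms of a Chinese character (异体字): characters that have the same sound and meaning as the prescribed standard character but are written differently.

In this standard the term refers specifically to the variant characters eliminated by the First List of Variant Characters.

3.3 Word form (word form/lexical form): in this standard, the written form of a word.

3.4 Corpus: in this standard, the language material from written Putonghua that is used for word-frequency counts.

3.5 Word frequency: how often the same word occurs in a given amount of corpus text, generally expressed as the number of occurrences or as coverage.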

COCA语料库,词频排名前20000


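COCA corpus: top 20,000 words by frequency
No. Word [pronunciation] Definition
1 consultant [kənˈsʌltənt] n. adviser; consulting physician, medical specialist
2 weigh [weɪ] vt. to measure the weight of, to assess; to press down heavily; to weigh (raise) anchor; vi. to have weight, to have influence; to press heavily; to weigh anchor
3 absence [ˈæbsəns] n. being away, non-attendance; lack, want
4 specialist [ˈspeʃəlɪst] n. (medical) specialist; expert; professional
5 criteria [kraɪ'tɪərɪə] n. standards
6 snap [snæp] n. a sharp cracking movement, a sudden break; v. to bite at, to bite off, to speak sharply, to slam shut
7 label [ˈleɪbl] n. tag, sticker, mark; title, nickname; v. to attach a label to, to describe (something) as
8 sweep [swi:p] n. sweeping, cleaning; a clearing out; field of view, range; a complete victory; vt. to sweep; to strum with the fingers, to pull sharply; to mop up; to clear out; to wash away
9 react [riˈækt] vi. to respond, to have an effect; to resist, to counteract
10 infection [ɪnˈfekʃn] n. contagion, influence, infectious disease
11 administrator [ədˈmɪnɪstreɪtə(r)] n. manager, administrative official
12 occasionally [əˈkeɪʒnəli] adv. by chance; not often
13 mayor [meə(r)] n. mayor
14 consideration [kənˌsɪdəˈreɪʃn] n. thought, deliberation; a matter to be considered; thoughtfulness, concern
15 CEO [ˌsi:i:'əʊ] n. chief executive officer
16 secure [sɪˈkjʊə(r)] adj. free from worry, safe; firm, reliable; vt. to fasten, to obtain, to make safe; vi. (of crew at sea) to stop work; (of a ship) to anchor, to berth
17 pink [pɪŋk] adj. pink, peach-coloured; n. pink, peach colour
18 buck [bʌk] n. male deer, male rabbit; v. (of a horse) to leap off the ground
19 historic [hɪˈstɒrɪk] adj. famous in history, historically significant
20 poem [ˈpəʊɪm] n. poem; verse; a work that resembles a poem; something poetic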

现代汉语汉字频率表


0.1454 0.1436 0.1434 0.143 0.1409 0.1402 0.1389 0.1389 0.1388 0.1386 0.1384 0.1381 0.1362 0.1362 0.1357 0.1353 0.1339 0.131 0.1308 0.1277 0.1247 0.1243 0.1241 0.124 0.1234 0.1228 0.1228 0.1227 0.1207 0.1206 0.1204 0.1202 0.1199 0.1198 0.1196 0.1194 0.1179 0.1179 0.1177 0.1172 0.117 0.1164 0.1163 0.1154 0.1148 0.1144 0.1135 0.1133 0.1126 0.1113 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146
0.1997 0.1978 0.1968 0.1958 0.1958 0.1952 0.1936 0.1925 0.1892 0.1889 0.1887 0.1884 0.1841 0.1828 0.1808 0.1806 0.1793 0.1791 0.1783 0.1767 0.1736 0.1716 0.1708 0.1671 0.1662 0.1627 0.1626 0.1623 0.1615 0.1603 0.1601 0.16 0.1579 0.1577 0.1575 0.1574 0.1564 0.156 0.1556 0.1538 0.1536 0.1509 0.1506 0.1504 0.149 0.1489 0.1484 0.1484 0.1477 0.1475 0.1469

4、计算语言学资源-语料库


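Chinese corpora
Modern Chinese Literary Works Corpus (汉语现代文学作品语料库): 1979, Wuhan University, 5.27 million characters
Modern Chinese Corpus (现代汉语语料库): 1983, Beihang University, 20 million characters
Secondary School Chinese Textbook Corpus (中学语文教材语料库): 1983, Beijing Normal University, 1.06 million characters
Modern Chinese Word Frequency Corpus (现代汉语词频统计语料库): 1983, Beijing Language Institute, 1.82 million characters
Chinese corpora
Chinese LDC: funded by the National 973 Program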
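average about 2.1 meanings
Frequency rank 5000 (sqrt = 70.7): average about 3 meanings
Frequency rank 2000 (sqrt = 44.7): average about 4.6 meanings
Word frequency and word length
Word frequency and word length are inversely related
Short words are used often
"in", "of", ...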
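Corpus linguistics
Corpus research
Collection:
Gather large-scale authentic text and build balanced corpora.
Processing:
Annotate the corpus with linguistic information at each level of linguistic unit, for example the morphological, syntactic, semantic, pragmatic and discourse levels.
Annotation techniques: word segmentation, POS tagging, syntactic annotation, semantic annotation, etc.
Statistics:
Compute statistics over the corpus at each level of linguistic unit.
Modelling:
Build statistical models for the linguistic questions of interest from the corpus statistics.
The 100 most frequent words account for half of all word tokens
Half of the vocabulary occurs only once in the corpus
90% of word forms occur 10 times or fewer
12% of the running text consists of words that occur 3 times or fewer
It is hard to predict the behaviour of words that are rare, or that never appear in the corpus at all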
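The four distribution figures quoted above can be checked on any tokenized text with a few counts. The sketch below is only illustrative: the function names are hypothetical, and the toy sentence is far too small to reproduce the quoted percentages.

```python
from collections import Counter

def coverage_of_top(counts, n):
    """Share of all tokens accounted for by the n most frequent word types."""
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(n)) / total

def share_of_types_at_most(counts, k):
    """Share of word types that occur k times or fewer (k=1 gives the hapax share)."""
    return sum(1 for c in counts.values() if c <= k) / len(counts)

def share_of_tokens_from_rare(counts, k):
    """Share of running text made up of words that occur k times or fewer."""
    total = sum(counts.values())
    return sum(c for c in counts.values() if c <= k) / total

# Toy text; a real check would use a segmented corpus.
tokens = "the cat sat on the mat and the dog sat too".split()
counts = Counter(tokens)

print(coverage_of_top(counts, 2))
print(share_of_types_at_most(counts, 1))
print(share_of_tokens_from_rare(counts, 1))
```

On a large corpus, `coverage_of_top(counts, 100)`, `share_of_types_at_most(counts, 1)`, `share_of_types_at_most(counts, 10)` and `share_of_tokens_from_rare(counts, 3)` correspond to the four statements in that order.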
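Zipf's law
Principle of least effort
Speakers prefer: as few words as possible, with no punctuation or spaces
Listeners prefer: a larger vocabulary, with rich marking
Types of corpora
By depth of processing
Monolingual corpora
Segmented; POS-tagged; annotated with syntactic structure (treebank); annotated with semantic information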

语料库术语表


Absolute frequency Alignment (of parallel texts) Alphanumeric Annotate Annotation Annotation schemeANSI/American National Standards Institute 绝对频数(平行或对应)语料的对齐字母数字类的 标注(动词) 标注(名词) 标注方案美国国家标准学会 ASCII/American StandardInformation Exchange Associate (of keywords) AWL/Academicwordlist Balanced corpus Base list Bigram Bi-hapax Bilingual corpus CA/Contrastive Analysis Case-sensitive Chi-square ( χ 2) test ChunkCIA/Contrastive Interlanguage Analysis Codefor 美国信息交换标准码(主题词的)联想词学术词表 平衡语料库 底表、基础词表二元组、二元序列、二元结构两次词 双语语料库对比分析大小写敏感、区分大小写卡方检验 词块中介语对比分析 CLAWS/Constituent Likelihood AutomaticCLAWS 词性赋码系统 Word-tagging System Clean text policy Cluster Colligation Collocate n./v. Collocability Collocation Collocational strength Collocational framework/frame Comparable corpora ConcGram Concordance (line) Concordance plot Concordancer Concordancing Context Context word Contingency table Co-occurrence/Co-occurring CorporaCorpus Linguistics干净文本原则词簇、词丛类联接、类连接、类联结搭配词;搭配 搭配强度、搭配力搭配、词语搭配搭配强度 搭配框架类比语料库、可比语料库同现词列、框合结构 索引(行) (索引)词图索引工具 索引生成、索引分析语境、上下文 语境词连列表、联列表、列连表、列联表共现语料库(复数) 语料库语言学CorpusCorpus-basedCorpus-drivenCorpus-informedCo-select/Co-selection/Co-selectivenessCo-textDDL/Data Driven LearningDiachronic corpusDiscourseDiscourse prosodyDocumentationEAGLES/Expert Advisory Groups Language Engineering Standards Empirical LinguisticsEmpiricismEncodingError-taggingExtended unit of meaningFile-based search/concordancing Formulaic sequenceFrequencyGeneral (purpose) corpusGranularityHapax legomenon/hapaxHeader/Text headHMM/Hidden Markov ModelIdiom PrincipleIndex/IndexingIn-line annotationKey keywordKeynessKeywordKWIC/Key Word in ContextLearner corpusLemmaLemma listLemmataLemmatizationLemmatizerLexical bundleLexical densityLexical itemLexical priming语料库基于语料库的语料库驱动的语料库指导的、参考了语料库的共选(机制)共文数据驱动学习历时语料库话语、语篇话语韵律备检文件、文检报告on EAGLES 
文本规格实证语言学经验主义字符编码错误标注、错误赋码扩展意义单位批量检索程式化序列频数、频率通用语料库颗粒度一次词文本头、头标、头文件隐马尔科夫模型习语原则(建)索引文内标注、行内标注关键主题词主题性、关键性主题词语境中的关键词、语境共现(方式)学习者语料库词目、原形词、词元词形还原对应表词目、原形词、词元(复数)词形还原、词元化词形还原(词元化)工具词束词汇密度词项、词语项目词汇触发理论Lexical richnessLexico-grammar/Lexical grammar LexisLL/Log likelihood (ratio) Longitudinal/Developmental corpus Machine-readableMarkupMDA/Multi-dimensional approach MetadataMeta-metadata 词汇丰富度词汇语法词语、词项对数似然比、对数似然率跟踪语料库、发展语料库、历时语料库机读的标记、置标多维度分析法元信息元元信息MF/MD approach Mini-text Misuse (Multi-feature/Multi-dimensional) 多特征/多维度分析法微型文本误用Monitor corpusMonolingual corpus Multilingual corpusMultimodal corpusMWU/Multiword unitMWE/Multiword expressionMI/Mutual informationN-gramNLP/Natural Language Processing NodeNormalizationNormalized frequency Observed corpusOntologyOpen Choice PrincipleOveruseParadigmaticParallel corpusParole linguisticsParsed corpusParserParsingPattern/patterningPattern grammarPedagogic corpusPhraseologyPOSgramPOS tagging/Part-of-Speech tagging (动态)监察语料库单语语料库多语语料库多模态语料库多词单位多词单位互信息、互现信息N 元组、N 元序列、N 元结构、N 元词、多词序列自然语言处理节点(词)标准化标准化频率、标称频率、归一频率观察语料库知识本体、本体开放选择原则超用、过多使用、使用过度、过度使用纵聚合(关系)的平行语料库、对应语料库言语语言学句法标注的语料库句法分析器句法分析型式型式语法教学语料库短语、短语学赋码序列、码串词性赋码、词性标注、词性附码POS taggerPrefabProbabilisticProbabilityRationalismRaw text/Raw corpusReference corpusRegex/RE/RegExp/Regular Expressions Register variationRelative frequency Representative/RepresentativenessRule-basedSample n./v.SamplingSearch termSearch wordSegmentationSemantic preferenceSemantic prosody 词性赋码器、词性赋码工具预制语块(基于)概率的、概率性的、盖然的概率理性主义生文本(语料)参照语料库正则表达式语域变异相对频率代表性(的)基于规则的样本;取样、采样、抽样取样、采样、抽样检索项检索词切分、分词语义倾向语义韵SGML/Standard Language SkipgramSpanSpecial purpose corpus Specialized corpus Generalized Markup 标准通用标记语言跨词序列、跨词结构跨距专用语料库、专门用途语料库、专题语料库专用语料库Standardized TTR/Standardized type-token 标准化类符/形符比、标准化类/形比、标准化ratio 型次比Stand-off annotation Stop listStop word Synchronic corpus SyntagmaticTagTaggerTaggingTag sequenceTagsetTextTEI/Text Encoding I nitiative The L exical A pproachThe Lexical Syllabus TokenToken definition 
分离式标注停用词表、过滤词表停用词、过滤词共时语料库横组合(关系)的标记、码、标注码赋码器、赋码工具、标注工具赋码、标注、附码赋码序列、码串赋码集、码集文本文本编码计划词汇中心教学法词汇大纲形符、词次形符界定、单词界定TokenizationTokenizerTranscriptionTranslational corpusTreebankTrigramT-scoreTypeTTR/Type-tokenratioUnderuseUnicodeUnit of meaningWaC/Web as CorpusWildcardWord definitionWord formWord familyWord listXML/EXtensible Markup Language Zipf's LawZ-score 分词分词工具转写翻译语料库树库三元组、三元序列、三元结构T 值类符、词型类符/形符比、类/形比、型次比少用、使用不足通用码意义单位网络语料库通配符单词界定词形词族词表可扩展标记语言齐夫定律Z 值。

国家语委现代汉语通用平衡语料库


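State Language Commission General Balanced Corpus of Modern Chinese (国家语委现代汉语通用平衡语料库): annotated corpus data and usage notes

1. The State Language Commission General Balanced Corpus of Modern Chinese
1.1 The full corpus
The full corpus contains about 100 million characters. About 70 million characters date from before 1997 and were all keyed in by hand from printed sources; about 30 million characters date from after 1997, half keyed in by hand and half taken from electronic text.

The generality and balance of the corpus are achieved by distributing the samples widely and controlling their proportions.

The distribution of the corpus by category is shown in a figure that is not reproduced here.

1.2 The annotated corpus
The annotated corpus is a subset of the full corpus and contains about 50 million characters.

Annotation here means word segmentation and part-of-speech tagging. The annotation has been proofread by hand three times and its accuracy is above 98%.

The full corpus was built by balanced sampling according to selection principles designed in advance, in order to be as representative as possible.

The annotated corpus approximates the full corpus in its sample distribution and preserves the balance of the selection.

The category distribution of the annotated corpus, and its comparison with the full corpus, are shown in figures that are not reproduced here (blue curve: full corpus; red curve: annotated corpus).

2. Text selection and sample distribution
2.1 Selection principles
The material is classified by content roughly as follows (the character counts below are the figures at the time the corpus was built).

2.1.1 Textbooks
Textbooks for universities, secondary schools and primary schools form a category of their own, about 20 million characters.

2.1.2 Humanities and social sciences
This material makes up about 60% of the corpus, 30 million characters in total, and covers:
· politics and law (including philosophy, politics, religion, law, etc.);
· history (including ethnic groups, etc.);
· society (including sociology, psychology, language, education, literary and art theory, journalism, folklore studies, etc.);
· economics;
· the arts (including music, fine art, dance, drama, etc.);
· literature (including spoken language);
· military affairs and sport;
· daily life (including popular reading on clothing, food, housing, transport, etc.).

2.1.3 Natural sciences (including agriculture, medicine, engineering and technology)
This material should cover every field in which these sciences have developed.

It is to be drawn from university, secondary and primary school textbooks and from popular science reading.

Popular science reading makes up about 6%, 3 million characters in total.

Textbook characters are counted separately.

2.1.4 Newspapers and periodicals
The focus is on newspapers and general-interest periodicals formally published after 1949 by national, provincial and municipal bodies and by ministries and commissions, with some coverage of newspapers and general-interest periodicals from before 1949.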


101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154
词语
出现次数 744863 130191 118823 118527 83958 81119 65146 53556 52912 52728 47908 46965 44947 42332 41116 40849 38084 35429 34323 33991 31512 30936 30123 29749 29265 29039 28769 28404 28038 26823 25715 24807 23823 23749 22029 21744 21148 21041 20907 20210 19915 19539 18963 18950 18805 18698
155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 却 主要 再 由于 我国 最 关系 作用 不同 中国 才 人们 出 但是 现在 则 需要 所以 因此 如果 已经 一定 们 各 重要 象 一些 情况 吧 二 次 月 便 知道 时候 做 必须 成 人民 四 走 出来 活动 同 方面 条 高 吗
8956 8730 8684 8657 8632 8443 8375 8352 8200 8074 7980 7961 7954 7924 7851 7791 7780 7706 7687 7678 7642 7620 7610 7519 7416 7375 7362 7317 7290 7275 7135 7102 7071 6923 6877 6838 6837 6828 6790 6773 6721 6708 6703 6690 6669 6608 6594 6574 6569 6561 6523 6511 6508 6508
科学 也是 即 条件 天 许多 通过 思想 发生 叫 为了 老 过程 比 起 而且 影响 方法 要求 内 技术 点 一般 较 让 具有 形成 对于 日 事 时间 认为 还是 真 长 世界 只有 以后 教育 它们 同时 性 表现 产生 出现 企业 社会主义 作 者 没 各种 之间 家 组织
6505 6473 6439 6427 6401 6389 6387 6370 6317 6289 6285 6236 6231 6229 6218 6116 6103 6098 6097 6025 5978 5958 5944 5936 5893 5870 5853 5846 5809 5775 5761 5720 5720 5709 5702 5697 5668 5651 5648 5634 5626 5575 5572 5563 5544 5489 5480 5480 5475 5449 5419 5380 5361 5334
0.0681 0.0677 0.0674 0.0673 0.067 0.0669 0.0668 0.0667 0.0661 0.0658 0.0658 0.0653 0.0652 0.0652 0.0651 0.064 0.0639 0.0638 0.0638 0.063 0.0626 0.0623 0.0622 0.0621 0.0617 0.0614 0.0612 0.0612 0.0608 0.0604 0.0603 0.0599 0.0599 0.0597 0.0597 0.0596 0.0593 0.0591 0.0591 0.059 0.0589 0.0583 0.0583 0.0582 0.058 0.0574 0.0573 0.0573 0.0573 0.057 0.0567 0.0563 0.0561 0.0558
26.7757 26.9647 27.1508 27.3322 27.5128 27.6928 27.8726 28.0516 28.2271 28.4022 28.5756 28.7426 28.8974 29.0492 29.2002 29.351 29.498 29.6436 29.7876 29.9311 30.0737 30.2159 30.3542 30.4924 30.6284 30.7609 30.8913 31.0184 31.1438 31.266 31.3875 31.5074 31.6268 31.7428 31.8569 31.9671 32.077 32.1861 32.2938 32.4006 32.5073 32.6134 32.7185 32.8235 32.928 33.0324 33.1324 33.2321 33.3306 33.4281 33.5253 33.6211 33.7167 33.8106
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
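# Word frequency table of the Modern Chinese Corpus (现代汉语语料库词语频率表)
# Corpus size: 20 million characters
# Only words that occur more than 50 times are listed
# Downloaded from the Corpus Online (语料库在线) website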
id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 的 了 在 是 和 一 这 有 他 我 也 不 就 地 着 中 上 说 都 人 个 对 种 把 为 要 你 而 来 我们 又 一个 与 从 年 到 还 它 大 等 她 两 去 没有 里 得
0.191 0.189 0.1861 0.1814 0.1806 0.18 0.1798 0.179 0.1755 0.1751 0.1734 0.167 0.1548 0.1518 0.151 0.1508 0.147 0.1456 0.144 0.1435 0.1426 0.1422 0.1383 0.1382 0.136 0.1325 0.1304 0.1271 0.1254 0.1222 0.1215 0.1199 0.1194 0.116 0.1141 0.1102 0.1099 0.1091 0.1077 0.1068 0.1067 0.1061 0.1051 0.105 0.1045 0.1044 0.1 0.0997 0.0985 0.0975 0.0972 0.0958 0.0956 0.0939
频率(%) 累积频率(%) 7.7946 7.7946 1.3624 9.157 1.2434 10.4004 1.2403 11.6407 0.8786 12.5193 0.8489 13.3682 0.6817 14.0499 0.5604 14.6103 0.5537 15.164 0.5518 15.7158 0.5013 16.2171 0.4915 16.7086 0.4703 17.1789 0.443 17.6219 0.4303 18.0522 0.4275 18.4797 0.3985 18.8782 0.3707 19.2489 0.3592 19.6081 0.3557 19.9638 0.3298 20.2936 0.3237 20.6173 0.3152 20.9325 0.3113 21.2438 0.3062 21.55 0.3039 21.8539 0.3011 22.155 0.2972 22.4522 0.2934 22.7456 0.2807 23.0263 0.2691 23.2954 0.2596 23.555 0.2493 23.8043 0.2485 24.0528 0.2305 24.2833 0.2275 24.5108 0.2213 24.7321 0.2202 24.9523 0.2188 25.1711 0.2115 25.3826 0.2084 25.591 0.2045 25.7955 0.1984 25.9939 0.1983 26.1922 0.1968 26.389 0.1957 26.5847
0.0937 0.0914 0.0909 0.0906 0.0903 0.0884 0.0876 0.0874 0.0858 0.0845 0.0835 0.0833 0.0832 0.0829 0.0822 0.0815 0.0814 0.0806 0.0804 0.0803 0.08 0.0797 0.0796 0.0787 0.0776 0.0772 0.077 0.0766 0.0763 0.0761 0.0747 0.0743 0.074 0.0724 0.072 0.0716 0.0715 0.0715 0.0711 0.0709 0.0703 0.0702 0.0701 0.07 0.0698 0.0691 0.069 0.0688 0.0687 0.0687 0.0683 0.0681 0.0681 0.0681
33.9043 33.9957 34.0866 34.1772 34.2675 34.3559 34.4435 34.5309 34.6167 34.7012 34.7847 34.868 34.9512 35.0341 35.1163 35.1978 35.2792 35.3598 35.4402 35.5205 35.6005 35.6802 35.7598 35.8385 35.9161 35.9933 36.0703 36.1469 36.2232 36.2993 36.374 36.4483 36.5223 36.5947 36.6667 36.7383 36.8098 36.8813 36.9524 37.0233 37.0936 37.1638 37.2339 37.3039 37.3737 37.4428 37.5118 37.5806 37.6493 37.718 37.7863 37.8544 37.9225 37.9906
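The "frequency (%)" and "cumulative frequency (%)" columns of a table like this can be derived from the raw counts as sketched below. This is a minimal sketch, not the site's own code: the function name is hypothetical, the pairing of the first three words with the first three counts is read off the flattened table, and `ASSUMED_TOTAL` is an assumption back-computed from the first row (744863 occurrences reported as 7.7946%) because the source does not give the token total.

```python
def frequency_table(counts, total_tokens):
    """Return (rank, word, count, percent, cumulative percent) rows, most frequent first."""
    rows = []
    cumulative = 0.0
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for rank, (word, n) in enumerate(ordered, start=1):
        percent = 100.0 * n / total_tokens
        cumulative += percent
        rows.append((rank, word, n, round(percent, 4), round(cumulative, 4)))
    return rows

# First three entries as read from the table above.
top_counts = {"的": 744863, "了": 130191, "在": 118823}

# ASSUMPTION: total number of word tokens, back-computed from the first row.
ASSUMED_TOTAL = 9_556_141

for row in frequency_table(top_counts, ASSUMED_TOTAL):
    print(*row)
```

Passing the total separately, instead of summing the listed counts, matters here because the table only includes words that occur more than 50 times, so the listed counts do not add up to the whole corpus.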