WordNet. An electronic lexical database.
A Sentence Retrieval Method Combining WordNet and Word Embeddings
Journal of Information Engineering University, Vol. 18, No. 4, Aug. 2017. DOI: 10.3969/j.issn.1671-0673.2017.04.020.
LIU Xin(1), XI Yaoyi(2), WANG Bo(1), WEI Han(1) (1. Information Engineering University, Zhengzhou 450001, China; 2. PLA University of Foreign Languages, Luoyang 471000, China)

Abstract: To address the vocabulary mismatch problem that data sparsity causes in current sentence retrieval methods, this paper proposes a sentence retrieval method that combines WordNet with word embeddings. First, the personalized PageRank algorithm is run over the WordNet semantic relation graph to compute the set of synonyms most relevant to the query terms; this expands the query and partially relieves the data sparsity of queries. Next, queries and sentences are represented with word embeddings obtained by training a neural network language model (the Continuous Skip-gram model) on a large-scale corpus. Finally, Word Mover's Distance (WMD) is introduced to compute the semantic similarity between query and sentence, using semantic information to further reduce the impact of vocabulary mismatch; sentences are ranked by similarity from high to low as the retrieval result. Evaluated on data from TREC 2003 and TREC 2004, the method improves MAP and R-Precision by 13.29% and 13.54% respectively over the second-best result.

Keywords: WordNet; query expansion; word embedding; semantic similarity; sentence retrieval
CLC number: TP391.1; Document code: A; Article ID: 1671-0673(2017)04-0486-06

Generally speaking, sentence retrieval is the task of finding sentences relevant to a given query topic across multiple documents. In systems such as question answering (QA), summarization, novelty detection, and information provenance, the sentence retrieval module serves as a text preprocessing step, and the performance of these systems is closely tied to that of the sentence retrieval module.

Received: 2016-04-08; revised: 2016-05-12. Supported by the National Social Science Foundation of China (14BXW028). First author: LIU Xin (1990-), male, master's student; research interest: natural language processing; e-mail: syliuxinlyl@163.com
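For illustration, the pipeline the abstract describes, query expansion by personalized PageRank over a WordNet graph followed by WMD ranking, can be sketched with off-the-shelf tools. This is a minimal sketch, not the authors' implementation: it uses NLTK's WordNet, networkx's personalized PageRank, and gensim's wmdistance; the graph is restricted to hypernym/hyponym edges for brevity, and "vectors.bin" is a placeholder name for any pretrained word2vec file.

```python
import networkx as nx
from nltk.corpus import wordnet as wn
from gensim.models import KeyedVectors

def expand_query(terms, top_k=5):
    # Build a small synset graph around the query terms and their
    # immediate hypernym/hyponym neighbors.
    g = nx.Graph()
    seeds = {}
    for term in terms:
        for s in wn.synsets(term):
            g.add_node(s.name())
            seeds[s.name()] = 1.0
            for h in s.hypernyms() + s.hyponyms():
                g.add_edge(s.name(), h.name())
    if g.number_of_nodes() == 0:
        return list(terms)
    # Personalized PageRank biased toward the query synsets.
    ranks = nx.pagerank(g, personalization=seeds)
    best = sorted(ranks, key=ranks.get, reverse=True)[:top_k]
    expansion = {l.name().replace("_", " ")
                 for name in best for l in wn.synset(name).lemmas()}
    return list(set(terms) | expansion)

# Hypothetical pretrained embeddings file (placeholder name).
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def rank_sentences(query_terms, sentences):
    # Lower Word Mover's Distance = semantically closer, so sort ascending.
    q = expand_query(query_terms)
    return sorted(sentences, key=lambda s: vectors.wmdistance(q, s.split()))
```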
High School English, Scientific Paper Translation: 40 Multiple-Choice Questions with Answers and Explanations
1. The term "artificial intelligence" is often abbreviated as _____.
A. AI  B. IA  C. AII  D. IIA
Answer: A. The standard abbreviation of "artificial intelligence" is "AI". Option B, "IA", is not an abbreviation of this phrase, and options C and D are both incorrect combinations.
2. In a scientific research paper, "data analysis" can be translated as _____.
A. 数据分析  B. 数据研究  C. 数据调查  D. 数据理论
Answer: A. The correct translation of "data analysis" is 数据分析. Option B, 数据研究, corresponds to "data research"; option C, 数据调查, to "data survey"; option D, 数据理论, to "data theory".
3. The phrase "experimental results" is best translated as _____.
A. 实验结论  B. 实验数据  C. 实验结果  D. 实验过程
Answer: C. The accurate translation of "experimental results" is 实验结果. Option A, 实验结论, means "experimental conclusion"; option B, 实验数据, "experimental data"; option D, 实验过程, "experimental process".
4. "Quantum mechanics" is translated as _____.
A. 量子物理  B. 量子力学  C. 量子化学  D. 量子技术
Answer: B. "Quantum mechanics" is 量子力学. Option A, 量子物理, is "quantum physics"; option C, 量子化学, "quantum chemistry"; option D, 量子技术, "quantum technology".
Lexicology
Lexicology (from lexiko-, in the Late Greek lexikon) is the part of linguistics that studies words: their nature and meaning, their elements, the relations between words (semantic relations), word groups, and the whole lexicon. The term first appeared in the 1820s, though there were lexicologists in essence before the term was coined. Computational lexicology, a related field (in the same way that computational linguistics is related to linguistics), deals with the computational study of dictionaries and their contents. An allied science is lexicography, which also studies words in relation to dictionaries; it is concerned with the inclusion of words in dictionaries and, from that perspective, with the whole lexicon. Lexicography is therefore the theory and practice of composing dictionaries. Lexicography is sometimes considered a part or branch of lexicology, but the two disciplines should not be confused: lexicographers, the people who write dictionaries, are at the same time lexicologists, but not all lexicologists are lexicographers. It is said that lexicography is practical lexicology: it is practically oriented, though it has its own theory, while pure lexicology is mainly theoretical.

Lexical semantics

Domain

Semantic relations between words manifest themselves in homonymy, antonymy, paronymy, etc. The semantics usually involved in lexicological work is called lexical semantics. Lexical semantics differs somewhat from other linguistic types of semantics, such as phrase semantics, sentence semantics, and text semantics, which take the notion of meaning in a much broader sense. There are also types of semantics outside (though sometimes related to) linguistics, such as cultural semantics and computational semantics; the latter is related not to computational lexicology but to mathematical logic. Among the semantics of language, lexical semantics is the most robust, and to some extent phrase semantics as well, while the other types of linguistic semantics are new and not thoroughly examined.

History

Lexical semantics cannot be understood without a brief exploration of its history.

Prestructuralist semantics

Semantics as a linguistic discipline began in the middle of the 19th century, and because linguistics at the time was predominantly diachronic, lexical semantics was diachronic too; this orientation dominated the scene between 1870 and 1930.[1] Diachronic lexical semantics was interested above all in the change of meaning, with a predominantly semasiological approach, taking the notion of meaning in a psychological aspect: lexical meanings were considered psychological entities (thoughts and ideas), and meaning changes were explained as resulting from psychological processes.

Structuralist and neostructuralist semantics

With the rise of new ideas after the groundbreaking work of Saussure, prestructuralist diachronic semantics was considerably criticized for its atomic study of words, its diachronic approach, and its mingling of nonlinguistic spheres of investigation.
The study became synchronic, concerned with semantic structures and narrowly linguistic. Semantic structural relations between lexical entities can be seen in three ways:
▪ semantic similarity
▪ lexical relations such as synonymy, antonymy, and hyponymy
▪ syntagmatic lexical relations

When structuralist lexical semantics was revived by the neostructuralists, not much new work was done; this is actually admitted by its followers. WordNet "is a type of an online electronic lexical database organized on relational principles, which now comprises nearly 100,000 concepts", as Dirk Geeraerts[2] states it.

Chomskyan school

Followers of the Chomskyan generative approach to grammar soon investigated two different types of semantics, which clashed in an effusive debate[3]: interpretative and generative semantics.

Cognitive semantics

Cognitive lexical semantics is thought to be the most productive of the current approaches.

Phraseology

Another branch of lexicology, together with lexicography, is phraseology. It studies compound meanings of two or more words, as in "raining cats and dogs". Because the meaning of the whole phrase is much different from the meanings of the words taken alone, phraseology examines how and why such meanings come into everyday use, and what laws possibly govern these word combinations. Phraseology also investigates idioms.

Etymology

Since lexicology studies the meaning of words and their semantic relations, it often explores the origin and history of a word, i.e. its etymology. Etymologists analyse related languages using a technique known as the comparative method. In this way, word roots have been found that can be traced all the way back to the origin of, for instance, the Indo-European language family. However, the comparative method is unhelpful in cases of "multiple causation"[4], when a word derives from several sources simultaneously, as in phono-semantic matching.[5] Etymology can be helpful in clarifying some questionable meanings, spellings, etc., and is also used in lexicography; for example, etymological dictionaries provide words with their historical origins, changes, and development.

Lexicography

A good example of lexicology at work that everyone is familiar with is the dictionary and the thesaurus. Dictionaries are books or computer programs (or databases) that represent lexicographical work; they are opened and purposed for the use of the public. As there are many different types of dictionaries, there are many different types of lexicographers. Questions that lexicographers are concerned with include the difficulty of defining what simple words such as 'the' mean, how compound or complex words or words with many meanings can be clearly explained, and which words to keep in and which to leave out of a dictionary.

Noted lexicographers

Some noted lexicographers include:
▪ Dr. Samuel Johnson (September 18, 1709 - December 13, 1784)
▪ French lexicographer Pierre Larousse (October 23, 1817 - January 3, 1875)
▪ Noah Webster (October 16, 1758 - May 28, 1843)
▪ Russian lexicographer Vladimir Dal (November 10, 1801 - September 22, 1872)

Lexicologists
▪ Damaso Alonso (October 22, 1898 - ), Spanish literary critic and lexicologist
▪ Roland Barthes (November 12, 1915 - March 25, 1980), French writer, critic, and lexicologist
Bibliography
▪ Lexicology/Lexikologie: International Handbook on the Nature and Structure of Words and Vocabulary / Ein Internationales Handbuch zur Natur und Struktur von Wörtern und Wortschätzen, Vol. 1 & Vol. 2 (eds. A. Cruse et al.)
▪ Words, Meaning, and Vocabulary: An Introduction to Modern English Lexicology (ed. H. Jackson); ISBN 0-304-70396-6
▪ Toward a Functional Lexicology (ed. G. Wotjak); ISBN 0-8204-3526-0
▪ Lexicology, Semantics, and Lexicography (ed. J. Coleman); ISBN 1-55619-972-4
▪ English Lexicology: Lexical Structure, Word Semantics & Word-formation (Leonhard Lipka); ISBN 9783823349952
▪ Outline of English Lexicology (Leonhard Lipka); ISBN 3484410035

References
1. Dirk Geeraerts, The theoretical and descriptive development of lexical semantics, Prestructuralist semantics. In: The Lexicon in Focus. Competition and Convergence in Current Lexicology, eds. Leila Behrens and Dietmar Zaefferer, pp. 23-42.
2. Dirk Geeraerts, The theoretical and descriptive development of lexical semantics, Structuralist and neostructuralist semantics. In: The Lexicon in Focus. Competition and Convergence in Current Lexicology, eds. Leila Behrens and Dietmar Zaefferer, pp. 23-42.
3. Harris, Randy Allen (1993). The Linguistics Wars. Oxford, New York: Oxford University Press.
4. Zuckermann, Ghil'ad (2009). Hybridity versus Revivability: Multiple Causation, Forms and Patterns. Journal of Language Contact, Varia 2: 40-67.
5. Zuckermann, Ghil'ad (2003). Language Contact and Lexical Enrichment in Israeli Hebrew. Houndmills: Palgrave Macmillan (Palgrave Studies in Language History and Language Change, series editor: Charles Jones). ISBN 1-4039-1723-X.
Zhengzhou University Distance Education, "Information Retrieval 03": Answers
I. Single-choice questions (the option texts did not survive in this copy; only the stems remain):
1. 《全国报刊索引》 (the National Index of Newspapers and Periodicals) is ( ).
2. 《中国博士学位论文提要》 (Abstracts of Chinese Doctoral Dissertations) is ( ).
3. In China, the term of validity of a utility model patent, counted from the filing date, is ( ).
4. Among the following patent specification numbers, the one that denotes a utility model patent is ( ).

II. Multiple-choice questions:
2. (stem missing) ( )
A. classification approach
B. subject approach
C. author-affiliation approach
D. author approach
3. The retrieval approaches provided by 《中国专利索引》 (the China Patent Index) include ( ).
A. classification approach
B. patentee approach
C. subject approach
D. patent serial number approach
4. Under China's Patent Law, an invention must possess ( ) to be granted a patent.
A. novelty
B. conformity to standards
C. practical utility
D. inventiveness
5. By scope of application, China's standards are divided into ( ).
A. national standards
B. industry standards
C. local standards
D. enterprise standards
III. True/false questions (1 point each, 5 questions, 5 points in total):
1. 《全国报刊索引》 is a comprehensive retrieval tool. (True / False)
2. 《国外科技文献检索工具简介》 (A Brief Introduction to Foreign Scientific and Technical Retrieval Tools) is a guide to the literature. (True / False)
3. Within a patent's term of validity, others may not use the patent without the patentee's permission. (True / False)
4. In China, the term of validity of a utility model patent is counted from the filing date. (True / False)
5. A new technical solution proposed for the shape of a product belongs to invention patents. (True / False)
WORD ASSOCIATION NORMS, MUTUAL INFORMATION, AND LEXICOGRAPHY
The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond quicker than normal to the word nurse if it follows a highly associated word such as doctor.) We will extend the term to provide the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the doctor/nurse type (content word/content word) to lexico-syntactic co-occurrence constraints between verbs and prepositions (content word/function word). This paper will propose an objective measure, based on the information theoretic notion of mutual information, for estimating word association norms from computer readable corpora. (The standard method of obtaining word association norms, testing a few thousand subjects on a few hundred words, is both costly and unreliable.) The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.
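Concretely, the association ratio proposed in the paper is an information-theoretic score, I(x, y) = log2( P(x, y) / (P(x) P(y)) ), with the joint probability estimated from counts of y following x within a small window. A rough sketch of such an estimator follows; the window size and the flat token list are illustrative choices, not the paper's exact settings:

```python
import math
from collections import Counter

def association_ratio(tokens, x, y, window=5):
    """Estimate I(x, y) = log2( P(x, y) / (P(x) P(y)) ) from a token list,
    counting occurrences of y within `window` tokens after x."""
    n = len(tokens)
    unigrams = Counter(tokens)
    pair = 0
    for i, tok in enumerate(tokens):
        if tok == x and y in tokens[i + 1:i + 1 + window]:
            pair += 1
    if pair == 0 or unigrams[x] == 0 or unigrams[y] == 0:
        return float("-inf")  # never observed together
    return math.log2((pair / n) / ((unigrams[x] / n) * (unigrams[y] / n)))

corpus = "the doctor asked the nurse to call a doctor and a nurse".split()
print(association_ratio(corpus, "doctor", "nurse"))
```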
Jiangnan University, Modern Distance Education: Stage 1 Exercises
Course: Information Retrieval and Utilization, Chapters 1-3 (100 points in total). Learning center (teaching site):  Batch:  Level:  Major:  Student ID:  ID number:  Name:  Score:

I. Single-choice questions (2 points each, 20 points in total)
1. Information that has been processed by humans and has use value or potential use value is called ( ).
A. knowledge  B. documents  C. intelligence  D. information resources
2. Among the following resources, ( ) is not a tertiary information resource.
A. encyclopedia  B. index  C. handbook  D. guide
3. The Chinese Library Classification has five major divisions, expanded into ( ) main classes.
A. 25  B. 10  C. 22  D. 16
4. The international standard serial number for periodicals is abbreviated ( ).
A. ISSN  B. ISDS  C. ISO  D. ISBN
5. In the Google search box, separating several words with spaces expresses a ( ) relation.
A. logical OR  B. logical NOT  C. logical AND  D. phrase
6. The operator commonly used for phrase searching is ( ).
A. ()  B. *  C. ?  D. ""
7. The main difference between catalogues and title entries lies in their ( ).
A. object of description  B. method of description  C. format of description  D. carrier of description
8. To search Baidu for information on 能源 (energy) while excluding 核能 (nuclear energy), use the query ( ).
A. 能源|核能  B. 能源 -核能  C. 能源+核能  D. 能源核能
9. The ( ) service delivers documents to users mainly by photocopying or scanning the originals and then mailing, faxing, or e-mailing them.
A. resource sharing  B. document delivery  C. OPAC  D. one-stop search
10. ( ) is the percentage of relevant documents retrieved out of all relevant documents in the retrieval system.
A. precision  B. miss rate  C. recall  D. false-drop rate

II. Multiple-choice questions (2 points each, 20 points in total)
1. By carrier type, information resources can be divided into ( ).
A. paper resources  B. microform resources  C. audio-visual resources  D. electronic resources
2. Among the following documents, ( ) are special documents.
A. technical reports  B. product samples  C. conference proceedings  D. book series
3. Among the following information resources, ( ) are not secondary information resources.
A. catalogues  B. reviews  C. textbooks  D. indexes
4. Retrieval methods generally include the ( ) methods.
A. tool  B. citation  C. subject  D. cyclic
5. Among the following search engines, ( ) are full-text search engines.
A. Baidu  B. Google  C. Yahoo!  D. InfoSpace
6. Factors affecting a user's recall include ( )
2007 Exam Paper (B): Answer Key

Reference answers for the 2004 cohort, Medical Information major, Literature and Information Retrieval exam (Paper B)

I. Define the following terms (24 points, 4 points each)
1. Search engine: the term has a broad and a narrow sense (1 pt). In the broad sense, a search engine is any tool or system on the network that provides information retrieval services, also called a network retrieval tool, analogous to the retrieval tools of the print era (1.5 pts). In the narrow sense, it refers to information service systems that use automatic web-crawling software (robots) to collect and organize Internet (chiefly WWW) resources and provide retrieval over them (1.5 pts).
2. PQDD: ProQuest Digital Dissertations, provided by UMI in the United States (1 pt). Features: the world's largest dissertation database, multidisciplinary (1 pt); covers doctoral and master's theses from more than 1,000 universities in Europe and America (1 pt); for theses from 1997 onward the first 24 pages of full text are viewable; offers basic and advanced search modes (1 pt).
3. PubMed: a retrieval system provided by the US NCBI (1 pt); searches MEDLINE, OLDMEDLINE, PREMEDLINE, and publisher-supplied literature (1 pt); supports automatic term mapping, exact phrase matching, and topic tracking (1 pt); provides simple, auxiliary, and special-purpose search (1 pt).
4. Technical report: short for scientific and technical report (1 pt). By the definition in the national standard (GB 7713-87), a technical report describes the results or progress of a scientific or technological investigation, or the results of a technical development, test, and evaluation (1.5 pts), or discusses the current state and development of a scientific or technological problem (1.5 pts).
5. CrossSearch: cross-database searching (1 pt). It allows several databases to be searched at once through a single interface (1.5 pts), returning deduplicated results (1.5 pts).
6. Encyclopedia: Encyclopaedia (1 pt); a reference work that gathers knowledge of all branches (1 pt), presented in separate articles (1 pt), with bibliographies, arranged in dictionary form (1 pt).

II. Fill in the blanks (16 points)
1. PubMed's auxiliary search area provides five functions: Limits, History, Details, Preview/Index, and Clipboard.
2. The EMBASE thesaurus provides 77 link terms, of which DRUG THERAPY is both a disease link term and a drug link term.
Zhuji High School (Zhejiang Province), 2017-2018 School Year, First Semester, Stage 2 Exam: Grade 10 Information Technology (with answers) (January 2018)
Zhuji High School, 2017-2018 first semester, Stage 2 exam, Grade 10 Information Technology, January 2018. (All answers go on the answer sheet.)

I. Multiple-choice questions (9 questions, 18 points)
1. Which statement about information and information technology is correct? ( )
A. A book is not information, but its text is information
B. Information depends on carriers, but a small amount of information has no carrier
C. What most distinguishes information from matter and energy is its shareability
D. Since electronic computers appeared only in modern times, there was no information technology in ancient times
2. Xiao Wang opens the Baidu home page in the IE browser. Which statement is incorrect? ( )
A. The page is sent and received via the file transfer protocol (FTP)
B. Web pages follow the HTML standard and can be opened and edited in Notepad
C. To save only the text of the page, the save type "Text (*.txt)" can be chosen
D. Bookmarking the site saves the address of the Baidu home page
3. Which of the following falls within the scope of artificial intelligence applications? ( )
A. A subway station using an X-ray machine to scan passengers' luggage
B. A map app automatically updating its data where Wi-Fi is available
C. A highway ETC lane automatically recognizing license plates for tolling
D. A doctor using ultrasound to examine a patient
4. The figure shows part of the UltraEdit interface for inspecting character codes. Which statement is correct? ( )
A. Storing the character "℃" requires 1 byte
B. The colon (:) after 气温 is encoded in ASCII
C. The internal code of the characters "38" in binary is 00111000
D. The internal code of the symbol (~) in hexadecimal is A1AB
5. The figure shows a data table in an Access database. Which statement is correct? ( )
A. Executing "New record" in the current state would place the new record in row 3
B. In the current datasheet view, the "gender" field name cannot be modified
C. If a field's data type is AutoNumber, new values are not necessarily increasing
D. The database is named student.accdb
6. A document is being edited in Word, with part of the interface shown in the figure. Which statement is correct? ( )
A. The picture's text-wrapping style is "In line with text"
B. Users t1 and s2 added comments
C. The document contains two revisions and two comments
D. After rejecting all revisions, the text begins "记者从今天浙江省政府新闻信息办召开的…"
7. An audio file is opened in GoldWave, with the editing interface shown in the figure. Which statement is incorrect? ( )
A. The audio's sampling rate is 44.1 kHz and its bit rate is 1411 kbps
B. Inserting 10 seconds of silence and saving with the current parameters increases the file size by 1/6
C. Executing "Trim" in the current state and saving with the current parameters leaves the file size unchanged
D. Executing "Delete" in the current state and saving with the current parameters gives a file of about 8 MB
8. Given the properties of an uncompressed image file and a video file shown in the figures, the ratio of the image file's storage size to the video file's is about ( )
A. 9:15  B. 1:25  C. 9:25  D. 18:25
9. A Rubik's cube has 6 different colors. Encoding any one face of the cube in binary requires at least how many bits? ( )
A. 54  B. 36  C. 27  D. 18

II. Non-choice questions (4 questions, 32 points in total)
1. Xiao Wang collected the May 2017 US SUV sales ranking data and processed it in Excel, as shown in the figure.
Usage of WordNet in natural language generation
Abstract
WordNet has rarely been applied to natural language generation, despite its wide application in other fields. In this paper, we address three issues in the usage of WordNet in generation: adapting a general lexicon like WordNet to a specific application domain, how the information in WordNet can be used in generation, and augmenting WordNet with other types of knowledge that are helpful for generation. We propose a three-step procedure to tailor WordNet to a specific domain, and we carried out experiments on a basketball corpus (1,015 game reports, 1.7 MB).
Usage of WordNet in Natural Language Generation
Hongyan Jing, Department of Computer Science, Columbia University, New York, NY 10027, USA
hjing@
Secondly, as a semantic net linked by lexical relations, WordNet can be used for lexicalization in generation. Lexicalization maps the semantic concepts to be conveyed to appropriate words. Usually this is achieved by step-wise refinement based on syntactic, semantic, and pragmatic constraints while traversing a semantic net (Danlos, 1987). Currently most generation systems acquire their semantic net for lexicalization by building their own, while WordNet provides the possibility of acquiring such knowledge automatically from an existing resource.
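As a toy illustration of such traversal-based lexical choice over WordNet (using NLTK; the "prefer" substring cue merely stands in for the syntactic, semantic, and pragmatic constraints a real generator would apply):

```python
from nltk.corpus import wordnet as wn

def lexicalize(synset, prefer=None):
    """Pick a word for a concept: inspect the synset and its direct
    hyponyms, refining the choice when a constraint cue is given."""
    candidates = [synset] + synset.hyponyms()
    if prefer:
        for s in candidates:
            for lemma in s.lemmas():
                if prefer in lemma.name():
                    return lemma.name().replace("_", " ")
    # Default: the first lemma of the concept itself.
    return synset.lemmas()[0].name().replace("_", " ")

print(lexicalize(wn.synset("dog.n.01")))               # -> 'dog'
print(lexicalize(wn.synset("dog.n.01"), prefer="pup")) # -> 'puppy'
```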
A Comparative Analysis of Major Ontology Repositories in China and Abroad
Bai Rujiang / Yu Xiaofan / Wang Xiaoyue, 2012-10-22 11:13:10, source: New Technology of Library and Information Service (Beijing), 2011, No. 1.
[English title] The Comparative Analysis of Major Ontology Repositories at Home and Abroad
[About the authors] Bai Rujiang, Yu Xiaofan, and Wang Xiaoyue, Institute of Scientific and Technical Information, Shandong University of Technology (Zibo 255049)
[Abstract] This paper introduces four major general-purpose ontology repositories from China and abroad, WordNet, DBpedia, Cyc, and HowNet, together with two fairly successful domain ontology repositories, a biomedical ontology and an enterprise ontology. It then compares and analyzes the general-purpose and domain repositories along five dimensions: description language, storage mode, query language, construction platform, and application domain, to assist scholars at home and abroad in the study of ontology repositories and their applications.
[Keywords] ontology repository / WordNet / DBpedia / Cyc / HowNet / biomedical ontology / enterprise ontology

1 Background

The concept of ontology originated in philosophy [1]. Serving as a semantic foundation, it has been widely applied in information retrieval, artificial intelligence, the semantic web, software engineering, natural language processing, e-commerce, knowledge management, and other fields.
WordNet for Lexical Cohesion Analysis
WordNet for Lexical Cohesion Analysis
Elke Teich(1) and Peter Fankhauser(2)
(1) Department of English Linguistics, Darmstadt University of Technology, Germany. Email: E.Teich@mx.uni-saarland.de
(2) Fraunhofer IPSI, Darmstadt, Germany. Email: fankhaus@ipsi.fraunhofer.de

Abstract. This paper describes an approach to the analysis of lexical cohesion using WordNet. The approach automatically annotates texts with potential cohesive ties, and supports various thesaurus-based and text-based search facilities as well as different views on the annotated texts. The purpose is to be able to investigate large amounts of text in order to get a clearer idea of the extent to which semantic relations are actually used to make texts lexically cohesive and which patterns of lexical cohesion can be detected.

1 Introduction

Using a thesaurus to annotate text with lexical cohesive ties is not a new idea. The original proposal is due to [1], who manually annotated a set of sample texts employing Roget's Thesaurus. With the development of WordNet [2,3], several proposals for automating this process have been made. For the purpose of detecting central text chunks which can be used for summarization (e.g. [4,5]), this seems to work reasonably well. But how well does an automatic process perform in terms of linguistic-descriptive accuracy? It is well known from the linguistic literature that any two words between which there exists a semantic relation may or may not attract each other so as to form a cohesive tie (cf. [6]). So when do we interpret a semantic relation between two or more words instantiated in text as cohesive or not? First, certain parts of speech may be more likely to contract lexical cohesive ties than others; e.g., nouns may be more likely to participate in substantive cohesive ties than verbs. Another motivation may be the type of vocabulary: special-purpose vocabulary may be more likely to contract cohesive ties than general vocabulary. Another possible hypothesis is that cohesive patterns differ with the type of text (register, genre). While repetition generally appears to be the dominant means of establishing lexical cohesion, the relative frequency of more complex relations, such as hyponymy or meronymy, may depend on the type of text.

In order to investigate such issues, large amounts of data annotated for lexical cohesion are needed. Manual analyses are very time-consuming and may not reach satisfactory intersubjective agreement. Completely automatic analysis may introduce significant noise due to ambiguity [4]. Thus, in this paper we follow an approach in the middle ground. We use the sense-tagged version of the Brown Corpus, where nouns, verbs, adjectives, and adverbs are manually disambiguated w.r.t. WordNet, and use WordNet to annotate the corpus with potential lexical cohesive ties (Section 2.1). We also describe facilities for filtering candidate ties and for generating different views on the annotated text (Section 2.2). We discuss the results on an exemplary basis, comparing the automatic annotation with a manual annotation (Section 3). We conclude with a summary and issues for future work (Section 4).

2 Lexical Cohesion Using WordNet

Lexical cohesion is commonly viewed as the central device for making texts hang together experientially, defining the aboutness of a text (field of discourse) (cf. [6, chapter 6]).
Along with reference, ellipsis/substitution, and conjunctive relations, lexical cohesion is said to formally realize the semantic coherence of texts, where lexical cohesion typically makes the most substantive contribution (according to [7], around fifty percent of a text's cohesive ties are lexical). The simplest type of lexical cohesion is repetition, either simple string repetition or repetition by means of inflectional and derivational variants of the word contracting a cohesive tie. The more complex types of lexical cohesion rely on the systemic semantic relations between words, which are organized in terms of sense relations (cf. [6, 278-282]). Any occurrence of repetition or of relatedness by sense relation can potentially form a cohesive tie.

2.1 Determining Potential Cohesive Ties

Most of the standard sense relations are provided by WordNet; thus it can form a suitable basis for automatic analysis of lexical cohesion. As the corpus, we use the Semantic Concordance version of the Brown Corpus, which comprises 352 texts (out of 500). Each text is segmented into paragraphs, sentences, and words, which are lemmatized and part-of-speech (pos) tagged. For 185 texts, nouns, verbs, adjectives, and adverbs are in addition sense-tagged with respect to WordNet 1.6, i.e., with few exceptions, they can be unambiguously mapped to a synset in WordNet. For the other 167 texts, only verbs are sense-tagged. Using these mappings, we determine potential cohesive ties as follows. For every sense-tagged word we compute its semantic neighborhood in WordNet, and for each synset in the semantic neighborhood we determine the first subsequent word that maps to the synset. For the semantic neighborhood we take into account and distinguish between most of the available kinds of relations in WordNet: synonyms, hyponyms, hypernyms, cohypernyms, cohyponyms, meronyms, holonyms, comeronyms, coholonyms, antonyms, the pos-specific relations alsoSee and similarTo for adjectives, entails and causes for verbs, and the (rather scarce) relations across parts of speech: attribute, participleOf, and pertainym. Where appropriate, the relations are defined transitively; for example, hypernyms comprise all direct and indirect hypernyms, and meronyms comprise all direct, indirect, and inherited meronyms. In addition, we also take into account lexical repetition (same pos and lemma, but not necessarily the same synset) and proper noun repetition. For each potential cohesive tie we determine in addition the number of intervening sentences, the distance of the participating words from a root in the WordNet hypernymy hierarchy, as a very rough measure of their specificity, and the length and branching factor of the underlying semantic relation, as very rough measures of its strength.

Due to the excessive computation of transitive closures for the semantic neighborhood of each (distinct) word, this initial step is fairly demanding computationally. In the current implementation (which can be optimized), processing a text with about two thousand words takes about 15 minutes on an average PC. Thus we perform this annotation offline.
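For readers who want to experiment, a neighborhood of roughly this kind can be assembled with NLTK's WordNet interface. The sketch below is a simplification, not the authors' implementation: it covers most but not all of the relation inventory above (comeronyms, coholonyms, and the cross-POS relations are omitted; meronyms and holonyms are direct only), and NLTK wraps a later WordNet version than the 1.6 used in the paper.

```python
from nltk.corpus import wordnet as wn

def neighborhood(synset):
    related = set()
    # Direct and indirect hypernyms / hyponyms (transitive closure).
    related |= set(synset.closure(lambda s: s.hypernyms()))
    related |= set(synset.closure(lambda s: s.hyponyms()))
    # Meronyms and holonyms (direct only, for brevity).
    related |= set(synset.part_meronyms() + synset.member_meronyms()
                   + synset.substance_meronyms())
    related |= set(synset.part_holonyms() + synset.member_holonyms()
                   + synset.substance_holonyms())
    # Cohyponyms: sisters under a shared direct hypernym.
    for h in synset.hypernyms():
        related |= set(h.hyponyms())
    # Adjective- and verb-specific relations.
    related |= set(synset.also_sees() + synset.similar_tos())
    related |= set(synset.entailments() + synset.causes())
    # Antonymy is defined on lemmas, not synsets.
    for lemma in synset.lemmas():
        related |= {a.synset() for a in lemma.antonyms()}
    related.discard(synset)
    return related

stress = wn.synsets("stress", pos=wn.NOUN)[0]
print(len(neighborhood(stress)))
```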
2.2 Constraints and Views on Lexical Cohesion

The automatically determined ties typically do not all contribute to lexical cohesion. Which ties are considered cohesive depends, among other things, on the type of text and on the purpose of the cohesion analysis. To facilitate a manual post-analysis of the ties we can filter them by simple constraints on pos, the specificity of the participating synsets, the kind, distance, and branching factor of the underlying relation, and the number of intervening sentences. This allows, for example, excluding very generic verbs such as "be", and focusing the analysis on particular relations, such as lexical repetition without synonymy (rare), or hypernyms only.

The remaining ties are combined into lexical chains in two passes. In the first pass, all transitively related (forward) ties are combined. The resulting chains are not necessarily disjoint, because there may be words w1, w2, w3, where w1 and w2 are tied to w3, but w1 is not tied to w2. This results in a fairly complex data structure, which is difficult to reason about and to visualize. Thus in a second pass, all chains that share at least one word are combined into a single chain. The resulting chains are disjoint w.r.t. words, and may optionally be further combined if they meet in a specified number of sentences.

To further analyze lexical cohesion, we have realized three views. In the text view, each lexical chain is highlighted with an individual color, in such a way that the colors of chains starting in succession are close. This view can give a quick grasp of the overall topic flow in the text to the extent that it is represented by lexical cohesion. The chain view presents chains as a table with one row for each sentence and a column for each chain, ordered by the number of its words. This view also reflects the topical organization fairly well by grouping the dominant chains closely. Finally, the tie view displays for each word all its (direct) cohesive ties together with their properties (kind, distance, etc.). This view is mainly useful for checking the automatically determined ties in detail. In addition, all views provide hyperlinks to the thesaurus for each word in a chain to explore its semantic context. Moreover, some statistics are given, such as the number of sentences linking to and linked from a sentence, and the relative percentage of ties contributing to a chain. Because filtering ties and combining them into chains can essentially be performed in two (linear) passes over the text, and the chains are rather small (between 2 and 200 words), producing these views takes about 2 seconds for the texts at hand, and thus can be performed online.

3 Discussion of Results

This section discusses the results of the automatic analysis on a sample basis, comparing the automatic analysis with a manual analysis of the first 20 sentences of three texts from the "learned" section of the Brown corpus (texts j32, j33, and j34). For the automatic analysis, only nouns and verbs which are at least three relations away from a hypernym root, and adjectives which are tied to a noun or a verb, have been included.
some differences in the internal make-up of the established chains.For example,in j32,dictionary and information build major chains in both analyses,but the information chain includes a few questionable ties in the automatic analysis,e.g.,list as an indirect hyponym.Also,the chain around form includes word and stem in the automatic analysis,which would befine,but then words like prefix,suffix,ending at later stages of the text are not included.The chain built around complement/predicator/subject in j33is separate from the sentence chain in the manual analysis,but in the automatic analysis the two are arranged in one chain due to meronymy, thus resulting in the strongest chain for this text in the automatic analysis.The mismatches arise due to the following reasons.(1)Missing relations.Only some relations across parts-of-speech and derivational relations are accounted for in WordNet,330 E.Teich,P.Fankhausere.g.,linguistic/linguistics but not linguist/linguistics or store/storage4.(2)Spurious relations. Without constraints on the length and/or branching factor of a transitive relation rather questionable ties are established,e.g.alphabetic character as a rather remote member of list.(3)Sense proliferation.In some instances the sense-tagging appears to be overly specific,e.g.,in j34,explanation as‘a statement that explains’vs.‘a thought that makes prehensible’.Using synonymy without repetition these senses do not form a tie.On the other hand,with repetition,some questionable ties are established,e.g.,for linguistic as ‘linguistic’vs.‘lingual’.(4)Compound terms.In some instances the manual analysis did not agree with the automatic analysis pound terms.E.g.tonal language is sense-tagged as a compound term and thus not included in the chain around tone/tonal.Generally,the unconstrained automatic annotation is too greedy,i.e.,too many relations are interpreted as ties.Unsatisfactory precision is not so much of a problem,however,because the annotation can be made more restrictive,e.g.,by including a list of stop words and/or by determining the appropriate maximal branching and maximal distance for each text to be analyzed.More serious is the problem of not getting all the relevant links,i.e.,unsatisfactory recall,usually due to missing relations in the thesaurus.To a certain extent this can be overcome by combining chains that meet in some minimal number of sentences,as a very specific form of collocation.4Summary and EnvoiWe have presented a method of analyzing lexical cohesion automatically.Even if the results of the automatic analysis do not match one-to-one with a manual cohesion analysis,the automatic analysis is not that far off.Some problems are inherent,others can be remedied(cf. 
Even with an imperfect analysis result, we get a valuable source for the linguistic investigation of lexical cohesion. Knowing that not all words that are semantically related contract cohesive ties, we can set out to determine the factors that constrain the deployment of sense relations for achieving cohesion by comparing the automatic annotation with a manual analysis. Also, we can give tentative answers to the questions posed in Section 1. Looking at part of speech, we can confirm that the strongest chains are established along nouns, and the strongest chains are established along the special-purpose vocabulary rather than the general vocabulary. Moreover, although repetition (and synonymy) is the most-used cohesive device, the frequency of the other relations taken together (in particular hyponymy and meronymy) about matches that of repetition for the texts at hand.

Future linguistic investigations will be dedicated to questions of this kind on a more principled basis. In order to get a more precise idea of the reliability of the automatic analysis, we are carrying out manual analyses on a principled selection of texts from the corpus (e.g., larger samples covering all registers in the corpus) and comparing the results with those of the automatic analysis. Moreover, we plan to investigate cohesion patterns, such as the relative frequency of repetition vs. other types of relations, for the different registers. Ultimately, what we are after are cross-linguistic comparisons, including the comparison of translations with original texts in the same language as the target language [10].

References
1. Morris, J., Hirst, G.: Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics 17 (1991) 21-48.
2. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to WordNet: An on-line lexical database. Journal of Lexicography 3 (1990) 235-244.
3. Fellbaum, C., ed.: WordNet: An electronic lexical database. MIT Press, Cambridge (1998).
4. Barzilay, R., Elhadad, M.: Using lexical chains for text summarization. In: Proceedings of ISTS'97, ACL, Madrid, Spain (1997).
5. Silber, H.G., McCoy, K.F.: Efficient text summarization using lexical chains. In: Proceedings of Intelligent User Interfaces 2000 (2000).
6. Halliday, M., Hasan, R.: Cohesion in English. Longman, London (1976).
7. Hasan, R.: Coherence and cohesive harmony. In Flood, J., ed.: Understanding Reading Comprehension. International Reading Association, Delaware (1984) 181-219.
8. Fankhauser, P., Klement, T.: XML for data warehousing - chances and challenges. In: Proceedings of DaWaK03, LNCS 2737, Prague, CR, Springer (2003) 1-3.
9. Hoey, M.: Patterns of lexis in text. Oxford University Press, Oxford (1991).
10. Teich, E.: Cross-linguistic variation in system and text. A methodology for the investigation of translations and comparable texts. de Gruyter, Berlin and New York (2003).
An Introduction to the Development of WordNet
WordNet's main advance from a simple "dictionary browser" to a self-contained lexical database began in early 1989. Susan Chipman, dissatisfied with WordNet existing merely as a lexical browser, asked the research group to develop a tool that, built on WordNet, could read a text and report various kinds of information about the words in it: the so-called "Word Filter". Rare or unwanted words could be filtered out of a novel's text, and more common words could replace them. This work quickly made the group realize that inflected word forms had to be handled, which raised a set of morphological problems: WordNet contains only the base forms of words, so if "ships" occurred in a text, WordNet could not recognize it. Richard Beckwith and Michael Colon wrote a program called Morphy that could map a form like "ships" in running text to the word form "ship". By September 1989, WordNet could process inflected forms in text and find the corresponding base forms in the lexicon.
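NLTK ships a reimplementation of this Morphy lookup, so the behavior described above is easy to reproduce (a quick illustration, not the original 1989 program):

```python
from nltk.corpus import wordnet as wn

# Morphy reduces an inflected form to a base form WordNet contains.
print(wn.morphy("ships"))             # -> 'ship'
print(wn.morphy("running", wn.VERB))  # -> 'run'
```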
The words in WordNet came from various sources: the Brown Corpus, Laurence Urdang's small dictionary of synonyms and antonyms (1978), Urdang's revision of the Rodale synonym dictionary (1978), and Robert Chapman's 4th edition of Roget's Thesaurus (1977). In late 1986, Miller received a word list from Fred Chang of the Naval Research and Development Center and compared it with WordNet's existing vocabulary; the discouraging result was an overlap of only 15%, so Miller added Chang's list to WordNet. In 1993, Miller obtained a list of 39,143 words from Ralph Grishman and his colleagues at New York University, in effect the vocabulary of the well-known COMLEX dictionary; this comparison showed that WordNet contained only 74% of the COMLEX words, so Miller added this list to WordNet as well.

WordNet's nouns are organized into 25 semantic classes, each kept in its own file: act/action (动作行为), animal/fauna (动物), artifact (人工物), attribute (属性), body (身体), cognition/knowledge (认知/知识), communication (通信), event/happening (事件), feeling/emotion (情感), food (食物), group/grouping (团体), location (处所), motivation/motive (动机), natural object (自然物), natural phenomenon (自然现象), person/human being (人类), plant/flora (植物), possession (所有物), process (过程), quantity/amount (数量), relation (关系), shape (外形), state (状态), substance (物质), time (时间). These 25 classes can be further generalized into 11 basic classes. The 25 noun source files formed from the 25 semantic classes are in general semantically rather shallow in their hierarchies.
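These 25 classes survive in modern WordNet distributions as the noun "lexicographer files", and NLTK exposes the class of any synset directly; a small illustration:

```python
from nltk.corpus import wordnet as wn

# lexname() returns the lexicographer file, i.e. the 25-way noun class.
for word in ["dog", "ship", "anger"]:
    synset = wn.synsets(word, pos=wn.NOUN)[0]
    print(word, "->", synset.lexname())
# dog -> noun.animal, ship -> noun.artifact, anger -> noun.feeling
```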
Fall 2017 Semester, "Online Information and Document Retrieval": Online Assignment
Fall 2017 semester, "Online Information and Document Retrieval" online assignment. Total: 100 points.

I. Single-choice questions (20 questions, 40 points; 2 points each)
1. To retrieve, in the title field, all records containing either of the English words "color" or "colour" in one search, which truncation symbol should be used? A. ?  B. *  C. ??  D. !
2. Which of the following is not a characteristic of open access? A. convenient submission  B. rapid publication  C. high publication costs  D. convenient retrieval
3. To grasp fairly clearly the whole research history and results of a topic, one should use A. direct retrieval  B. the cyclic method  C. the sequential method  D. retrospective retrieval
4. The "exclude from results" function in databases corresponds to which logical operator? A. NOT  B. AND  C. OR  D. ()
5. Which of the following is not a way of organizing information resources by their content features? A. classification  B. subject indexing  C. classification-subject indexing  D. sequencing
6. The "search within results" function in databases corresponds to which logical operator? A. NOT  B. AND  C. OR  D. ()
7. Which of the following is not a form that network information resources take? A. text  B. sound  C. video  D. physical objects
8. To find information on a particular person's research output, the search field to use is A. title  B. author  C. keyword  D. classification
9. The device that joins several search fields into one search strategy is A. positional operators  B. parentheses  C. mathematical symbols  D. logical operators
10. Information that has not been formally published or circulated in society is A. zero-order information  B. primary information  C. secondary information  D. tertiary information
11. Which of the following databases does not provide full-text books? A. Chaoxing e-books  B. the Nankai University e-book platform  C. Duxiu academic search  D. CSSCI
12. To retrieve a specific information resource, such as an article or a book, the method to use is A. indirect retrieval  B. direct retrieval  C. the retrospective method  D. the alternating method
13. Retrieval quality is affected by many factors. Which of the following is not one of them? A. network conditions  B. the order of search terms  C. the information contained in the database system  D. the search expression
14. To survey the state of the art of academic research, the database type to choose first is A. numeric databases  B. academic databases  C. multimedia databases  D. entertainment databases
15. In a search engine, how is truncation used to cover the different spellings of "computer"? A. comput*  B. comput??  C. comput?  D. comput~
16. Using direct retrieval, which of the following cannot be retrieved? A. a paper  B. an author's complete output  C. a book  D. a conference paper
17. To give part of a search expression processing priority, which symbol is used? A. {}  B. []  C. ()  D. 【】
18. In network information retrieval, some metrics constrain one another; the metric that trades off against recall is A. miss rate  B. false-drop rate  C. precision  D. response time
19. To find the social impact of an author or an article, the best search approach is A. advanced search  B. professional search  C. simple search  D. citation search
20. Which of the following databases is not based on the retrospective retrieval principle? A. CSSCI  B. SCI  C. CNKI  D. SSCI

II. Multiple-choice questions (10 questions, 20 points; 2 points each)
1. For the literature review and proposal of a thesis, the types of network information resources to search generally include A. journal articles  B. conference papers  C. dissertations  D. video courses
2. When writing a paper, for statistics, prices, and similar information one should use A. dissertation databases  B. patent databases  C. empirical-data databases  D. yearbook databases
3. There are many kinds of network retrieval tools; which of the following are not commonly used ones? A. the Wanfang database  B. the CNKI database  C. the Xinhua Dictionary  D. the Encyclopedia of China
4. Which common functions do databases of different types and interfaces generally provide? A. classified browsing  B. advanced search  C. simple search  D. sub-database selection
5. Which of the following are not classifications of information resources by form of publication? A. full-text information  B. formal information  C. subjective information  D. catalog information
6. Empirical-data databases can supply which materials for paper writing? A. monographs  B. data  C. cases  D. patents
7. By the position of the truncation symbol, truncation retrieval is divided into A. right truncation  B. middle truncation  C. both-end truncation  D. left truncation
8. Which of the following are not contents of academic databases? A. songs  B. fitness lecture videos  C. film and television works  D. mock examination systems
9. Which of the following are not steps in the network information retrieval process? A. analyzing the information need  B. choosing a retrieval system  C. viewing the full text  D. indexing information resources
10. Which of the following belong to network information retrieval in the broad sense? A. indexing information  B. classifying information  C. extracting subject terms  D. storing the indexed information

III. True/false questions (20 questions, 40 points; 2 points each; options A. False, B. True)
1. Databases of all types generally provide a search-within-results function.
2. Journal databases can help complete part of the reference-editing work during paper writing.
3. Network and traditional information resources still differ in quantity and content, so both should be used according to the search need to ensure comprehensiveness.
4. In one retrieval system, such as CNKI or the VIP journal database, information can be organized by several organization methods.
5. Duxiu academic search provides documents in the form of books only.
6. In the same retrieval system, searching the same author name in the "author" field and in the "first author" field returns different numbers of results.
7. Readers must pay to see the full text of open access documents.
8. Every database exists independently, with nothing similar or in common with the others.
9. When a book found in Duxiu is held by one's library neither in print nor as an e-book, part of its full text can be obtained through the library's document delivery service.
10. The CNKI database can display search results grouped by discipline.
11. Network information resources fully cover traditional information resources; their coverage is identical.
12. Duxiu gives no information about other libraries' holdings of a given book.
13. The ISSN of a network information resource is unique, so searching by it improves precision.
14. When quoting from another author's article or book, the source should be cited to protect the author's intellectual property.
15. Electronic journals are faster and more convenient to search and classify than print journals.
16. The Duxiu database can generate a chronological trend chart of the literature.
17. Neither Duxiu academic search nor the CNKI database provides browsing by category.
18. Open access documents have not undergone peer review.
19. To find a specific research result, one should search by the external features of the information resource.
20. In database retrieval, several fields can be combined to obtain precise results.
A Chinese Corpus with Word Sense Annotation
A Chinese Corpus with Word Sense Annotation
Yunfang Wu, Peng Jin, Yangsen Zhang, Shiwen Yu
Institute of Computational Linguistics, Peking University, Beijing, China, 100871
{wuyf, jandp, yszhang, yusw}@

Abstract. This paper presents the construction of a Chinese word sense-tagged corpus. The resulting lexical resource includes mainly three components: 1) a corpus annotated with word senses; 2) a lexicon containing sense distinctions and descriptions in the feature-based formalism; 3) the linking between the sense entries in the lexicon and CCD synsets. A dynamic model is put forward to build the three knowledge bases simultaneously and interactively. The strategy for improving consistency is addressed, since consistency is a thorny issue in constructing semantic resources. The inter-annotator agreement of the sense-tagged corpus is satisfactory. The database will grow into a powerful lexical resource both for linguistic research on Chinese lexical semantics and for word sense disambiguation.

Keywords: sense, sense annotation, sense disambiguation, lexical semantics

1 Introduction

There is a strong need for a large-scale Chinese corpus annotated with word senses, both for word sense disambiguation (WSD) and for linguistic research. Although much research has been carried out, WSD techniques still have a long way to go to meet the requirements of practical NLP programs such as machine translation and information retrieval. Although plenty of unsupervised learning algorithms have been put forward, SENSEVAL ([1]) evaluation results show that supervised learning approaches are in general much better than unsupervised ones. Undoubtedly, high-accuracy WSD needs a large-scale word sense tagged corpus as training material ([2]). It has been argued that no fundamental progress in WSD can be made until large-scale lexical resources are built ([3]). The absence of a large-scale sense-tagged corpus remains one of the most critical bottlenecks for successful WSD programs. In English, the word sense annotated corpus SEMCOR (Semantic Concordances) ([4]) has been built; it was later trained and tested on by many WSD systems and stimulated a large amount of WSD work. In the field of Chinese corpus construction, plenty of attention has been paid to POS tagging and syntactic bracketing, for instance in the Penn Chinese Treebank ([5]) and the Sinica Corpus ([6]), but very limited work has been done on semantic annotation. Semantic dependency knowledge has been annotated in a Chinese corpus ([7]), which is different from word sense tagging (WST) oriented towards WSD. [8] introduced the Sinica sense-based lexical knowledge base, but the Chinese prevalent in Taiwan is not the same as mandarin Chinese. SENSEVAL-3 ([1]) provides a Chinese word sense annotated corpus, which contains 20 words and 15 sentences per meaning for most words, but the data is obviously too limited for wide-coverage, high-accuracy WSD systems.

This paper is devoted to building a large-scale Chinese corpus annotated with word senses, and the ambitious goal of the work is to build a comprehensive resource for Chinese lexical semantics. The resulting lexical knowledge base will contain three major components: 1) a corpus annotated with word senses; 2) a lexicon containing sense distinctions and descriptions; 3) the linking between the lexicon and the Chinese Concept Dictionary (CCD) ([9]). The paper is organized as follows.
The corpus, the lexicon, CCD, and the interactive model are presented in detail in section 2, in four subsections. Section 3 discusses the strategy adopted to improve consistency. Section 4 is the evaluation of the corpus. Finally, in section 5, conclusions are drawn and future work is presented.

2 The interactive construction

2.1 The corpus

At the present stage the sense-tagged corpus mainly comes from three months' text of People's Daily; later we will move to other kinds of texts for the sake of corpus balance. The input data for WST is POS-tagged using Peking University's POS tagger. The high precision of Chinese POS tagging lays a sound foundation for research on sense annotation; the emphasis of WST therefore falls on ambiguous words within the same POS. All the ambiguous words in the corpus will be analyzed and annotated, which differs from most previous work, which focuses on a limited set of specific words.

2.2 The lexicon

Defining senses has long been one of the most heatedly discussed topics in lexical semantics. The representation of word senses in the lexicon should be valid and efficient for WSD algorithms applied to the corpus. Human beings make use of the context, the communication setting, and sometimes even world knowledge to disambiguate word senses; for machine understanding, the last resort for WSD is the context. In this paper the feature-based formalism is adopted to describe word senses. The features, which appear in the form "Attribute = Value", can incorporate extensive distributional information about a word sense. The many different features together constitute the representation of a sense; the natural-language definitions of meaning serve only as references for human readers. An example of the feature-based description of meaning is shown in Table 1 for some senses of the verb 开/kai1.

Table 1. The feature-based description of some senses of the verb 开/kai1
开1 'open': subcategory=[NP], valence=2, subject=human
开2 'unfold': subcategory=[~], valence=1, subject=~human
开3 'pay': subcategory=[NP], valence=2, subject=human, object=asset
开4 'drive': subcategory=[NP], valence=2, subject=human, object=vehicle
开5 'establish': subcategory=[NP], valence=2, subject=human, object=building
开6 'hold': subcategory=[NP], valence=2, subject=human, object=event
开7 'write out': subcategory=[NP], valence=2, subject=human, object=bill
开8 'used after verbs': syntactic positions V+~, V+de+~, V+bu+~

Thus 开④, for example, can be defined by a set of features:
开④ {subcategory=[NP], valence=2, subject=human, object=vehicle, …}

The granularity of sense distinctions has long been a thorny issue for WSD programs. Features that are not specified in the lexicon are not regarded as distinguishing factors when discriminating word senses, such as the goal or the cause of a verb's action. That is, finer sense distinctions without clear distributional indicators are ignored. As a result, the senses defined in our lexicon are somewhat coarse-grained compared with the senses in conventional dictionaries, and inter-annotator agreement is more easily reached. [10] argued that the standard fine-grained division of senses for human readers may not be appropriate for the computational WSD task, and that the level of sense discrimination that NLP needs corresponds roughly to homographs. The sense distinctions in our lexicon lie between fine-grained senses and homographs. With the feature-based description as the base, the computer can accordingly perform unification in WSD algorithms, and the human annotator can correctly identify the meaning by reading the sentence context.
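To make the unification idea concrete, here is a minimal sketch; the sense identifiers and the assumption that context features arrive as a flat attribute-value dictionary are illustrative simplifications, not the project's actual data format.

```python
# Feature-based sense entries, following Table 1 above.
SENSES = {
    "kai1_4": {"subcategory": "[NP]", "valence": 2,
               "subject": "human", "object": "vehicle"},  # 开4 'drive'
    "kai1_7": {"subcategory": "[NP]", "valence": 2,
               "subject": "human", "object": "bill"},     # 开7 'write out'
}

def unify(sense_features, context_features):
    """A sense matches if every feature it specifies agrees with the
    context; unspecified features impose no constraint."""
    return all(context_features.get(attr) == val
               for attr, val in sense_features.items())

context = {"subcategory": "[NP]", "valence": 2,
           "subject": "human", "object": "vehicle"}
print([s for s, feats in SENSES.items() if unify(feats, context)])
# -> ['kai1_4']
```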
What's more, using feature-based formalisms the syntax/semantics interface can easily be realized ([11]).

2.3 The interactive construction

To achieve the ambitious goal of constructing a comprehensive resource for Chinese lexical semantics, the lexicon containing sense descriptions and the corpus annotated with senses are built interactively, simultaneously, and dynamically. On the one hand, the sense distinctions rely heavily on the corpus rather than on the annotator's introspection. The annotator can add or delete a sense entry, and can also edit a sense description according to the actual word uses in the corpus. This strategy conforms to the spirit of lexicon construction today, namely committing to corpus evidence for semantic and syntactic generalization, as the Berkeley FrameNet project did ([12]). On the other hand, using the sense information specified in the lexicon, the human annotators assign semantic tags to all the instances of a word in the corpus. The knowledge base of lexical semantics built here can be viewed either as a corpus in which word senses have been tagged, or as a lexicon in which example sentences can be found for any sense entry. The lexicon and the corpus can also be split into separate lexical resources to study when needed. SEMCOR provides an important dynamic model ([4]) for sense-tagging programs, where the tagging process is used as a vehicle for improving WordNet coverage. Prior to SEMCOR, WordNet already existed, and the tagging served only to improve its coverage. However, a Chinese lexicon containing sense descriptions oriented towards computer understanding had not yet been built, so here the lexicon and the corpus are built in the same procedure. To some extent we can say that the interactive model adopted here is more fundamental than SEMCOR's.

A software tool was developed in Java as the word sense tagging interface (Figure 1). The interface embodies the spirit of interactive construction. The upper section displays the context in the corpus, with the target ambiguous word highlighted; the range of the context can be adjusted as required. The word senses with feature-based descriptions from the lexicon are displayed in the bottom section. Reading the context, the human annotator decides whether to add or delete a sense entry, and can also edit a sense's feature description. The annotator clicks on the appropriate entry to assign a sense tag to the word occurrence. A sample sentence can be selected from the corpus and added automatically to the corresponding sense entry in the lexicon.

Fig. 1. The word sense tagging interface

2.4 Linking senses to CCD synsets

The feature-based description of word meaning discussed in 2.2 captures mainly syntagmatic information and cannot express paradigmatic relations. A lexical knowledge base is well defined only when both syntagmatic and paradigmatic information are included. WordNet has been widely used in WSD research, and we are establishing a link between the sense entries in the lexicon and the synsets in WordNet. CCD is a WordNet-like Chinese lexicon ([9]), which carries the main relations defined in WordNet and is a bilingual concept lexicon in which parallel Chinese-English concepts are displayed simultaneously. The offset number of the corresponding synset in CCD is used to encode the linking, which expresses the structural information of the tree.
Through the linking to CCD, the paradigmatic relations (such as hypernym/hyponym, meronym/holonym) between word senses in the lexicon can be reasonably acquired. Once the linking has been established, the many existing WSD approaches based on WordNet can be trained and tested on the Chinese sense-tagged corpus.

The linking is currently done manually. In the word sense tagging interface (Figure 1), the different synsets of the word in CCD, along with the hypernyms of each sense (expressed by the first word in a synset), are displayed in the right section. A synset selection window (named "Set synsets") contains the offset numbers of the synsets. The annotator clicks on the appropriate box(es) to assign one or more synsets to the currently selected sense in the lexicon.

Table 2. An example of the linking between sense entries in the lexicon and CCD synsets
迷惑1 'puzzle': CCD synset offsets 00419748, 00421101; subcategory=[~], valence=1, subject=human
迷惑2 'mislead': CCD synset offsets 00361554, 00419112; subcategory=[NP], valence=2, subject=~human

3 Keeping consistency

The lexical semantic resource is manually constructed. Consistency is always an important concern for a hand-annotated corpus, and it is even more critical for a sense-tagged corpus because of the subtle meanings to be handled.

In fact, the motivation for the feature-based description of word meaning is precisely to keep consistency (as discussed in 2.2). The "Attribute = Value" pairs specified in the lexicon clearly describe the distributional context of word senses, giving the annotator clear-cut distinctions between senses. In the tagging process the annotators read through the text, "run the unification" against the features of the senses, and select the most appropriate sense tag. Operational guidelines for sense distinction can be set up on the basis of the features, so that a unified principle is followed when distinguishing the senses of different words.

Another observation is that consistency is easier to keep when an annotator handles many instances of the same word than when handling many different words in a given time frame, because the former lets the annotator build an integrated picture of a specific word and its sense distinctions. The word sense tagging interface (Figure 1) provides a tool (the window named "Find/Replace") that lets the annotator search for a specific word and tag all its occurrences in the same period of time, rather than moving sequentially through the text as SEMCOR did ([4]).

Checking is of course a necessary procedure for keeping consistency. The annotators are also checkers who check one another's work: a text is generally first tagged by one annotator and then verified by two checkers. There are five annotators in the program, three majoring in linguistics and two in computational linguistics. A heated discussion inevitably happens when views differ; after discussion, the disagreement between annotators is greatly reduced. Checking all the instances of a word in a given time frame greatly improves precision and speed, just as in tagging. A software tool gathers all the occurrences of a word in the corpus into a checking file in the KWIC (Key Word in Context) format, ordered by sense tag; inconsistencies can thus be found quickly and reliably. Figure 2 illustrates some example sentences containing different senses of the verb 开/kai1.
The checking file enables the checker to examine closely how the senses are used and distributed, and to form a general view of how the sense distinctions are made.

Fig. 2. Some example sentences of the verb 开/kai1

4 Evaluation

4.1 Inter-annotator agreement

The agreement rate between human annotators on word sense annotation is an important concern both for the evaluation of WSD algorithms and for the word sense tagged corpus. Suppose there are n occurrences of ambiguous target words in the corpus, and let m be the number of tokens assigned the identical sense by two human annotators (the annotator and the checker in this program). A simple measure of the agreement rate between two annotators is p = m/n.

Table 3. Inter-annotator agreement
POS | Num of words | n | m | p
N | 813 | 19,197 | 17,757 | 92.5%
V | 132 | 41,698 | 33,900 | 81.3%
ALL | 945 | 60,895 | 51,657 | 84.8%

Table 3 summarizes the inter-annotator agreement in the sense-tagged corpus. Combining nouns and verbs, the agreement rate reaches 84.8%, which is comparable to the agreement figures reported in the literature. [13] mentions that for the SENSEVAL-3 lexical sample task there was 67.3% agreement between the first two taggings. Table 3 also shows that the inter-annotator agreement for nouns is clearly higher than for verbs.

4.2 The state of the art

The project is ongoing. Up to now, 813 nouns and 132 verbs have been analyzed and described in the lexicon with the feature-based formalism, and 60,895 word occurrences in three months of People's Daily text have been sense-tagged. This is currently about the largest sense-tagged corpus for mandarin Chinese.

Table 4. Comparison between the SENSEVAL-3 Chinese sense-tagged data and the sense-tagged corpus of People's Daily
 | SENSEVAL-3 Chinese sense-tagged data | The Chinese sense-tagged corpus
ambiguous words | 20 | 945
senses | 81 | 2,268
word occurrences (including training and test data) | 1,172 | 60,895

5 Conclusion and future work

This paper describes the construction of a sense-tagged Chinese corpus. The goal is to create a valuable resource both for word sense disambiguation and for research on Chinese lexical semantics. Some research on Chinese word senses has already been carried out on the corpus, and supervised learning approaches such as SVM, maximum entropy, and Bayes algorithms have been trained and tested on it. A small part of the sense-tagged corpus has been published on the web. Later we will move to other kinds of texts on account of corpus balance and data sparseness. Analyzing more ambiguous words, describing more senses in the lexicon, and annotating more word instances in the corpus are the next urgent tasks, as are training algorithms and developing software tools for semi-automatic sense tagging.

Acknowledgments. This research is funded by the National Basic Research Program of China (No. 2004CB318102).

References
1. SENSEVAL:
2. Ng, H. T.: Getting Serious about Word Sense Disambiguation. In: Proceedings of the ANLP-97 Workshop on Tagging Text with Lexical Semantics: Why, What, and How? (1997)
3. Veronis, J.: Sense Tagging: Does It Make Sense? In: Wilson et al. (eds.): Corpus Linguistics by the Rule: a Festschrift for Geoffrey Leech (2003)
4. Landes, S., Leacock, C., Tengi, R.I.: Building Semantic Concordances. In: Fellbaum, C. (ed.): WordNet: an Electronic Lexical Database. MIT Press, Cambridge (1999)
References

1. SENSEVAL:
2. Ng, H. T.: Getting Serious about Word Sense Disambiguation. In: Proceedings of the ANLP-97 Workshop on Tagging Text with Lexical Semantics: Why, What, and How? (1997)
3. Veronis, J.: Sense Tagging: Does It Make Sense? In: Wilson et al. (eds.): Corpus Linguistics by the Rule: a Festschrift for Geoffrey Leech (2003)
4. Landes, S., Leacock, C., Tengi, R. I.: Building Semantic Concordances. In: Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1999)
5. Xue, N., Chiou, F. D., Palmer, M.: Building a Large-Scale Annotated Chinese Corpus. In: Proceedings of COLING (2002)
6. Huang, Ch. R., Chen, K. J.: A Chinese Corpus for Linguistics Research. In: Proceedings of COLING (1992)
7. Li, M. Q., Li, J. Z., Dong, Zh. D., Wang, Z. Y., Lu, D. J.: Building a Large Chinese Corpus Annotated with Semantic Dependency. In: Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing (2003)
8. Huang, Ch. R., Chen, Ch. L., Weng, C. X., Chen, K. J.: The Sinica Sense Management System: Design and Implementation. In: Recent Advancement in Chinese Lexical Semantics (2004)
9. Liu, Y., Yu, S. W., Yu, J. S.: Building a Bilingual WordNet-like Lexicon: the New Approach and Algorithms. In: Proceedings of COLING (2002)
10. Ide, N., Wilks, Y.: Making Sense About Sense. In: Agirre, E., Edmonds, P. (eds.): Word Sense Disambiguation: Algorithms and Applications. Springer (2006)
11. Nerbonne, J.: Computational Semantics – Linguistics and Processing. In: Lappin, S. (ed.): The Handbook of Contemporary Semantic Theory. Foreign Language Teaching and Research Press and Blackwell Publishers Ltd (2001)
12. Baker, C. F., Fillmore, C. J., Lowe, J. B.: The Berkeley FrameNet Project. In: Proceedings of COLING-ACL (1998)
13. Mihalcea, R., Chklovski, T., Kilgarriff, A.: The SENSEVAL-3 English Lexical Sample Task. In: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (2004)
WordNet Introduction

…written by researchers.
Chapters 5 and 6 describe enhancements to WordNet; Chapter 7 describes WordNet from the perspective of formal concept analysis; Chapters 8 through 16 discuss a variety of WordNet applications.
(1) Computers and the lexicon

· Even someone who does not accept the metaphor that likens the human brain to a computer must agree that the computer provides a good proving ground on which theoretical models of human cognitive capacities can be tested.
· More and more people recognize that a large lexicon is of great value to natural language understanding and to every aspect of artificial intelligence research.
· The demand for large-scale machine-readable dictionaries also raises many fundamental questions.
First, how should such a lexicon be constructed: compiled by hand or generated automatically by machine? Second, what kind of information should the dictionary contain? Third, how should the dictionary be designed, that is, how should the information be organized, and how should users access it? In effect, these questions touch on a series of very basic issues: how a dictionary is compiled, what a dictionary contains, and how a dictionary is used.
(2) Constructing the lexical database

· There are two basic ways to build a lexicon: automatic acquisition or manual compilation.
One advantage of building a lexicon by hand is that richer entry information can be created; another is that the process is easier to control.
(3) The content of WordNet

· The units WordNet describes include compounds, phrasal verbs, collocations, idiomatic phrases, and words, among which the word is the most basic unit.
· WordNet does not decompose words into smaller meaningful units (that is the approach of componential analysis); nor does WordNet contain organizational units larger than the word (such as scripts or frames). Because WordNet handles the four open word classes in separate files, it also contains no syntactic information about words. WordNet does include compact phrases such as bad person, linguistic units that cannot be interpreted as single words.
To appear in WordNet: An Electronic Lexical Database and Some of its Applications, Christiane Fellbaum (ed.), MIT Press.

Automated Discovery of WordNet Relations

Marti A. Hearst

1 Introduction

The WordNet lexical database is now quite large and offers broad coverage of general lexical relations in English. As is evident in this volume, WordNet has been employed as a resource for many applications in natural language processing (NLP) and information retrieval (IR). However, many potentially useful lexical relations are currently missing from WordNet. Some of these relations, while useful for NLP and IR applications, are not necessarily appropriate for a general, domain-independent lexical database. For example, WordNet's coverage of proper nouns is rather sparse, but proper nouns are often very important in application tasks.

The standard way lexicographers find new relations is to look through huge lists of concordance lines. However, culling through long lists of concordance lines can be a rather daunting task (Church and Hanks, 1990), so a method that picks out those lines that are very likely to hold relations of interest should be an improvement over more traditional techniques.

This chapter describes a method for the automatic discovery of WordNet-style lexico-semantic relations by searching for corresponding lexico-syntactic patterns in large text corpora. Large text corpora are now widely available, and can be viewed as vast resources from which to mine lexical, syntactic, and semantic information. This idea is reminiscent of what is known as "data mining" in the artificial intelligence literature (Fayyad and Uthurusamy, 1996); however, in this case the ore is raw text rather than tables of numerical data.

The Lexico-Syntactic Pattern Extraction (LSPE) method is meant to be useful as an automated or semi-automated aid for lexicographers and builders of domain-dependent knowledge bases. The LSPE technique is lightweight; it does not require a knowledge base or complex interpretation modules in order to suggest new WordNet relations. Instead, promising lexical relations are plucked out of large text collections in their original form. LSPE has the advantage of not requiring the use of detailed inference procedures. However, the results are not comprehensive; that is, not all missing relations will be found. Rather, suggestions can only be made based on what the text collection has to offer in the appropriate form.

Recent work in the detection of semantically related nouns via, for example, shared argument structures (Hindle, 1990) and shared dictionary definition context (Wilks et al., 1990) attempts to infer relationships among lexical items by determining which terms are related using statistical measures over large text collections. LSPE has a similar goal but uses a quite different approach, since only one instance of a relation need be encountered in a text in order to suggest its viability. It is hoped that this algorithm and its extensions, by supplying explicit semantic relation information, can be used in conjunction with algorithms that detect statistical regularities to add a useful technique to the lexicon development toolbox.

It should also be noted that LSPE is of interest not only for its potential as an aid for lexicographers and knowledge-base builders, but also for what it implies about the structure of English and related languages; namely, that certain lexico-syntactic patterns unambiguously indicate certain semantic relations.

The following section describes the lexico-syntactic patterns and the LSPE acquisition technique. This is followed by an investigation of the results obtained by applying this algorithm to newspaper text, and an analysis of how these results map onto the WordNet noun network. This is followed by a discussion of related work and a chapter summary.
2 The Acquisition Algorithm

Surprisingly useful information can be found by using only very simple analysis techniques on unrestricted text. Consider the following sentence, taken from Grolier's American Academic Encyclopedia (Grolier, 1990):

(S1) Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.

Most readers who have never before encountered the term "Gelidium" will nevertheless infer from this sentence that "Gelidium" is a kind of "red algae". This is true even if the reader has only a fuzzy conception of what red algae is. Note that the author of the sentence is not deliberately defining the term, as would a dictionary or a children's book containing a didactic sentence like Gelidium is a kind of red algae. However, the semantics of the lexico-syntactic construction indicated by the pattern:

(1a) NP0 such as NP1 {, NP2 ..., (and|or) NPi}, i ≥ 1

are such that they imply

(1b) for all NPi, i ≥ 1, hyponym(NPi, NP0)

Thus from sentence (S1) we conclude hyponym("Gelidium", "red algae").

This chapter assumes the standard WordNet definition of hyponymy, that is: a concept represented by a lexical item L0 is said to be a hyponym of the concept represented by a lexical item L1 if native speakers of English accept sentences constructed from the frame An L0 is a (kind of) L1. Here L1 is the hypernym of L0, and the relationship is transitive.

Example (1a-b) illustrates a simple way to uncover a hyponymic lexical relationship between two or more noun phrases in a naturally occurring text. This approach is similar in spirit to the pattern-based interpretation techniques used in the processing of machine-readable dictionaries, e.g., (Alshawi, 1987; Markowitz, Ahlswede, and Evens, 1986; Nakamura and Nagao, 1988).
Thus the interpretation of sentence (S1) according to (1a-b) is an application of pattern-based relation recognition to general text.

There are many ways that the structure of a language can indicate the meanings of lexical items, but the difficulty lies in finding constructions that frequently and reliably indicate the relation of interest. It might seem that because free text is so varied in form and content (as compared with the somewhat regular structure of the dictionary), it may not be possible to find such constructions. However, there is a set of lexico-syntactic patterns, including the one shown in (1a) above, that indicate the hyponym relation and that satisfy the following desiderata:

(i) They occur frequently and in many text genres.
(ii) They (almost) always indicate the relation of interest.
(iii) They can be recognized with little or no pre-encoded knowledge.

Item (i) indicates that the pattern will result in the discovery of many instances of the relation, item (ii) that the information extracted will not be erroneous, and item (iii) that making use of the pattern does not require the tools that it is intended to help build.

2.1 Lexico-Syntactic Patterns for Hyponymy

This section illustrates the LSPE technique by considering the lexico-syntactic patterns suitable for the discovery of hyponym relations. Since only a subset of the possible instances of the hyponymy relation will appear in a particular form, we need to make use of as many patterns as possible. Below is a list of lexico-syntactic patterns that indicate the hyponym relation, followed by illustrative sentence fragments and the predicates that can be derived from them (detail about the environment surrounding the patterns is omitted for simplicity):

(2) such NP as {NP,}* {(or|and)} NP
... works by such authors as Herrick, Goldsmith, and Shakespeare.
=⇒ hyponym("author", "Herrick"), hyponym("author", "Goldsmith"), hyponym("author", "Shakespeare")

(3) NP {, NP}* {,} or other NP
Bruises, ..., broken bones or other injuries ...
=⇒ hyponym("bruise", "injury"), hyponym("broken bone", "injury")

(4) NP {, NP}* {,} and other NP
... temples, treasuries, and other important civic buildings.
=⇒ hyponym("temple", "civic building"), hyponym("treasury", "civic building")

(5) NP {,} including {NP,}* {or|and} NP
All common-law countries, including Canada and England ...
=⇒ hyponym("Canada", "common-law country"), hyponym("England", "common-law country")

(6) NP {,} especially {NP,}* {or|and} NP
... most European countries, especially France, England, and Spain.
=⇒ hyponym("France", "European country"), hyponym("England", "European country"), hyponym("Spain", "European country")

When a relation hyponym(NP0, NP1) is discovered, aside from some lemmatization and removal of unwanted modifiers, the noun phrase is left as an atomic unit, not broken down and analyzed. If a more detailed interpretation is desired, the results can be passed on to a more intelligent or specialized language analysis component.

2.2 Discovery of New Patterns

Patterns (1)–(3) were discovered by hand, by looking through text and noticing the patterns and the relationships indicated. However, to make this approach more extensible, a pattern discovery procedure is needed. Such a procedure, whose first steps are sketched in code after this list, runs as follows:

1. Decide on a lexical relation, R, that is of interest, e.g., meronymy.
2. Derive a list of word pairs from WordNet in which this relation is known to hold, e.g., "house-porch".
3. Extract sentences from a large text corpus in which these terms both occur, and record the lexical and syntactic context.
4. Find the commonalities among these contexts and hypothesize that common ones yield patterns that indicate the relation of interest.
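As a concrete illustration of steps 2 and 3, here is a minimal sketch using NLTK's WordNet interface over a toy in-memory corpus; the chapter's own tooling was written in Common Lisp, so this Python stand-in, including the example synset and sentence, is purely illustrative.

# Sketch of steps 2-3: derive known (hyponym, hypernym) word pairs from
# WordNet, then collect corpus sentences in which both terms co-occur.
# Assumes NLTK is installed and nltk.download('wordnet') has been run.
from nltk.corpus import wordnet as wn

def known_pairs(hypernym_synset, limit=50):
    """Step 2: (hyponym word, hypernym word) pairs known to WordNet."""
    pairs = []
    for hypo in hypernym_synset.hyponyms()[:limit]:
        for lemma in hypo.lemmas():
            for hyper in hypernym_synset.lemmas():
                pairs.append((lemma.name().replace('_', ' '),
                              hyper.name().replace('_', ' ')))
    return pairs

def cooccurrence_contexts(pairs, sentences):
    """Step 3: yield sentences in which both terms of a pair occur."""
    for hypo, hyper in pairs:
        for sent in sentences:
            low = sent.lower()
            if hypo in low and hyper in low:
                yield hypo, hyper, sent

# Toy corpus; a real run would scan millions of sentences.
sentences = ["Dogs such as poodles and corgis shed very little."]
for context in cooccurrence_contexts(known_pairs(wn.synset('dog.n.01')),
                                     sentences):
    print(context)

Step 4, finding the commonalities among the recorded contexts, is exactly the part the chapter notes has not yet been automated.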
This procedure was tried out by hand using just one pair of terms at a time. In the first case, indicators of the hyponymy relation were found by looking up sentences containing "England" and "country". Just this pair yielded new patterns (4) and (5), as well as patterns (1)–(3), which were already known. Next, trying "tank-vehicle" led to the discovery of a very productive pattern, pattern (6). (Note that even though this pattern has an emphatic element, this does not affect the fact that the relation indicated is hyponymic.)

Initial attempts at applying steps 1-3, by hand, to the meronymy relation did not identify unambiguous patterns. However, an automatic version of this algorithm has not yet been implemented, and so the potential strength of step 4 is still untested. There are several ways step 4 might be implemented. One candidate is a filtering method like that of Manning (1993), to find those patterns that are most likely to unambiguously indicate the relation of interest; the transformation-based approach of Brill (1995) may also be appropriate. Alternatively, a set of positive and negative training examples, derived from the collection resulting from steps 1-3, could be fed to a machine-learning categorization algorithm such as C4.5 (Quinlan, 1986). On the other hand, it may be the case that in English only the hyponym relation is especially amenable to this kind of analysis, perhaps due to its "naming" nature, but this question remains open.

2.3 Parsing Issues

For detection of local lexico-syntactic patterns, only a partial parse is necessary. This work makes use of a regular-expression-based noun phrase recognizer (Kupiec, 1993), which builds on the output of a part-of-speech tagger (Cutting et al., 1991). [All code described in this chapter is written in Common Lisp and run on Sun SparcStations.] Initially, a concordance tool based on the Text DataBase (TDB) system (Cutting, Pedersen, and Halvorsen, 1991) is used to extract sentences that contain the lexical items in the pattern of interest, e.g., "such as" or "or other". Next, the noun phrase recognizer is run over the resulting sentences, and their positions with respect to the lexical items in the pattern are noted. For example, for pattern (3), the noun phrases that directly precede "or other" are recorded as candidate hyponyms, and the noun phrase following these lexical items is the potential hypernym. [In earlier work (Hearst, 1992) a more general constituent analyzer was used.] Thus it is not necessary to parse the entire sentence; instead, just enough local context is examined to ensure that the nouns in the pattern are isolated, although some parsing errors do occur.

Delimiters such as commas are important for making boundary determinations. One boundary problem that arises involves determining the referent of a prepositional phrase. In the majority of cases the final noun in a prepositional phrase that precedes "such as" is the hypernym of the relation. However, there are a fair number of exceptions, as can be seen by comparing (S2a) and (S2b):

(S2a) The component parts of flat-surfaced furniture, such as chests and tables, ...

(S2b) A bearing is a structure that supports a rotating part of a machine, such as a shaft, axle, spindle, or wheel.

In (S2a) the nouns in the hyponym positions modify "flat-surfaced furniture", the final noun of the prepositional phrase, while in (S2b) they modify "a rotating part". So it is not always correct to assume that the noun directly preceding "such as" is the full hypernym if it is preceded by a preposition. It would be useful to perform analyses to determine modification tendencies in this situation, but a less analysis-intensive approach is simply to discard sentences in which an ambiguity is possible.

Pattern type (4) requires the full noun phrase corresponding to the hypernym "other important civic buildings". This illustrates a difficulty that arises from using free text as the data source, as opposed to a dictionary: often the form in which a noun phrase occurs is not the form that should be recorded. For example, nouns frequently occur in their plural form but should be recorded in their singular form (although not always; for example, the algorithm finds that "cards" is a kind of "game", a relation omitted from WordNet 1.5, although "card game" is present). Adjectival quantifiers such as "other" and "some" are usually undesirable and can be eliminated in most cases without making the statement of the hyponym relation erroneous. Comparatives such as "important" and "smaller" are usually best removed, since their meaning is relative and dependent on the context in which they appear.

The amount of modification desired depends on the application for which the lexical relations will be used. For building up a basic, general-domain thesaurus, single-word nouns and very common compounds are most appropriate. For a more specialized domain, more modified terms have their place. For example, noun phrases in the medical domain often have several layers of modification which should be preserved in a taxonomy of medical terms.
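To make the pattern machinery concrete, here is a minimal matcher for patterns (1a) and (3); the crude regular-expression noun phrase approximation below (at most two words) is my own stand-in for the Kupiec-style noun phrase recognizer the chapter actually uses.

# Minimal matcher for pattern (1a) "NP0 such as NP1, ..., and/or NPi"
# and pattern (3) "NP1, ..., or other NP0". The NP regex is a crude
# two-word approximation, not a real noun phrase recognizer.
import re

NP = r"(?:[A-Za-z-]+ )?[A-Za-z-]+"

SUCH_AS = re.compile(
    rf"({NP}),? such as ((?:{NP}, )*{NP}(?: (?:and|or) {NP})?)")
OR_OTHER = re.compile(
    rf"((?:{NP}, )*{NP}) or other ({NP})")

def split_nps(conjunction):
    return [np.strip() for np in re.split(r", | and | or ", conjunction)
            if np.strip()]

def extract_hyponyms(sentence):
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym, hyponyms = match.group(1), match.group(2)
        pairs += [(h, hypernym) for h in split_nps(hyponyms)]
    for match in OR_OTHER.finditer(sentence):
        hyponyms, hypernym = match.group(1), match.group(2)
        pairs += [(h, hypernym) for h in split_nps(hyponyms)]
    return pairs

print(extract_hyponyms("red algae, such as Gelidium"))
# [('Gelidium', 'red algae')]
print(extract_hyponyms("bruises, broken bones or other injuries"))
# [('bruises', 'injuries'), ('broken bones', 'injuries')]

The prepositional-phrase ambiguity of (S2a)/(S2b) is exactly what such a shallow matcher cannot resolve, which is why the chapter suggests simply discarding ambiguous sentences.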
3 Some Results

In an earlier discussion of this acquisition method (Hearst, 1992), hyponymy relations between simple noun phrases were extracted from Grolier's Encyclopedia (Grolier, 1990) and compared to the contents of an early version of WordNet (Version 1.1). The acquisition method found many useful relations that had not yet been entered into the network (they have all since been added). Relations derived from encyclopedia text tend to be somewhat prototypical in nature, and should in general correspond well to the kind of information that lexicographers would expect to enter into the network.

To further explore the behavior of the acquisition method, this section examines results of applying the algorithm to six months' worth of text from The New York Times. When comparing a result hyponym(N0, N1) to the contents of WordNet's noun database, three kinds of outcomes are possible (a sketch of this check in code follows the list):

• Both N0 and N1 are in WordNet, and the relation hyponym(N0, N1) is already in the database (possibly through transitive closure).

• Both N0 and N1 are in WordNet, and the relation hyponym(N0, N1) is not (even through transitive closure); a new hyponym link is suggested.

• One or both of N0 and N1 are not present; these noun phrases and the corresponding hyponym relation are suggested.
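A minimal sketch of that three-way check, using NLTK's WordNet interface; NLTK ships a later WordNet than the 1.5 database used in the chapter, so individual verdicts may differ from those reported below.

# Classify a candidate hyponym(n0, n1) against WordNet: relation already
# present (via transitive closure), link missing, or term(s) missing.
from nltk.corpus import wordnet as wn

def classify(n0, n1):
    hypo_synsets = wn.synsets(n0, pos=wn.NOUN)
    hyper_synsets = set(wn.synsets(n1, pos=wn.NOUN))
    if not hypo_synsets or not hyper_synsets:
        return 'term(s) missing'
    for synset in hypo_synsets:
        # closure() walks the hypernym (and instance-hypernym) chains upward.
        ancestors = set(synset.closure(
            lambda s: s.hypernyms() + s.instance_hypernyms()))
        if ancestors & hyper_synsets:
            return 'relation already present'
    return 'new hyponym link suggested'

print(classify('shooting', 'felony'))
print(classify('barley', 'grain'))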
As an example of the second outcome, consider the following sentence and derived relation, automatically extracted from The New York Times:

(S3) Felonies such as stabbings and shootings, ...
=⇒ hyponym("shootings", "felonies"), hyponym("stabbings", "felonies")

The text indicates that a shooting is a kind of felony. Figure 1 shows the portion of the hyponymy relation in WordNet's noun hierarchy (Version 1.5) that includes the synsets felony and shooting. In the current version, despite the fact that there are links between kidnapping, burglary, etc., and felony, there is no link between shooting, murder, etc., and felony or any other part of the crime portion of the network. Thus the acquisition algorithm suggests the addition of a potentially useful link, as indicated by the dotted line. This suggestion may in turn suggest still more changes to the network, since it may be necessary to create a different sense of shooting to distinguish "shooting a can" or "hunting" (which are not necessarily crimes) from "shooting people".

Not surprisingly, the relations found in newspaper text tend to be less taxonomic or prototypical than those found in encyclopedia text. For example, the relation hyponym("milo", "crop") was extracted from The New York Times. WordNet classifies "milo" as a kind of sorghum or grain, but grain is not entered as a kind of crop. The only hyponyms of the appropriate sense of crop in WordNet 1.5 are catch crop, cover crop, and root crop. One could argue that hyponyms of crop should only be refinements on the notion of crop, rather than lists of types of crops. But in many somewhat similar cases, WordNet does indeed list specific instances of a concept as well as refinements on that concept. For example, hyponyms of book include trade book and reference book, which can be seen as refinements of book, as well as Utopia, which is a specific book (by Sir Thomas More). By extracting information directly from text, relations like hyponym("milo", "crop") are at least brought up for scrutiny. It should be easier for a lexicographer to take note of such relations if they are represented explicitly rather than trying to spot them by sifting through huge lists of concordance lines.

[Figure 1: A portion of WordNet Version 1.5 with a new link (dotted line) suggested by an automatically acquired relation: hyponym("shooting", "felony"). Many hyponym links are omitted in order to simplify the diagram.]

Often the import of relations found in newspaper text is more strongly influenced by the context in which they appear than that of relations found in encyclopedia text. Furthermore, they tend more often to reflect subjective judgments, opinions, or metaphorical usages than the more established statements that appear in the encyclopedia. For example, asserting that the movie "Gaslight" is a "classic" might be considered a value judgment (although encyclopedias state that certain actors are stars, so perhaps this isn't so different), and the statement that "AIDS" is a "disaster" might be considered more a metaphorical statement than a taxonomic one.

Tables 1 and 2 show the results of extracting 50 consecutive hyponym relations from six months' worth of The New York Times using pattern (1a), the "such as" pattern. The first group in Table 1 contains those for which the noun phrases and relation are already present in WordNet. The second group corresponds to the situation in which the noun phrases are present in WordNet but the hyponymy relation between them is absent. The third group shows examples in which at least one noun phrase is absent, and so the corresponding relation is necessarily absent as well. In this example these relations all involve proper noun hyponyms.

Table 1: Examples of useful relations suggested by the automatic acquisition method, derived from The New York Times (hypernym: hyponym(s)).

Relation and terms already appear in WordNet 1.5:
  fabric: silk; grain: barley; disorders: epilepsy; businesses: nightclub; crimes: kidnappings; countries: Brazil, India, Israel; vegetables: broccoli; games: checkers; regions: Texas; assets: stocks; jurisdictions: Illinois

Terms appear in WordNet 1.5, relations do not:
  crops: milo; wildlife: deer, raccoons; conditions: epilepsy; conveniences: showers, microwaves; perishables: fruit; agents: bacteria, viruses; felonies: shootings, stabbings; euphemisms: restrictees, detainees; goods: shoes; officials: stewards; geniuses: Einstein, Newton; gifts: liquor; disasters: AIDS; materials: glass, ceramics; partner: Nippon

Relation and term does not appear in WordNet 1.5 (proper nouns):
  companies: Volvo, Saab; institutions: Tufts; airlines: Pan USAir; agencies: Clic, Zoli; companies: Shell
Table 2: Examples of less felicitous relations also derived from The New York Times (hypernym: hyponym(s)).

Does not appear in WordNet 1.5 but is perhaps too general:
  things: exercise; topics: nutrition; things: conservation; things: popcorn, peanuts; areas: Sacramento

Context-specific relations, so probably not of interest:
  Others: Meadowbrook; facilities: Peachtree; categories: drama, miniseries, comedy; classics: Gaslight; generics: Richland

Misleading relations resulting from parsing errors (usually from not detecting the full NP):
  tendencies: aspirin, anticoagulants; competence: Nunn; organization: Bissinger; children: Headstart; titles: Batman; companies: sports; agencies: Vienna; jobs: computer; projects: universities

Some interesting relations are suggested. For example, the only hyponyms of euphemism in WordNet 1.5 are blank, darn, and heck (euphemisms for curse words). Detainee also exists in WordNet, as a hyponym of prisoner, captive, which in turn is a hyponym of unfortunate, unfortunate person. If nothing else, this discovered relation points out that the coverage of euphemisms in WordNet 1.5 is still rather sparse, and also suggests another category of euphemism, namely, government-designated terms that act as such. The final decision on whether or not to classify "detainee" in this way rests with the lexicographers.

Another interesting example is the suggestion of the link between "Einstein" and "genius". Both terms exist in the network (see Figure 2), and "Einstein" is recorded as a hyponym of physicist, which in turn is a hyponym of scientist and then intellectual. One of the hypernyms of genius is also intellectual. Thus the lexicographers have made genius and scientist siblings rather than specifying one to be a hyponym of the other. This makes sense, since not all scientists are geniuses (although whether all scientists are intellectuals is perhaps open to debate as well). The figure also indicates that philosopher is a child of intellectual, and individual philosophers appear as hyponyms of philosopher. Hence it does not seem unreasonable to propose a link between the genius synset and particular intellectuals so known.

Table 2 illustrates some of the less useful discovered relations. The first group lists relations that are probably too general to be useful. For example, various senses of "exercise" are classified as act and event in WordNet 1.5, but "exercise" is described as a "thing" in the newspaper text. Although an action can be considered to be a thing, the ontology of WordNet assumes they have separate originating synsets. A similar argument applies to the "conservation" example.

The next group in Table 2 refers to context-specific relations. For example, there most likely is a facility known as the "Peachtree facility", but this is important only to a very particular domain. Similarly, the relationship between "others" and "Meadowbrook" most likely makes sense when the reference of "others" is resolved, but not out of context.

[Figure 2: A portion of WordNet Version 1.5 with a new link (dotted line) suggested by an automatically acquired relation: hyponym("Einstein", "genius"). Many hyponym links are omitted in order to simplify the diagram.]
On the other hand, though the hyponym("Gaslight", "classic") relation is inappropriate in this exact form, it may well be suggesting a useful omitted relation, "classic films". Of course, the main problem with such a relation is the subjective and time-sensitive nature of its membership.

Most of the terms in WordNet's noun database are unmodified nouns or nouns with a single modifier. For this reason, the analysis presented here only extracts relations consisting of unmodified nouns in both the hypernym and hyponym roles (although determiners are allowed, as well as a very small set of quantifier adjectives: "some", "many", "certain", and "other"). This restriction is also useful because, as touched on above, the procedure for determining which modifiers are important is not straightforward. Furthermore, for the purposes of evaluation, in most cases it is easier to judge the correctness of the classification of unmodified nouns than modified ones.

Although not illustrated in this example, the algorithm also produces many suggestions of multi-word term relations, e.g., hyponym("data base search", "disk-intensive operation").

To further elucidate the performance of the algorithm, 200 consecutive instances of pattern (3), the "or other" pattern, were extracted and evaluated by hand. Sentences that simply contained the two words "or other" (or "or others") were extracted initially, regardless of syntactic context. The most specific form of the noun phrase was looked up first; if it did not occur in WordNet, then the leftmost word in the phrase was removed and the phrase that remained was looked up, and the process was repeated until only one word was left in the phrase. In each case, the words in the phrase were first looked up as is, and then reduced to their root form using the morphological analysis routines bundled with WordNet.

The results were evaluated in detail by hand and placed into one of eight categories, as shown in Table 3. The judgments of whether the relations were "very good" or "pretty good" are meant to approximate the judgment that would be made by a WordNet lexicographer about whether or not to place the relation into WordNet. Of course this evaluation is very subjective, so an attempt was made to err on the side of strictness. Using this conservative evaluation, 104 out of the 166 eligible sentences (those which had the correct syntax and were not repeats of already listed relations), or 63%, were either already present or strong candidates for inclusion in WordNet.

Table 3: Results from 200 sentences containing the terms "or other" (frequency: explanation).

  38: Some version of the NPs and the corresponding relation were found in WordNet
  31: The relation did not appear in WordNet and was judged to be a very good relation (in some cases both NPs were present, in some cases not)
  35: The relation did not appear in WordNet and was judged to be at least a pretty good relation (in some cases both NPs were present, in some cases not)
  19: The relation was too general
  8: The relation was too subjective, or contained unresolved or inappropriate referents (e.g., "these")
  34: The NPs involved were too long, too specific and/or too context-specific
  12: The relations were repeats of cases counted above
  22: The sentences did not contain the appropriate syntactic form (e.g., "all of the above, none of the above, or other")

Table 4: Examples of good relations found using pattern (3) that do not appear in WordNet 1.5. An asterisk indicates that the noun phrase does not appear in WordNet 1.5. Each pair is read hyponym: hypernym.

  fish: wildlife; birth defect*: health problem; money laundering*: crime; strobe*: light; food: aid; grandchild: heir; deduction: write-off; name: ID; takeover: reorganization; shrimp: shellfish; canker sore*: lesion
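As a quick sanity check, the category counts in Table 3 do reproduce the reported 63%:

# Reproduce the 63% figure from Table 3: 200 sentences, minus repeats
# (12) and wrong syntactic form (22), leaves 166 eligible; categories
# 38 + 31 + 35 = 104 were present in WordNet or judged good.
total = 200
eligible = total - 12 - 22          # 166
good = 38 + 31 + 35                 # 104
print(good, eligible, f"{good / eligible:.1%}")  # 104 166 62.7%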
As seen in the examples from New York Times text, many of the suggested relations are more encyclopedic, and less obviously valid as lexico-semantic relations. Yet as WordNet continues to grow, the lexicographers may choose to include such items.

In summary, these results suggest that automatically extracted relations can be of use in augmenting WordNet. As mentioned above, the algorithm has the drawback of not guaranteeing complete coverage of the parts of the lexicon that require repair, but it can be argued that some repair is better than none. When using newspaper text, which tends to be more unruly than well-groomed encyclopedia text, a fair number of uninteresting relations are suggested, but if a lexicographer or knowledge engineer makes the final decision about inclusion, the results should be quite helpful for winnowing out missing relations.

4 Related Work

There has been extensive work on the use of partial parsing for various tasks in language analysis. For example, Kupiec (1993) extracts noun phrases from encyclopedia texts in order to answer closed-class questions, and Jacobs and Rau (1990) use partial parsing to extract domain-dependent knowledge from newswire text. In this section, the discussion of related work will focus on efforts to automatically extract lexical relation information, rather than general knowledge.

4.1 Hand-coded and Knowledge-Intensive Approaches

There have been several knowledge-intensive approaches to automated lexical acquisition. Hobbs (1984), when describing the procedure his group followed in order to build a lexicon/knowledge base for an NLP analyzer of a medical text, noted that much of the work was done by looking at the relationships implied by the linguistic presuppositions in the target texts. One of his examples,
WordNet. An electronic lexical database. Edited by Christiane Fellbaum, with a preface by George Miller. Cambridge, MA: MIT Press; 1998. 422 p. $50.00

This is a landmark book. For anyone interested in language, in dictionaries and thesauri, or in natural language processing, the introduction, Chapters 1-4, and Chapter 16 are must reading. (Select other chapters according to your special interests; see the chapter-by-chapter review.) These chapters provide a thorough introduction to the preeminent electronic lexical database of today in terms of accessibility and usage in a wide range of applications. But what does that have to do with digital libraries? Natural language processing is essential for dealing efficiently with the large quantities of text now available online: fact extraction and summarization, automated indexing and text categorization, and machine translation. Another essential function is helping the user with query formulation through synonym relationships between words and hierarchical and other relationships between concepts. WordNet supports both of these functions and thus deserves careful study by the digital library community.

The introduction and Part I, which take almost a third of the book, give a very clear and very readable overview of the content, structure, and implementation of WordNet, of what is in WordNet and what is not; these chapters are meant to replace Five papers on WordNet (ftp:///pub/WordNet/5papers.ps), which by now are partially outdated. However, I did not throw out my copy of the Five papers; they give more detail and interesting discussions not found in the book. Chapter 16 provides a very useful complement; it includes a very good overview of WordNet relations (with examples and statistics) and describes possible extensions of content and structure.

Part II, about 15% of the book, describes "extensions, enhancements, and new perspectives on WordNet", with chapters on the automatic discovery of lexical and semantic relations through analysis of text, on the inclusion of information on the syntactic patterns in which verbs occur, and on formal mathematical analysis of the WordNet structure.

Part III, about half the book, deals with representative applications of WordNet, from creating a "semantic concordance" (a text corpus in which words are tagged with their proper sense), to automated word sense disambiguation, to information retrieval, to conceptual modeling. These are good examples of pure knowledge-based approaches and of approaches where statistical processing is informed by knowledge from WordNet.

As one might expect, the papers in this collection (which are reviewed individually below) are of varying quality; Chapters 5, 12, and 16 stand out. Many of the authors pay insufficient heed to the simple principle that the reader needs to understand the overall purpose of work being discussed in order to best assimilate the detail. Repeatedly, one finds small differences in performance discussed as if they meant something, when it is clear that they are not statistically significant (or at least it is not shown that they are), a problem that afflicts much of information retrieval and related research. The application papers demonstrate the many uses of WordNet but also make a number of suggestions for expanding the types of information included in WordNet to make it even more useful. They also uncover weaknesses in the content of WordNet by tracing performance problems to their ultimate causes.
The papers are tied together into a coherent whole, and the unity of the book is further enhanced by the presence of an index, which is often missing from such collections. One would wish the index were a bit more detailed. The reader wishing for a comprehensive bibliography on WordNet can find it at /~josephr/wn-biblio.html (and more information on WordNet can be found at /~wn).

The book deals with WordNet 1.5, while WordNet 1.6 was released at just about the same time. For the papers on applications it could not be otherwise, but the chapters describing WordNet might have been updated to include the 1.6 enhancements. One would have wished for a chapter, or at least a prominent mention, of EuroWordNet (www.let.uva.nl/~ewn), a multilingual lexical database using WordNet's concept hierarchy.

It would have been useful to enforce consistent use of at least key terms throughout the papers; for example, the introduction says "WordNet makes the commonly accepted distinction between conceptual-semantic relations, which link concepts, and lexical relations, which link individual words" [emphasis added], yet in Chapter 1 George Miller states that "the basic semantic relation in WordNet is synonymy" [emphasis added], and in Chapter 5 Marti Hearst uses the term lexical relationships to refer to both semantic relationships and lexical relationships. Priss (p. 186) confuses the issue further in the following statements: "Semantic relations, such as meronymy, synonymy, and hyponymy, are according to WordNet terminology relations that are defined among synsets. They are distinguished from lexical relations, such as antonymy, which are defined among words and not among synsets." [Emphasis added.] Synonymy is, of course, a lexical relation, and terminology relations are relations between words or between words and concepts, but not relations between concepts.

This terminological confusion is indicative of a deeper problem. The authors of WordNet vacillate somewhat between a position that stays entirely in the plane of words and deals with conceptual relationships in terms of the relationships between words, and the position, commonly adopted in information science, of separating the conceptual plane from the terminological plane.

What follows is a chapter-by-chapter review. By necessity, this often gets into a discussion of how things are done in WordNet, rather than just staying with the quality of the description per se.

Part I. The lexical database (p. 1 - 127)

George Miller's preface is well worth reading for its account of the history and background of WordNet. It also illustrates the lack of communication between fields concerned with language: thesaurus makers could learn much from WordNet, and WordNet could learn much from thesaurus makers. The introduction gives a concise overview of WordNet and a preview of the book. It might make the structure of WordNet explicit more clearly: the smallest unit is the word/sense pair identified by a sense key (p. 107); word/sense pairs are linked through WordNet's basic relation, synonymy, which is expressed by grouping word/sense pairs into synonym sets or synsets; each synset represents a concept, which is often explained through a brief gloss (definition). Put differently, WordNet includes the relationship word W designates concept C, which is coded implicitly by including W in the synset for C. (This is made explicit in Table 16.1.) Two words are synonyms if they have a designates relationship to the same concept. Synsets (concepts) are the basic building blocks for hierarchies and other conceptual structures in WordNet.
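The structure just described can be explored directly; the following minimal sketch uses NLTK's WordNet interface (which ships a much later WordNet than the 1.5 version under review, so sense inventories will differ):

# Explore the word -> sense -> synset structure the review describes.
# Requires: pip install nltk; then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

# Each synset groups the word/sense pairs that designate one concept,
# usually with a short gloss (definition).
for synset in wn.synsets('apple', pos=wn.NOUN):
    words = [lemma.name() for lemma in synset.lemmas()]
    print(synset.name(), words, '-', synset.definition())
    print('  hypernyms:', [h.name() for h in synset.hypernyms()])

# Two words are synonyms if they designate the same concept,
# i.e., share at least one synset.
def are_synonyms(word1, word2):
    return bool(set(wn.synsets(word1)) & set(wn.synsets(word2)))

print(are_synonyms('car', 'automobile'))  # True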
The following three chapters each deal with a type of word. In Chapter 1, Nouns in WordNet, George Miller presents a cogent discussion of the relations between nouns: synonymy, a lexical relation, and the semantic relations hyponymy (has narrower concept) and meronymy (has part). Hyponymy gives rise to the WordNet hierarchy. Unfortunately, it is not easy to get a well-designed display of that hierarchy. Meronymy (has part) / holonymy (is part of) are always subject to confusion, not completely avoided in this chapter, which stems from ignoring the fact that airplane has-part wing really means "an individual object 1 which belongs to the class airplane has-part individual object 2 which belongs to the class wing". While it is true that bird has-part wing, it is by no means the same wing or even the same type of wing. Thus the strength of a relationship established indirectly between two concepts due to a has-part relationship to the same concept may vary widely. Miller sees meronymy and hyponymy as structurally similar, both going down. But it is equally reasonable to consider a wing as something more general than an airplane and thus construe the has-part relationship as going up, as is done in Chapter 13.

Chapter 2, Modifiers in WordNet by Katherine J. Miller, describes how adjectives and adverbs are handled in WordNet. It is claimed that nothing like hyponymy/hypernymy is available for adjectives. However, especially for information retrieval purposes, the following definition makes sense: Adj A has narrower term (NT) Adj B if A applies whenever B applies and A and B are not synonyms (the scope of A includes the scope of B and more). For example, red NT ruby (and conversely, ruby BT red) or large NT huge. To the extent that WordNet includes specific color values, their hierarchy is expressed only for the noun form of the color names. Hierarchical relationships between adjectives are expressed as similarity. The treatment of antonymy is somewhat tortured to take account of the fact that large vs. small and big vs. little are antonym pairs, but *large vs. little or *prompt vs. slow are not. In general, when speakers express semantic opposition, they do not use just any term for each of the opposite concepts, but apply lexical selection rules. The simplest way to deal with this problem is to establish an explicit semantic relationship of opposition between concepts and separately record antonymy as a lexical relationship between words (in their appropriate senses). Participial adjectives are maintained in a separate file and have a pointer to the verb unless they fit easily into the cluster structure of descriptive adjectives "rather than" having a pointer to the verb. Why a participial adjective in the cluster structure cannot also have a pointer to the verb is not explained.

In Chapter 3, A semantic network of English verbs, Christiane Fellbaum presents a carefully reasoned analysis of the semantic relations between verbs (as presented in WordNet and in alternate approaches) and of the other information given for verbs in WordNet, especially sentence structures. Much of this discussion is based on results from psycholinguistics. The relationships between verbs are all forms of a broadly defined entailment relationship: troponymy establishes a verb hierarchy (verb A has the troponym verb B if B is a particular way of doing A, corresponding to hyponymy between nouns; the reverse relation is verb B has hypernym verb A); entailment in a more specific sense (to snore entails to sleep); cause; and backward presupposition (one must know before one can forget; not given in WordNet).
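These verb relations can likewise be inspected programmatically; again a small sketch against NLTK's newer WordNet, so the exact links may differ from the 1.5 database the review describes:

# WordNet's verb relations via NLTK: troponyms appear as hyponyms of
# the verb synset; entailment and cause have their own accessors.
from nltk.corpus import wordnet as wn

snore = wn.synset('snore.v.01')
print(snore.entailments())      # entailment: snoring entails sleeping

walk = wn.synset('walk.v.01')
print(walk.hyponyms()[:5])      # troponyms: particular ways of walking

kill = wn.synset('kill.v.01')
print(kill.causes())            # cause: killing causes dying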
More refined relationship types could be defined but are not for WordNet, since they do not seem to be psychologically salient and would complicate the interface. The last is not a good reason, since detailed relationship types could be coded internally and mapped to a more general set to keep the interface simple; the additional information would be useful for specific applications. In Section 3.3.1.3 on basic level categories in verbs, the level numbers are garbled (talk is both level L+1 and level L), making the argument unclear.

In Chapter 4, Design and implementation of the WordNet lexical database and searching software, Randee I. Tengi lays out the technical issues. While it is hard to argue with a successful system, one might observe that this is not the most user-friendly system this reviewer has seen. The lexicographers have to memorize a set of special characters that encode relationship types (for example, ~ for hyponym, @ for hypernym); {apple, edible_fruit, @} means: apple has the hypernym edible_fruit. The interface is on a par with most online thesaurus interfaces, that is to say, poor. In particular, one cannot see a simple outline of the concept hierarchy; for many thesauri that is available in printed form. P. 108 says "many relations being reflexive" when it should say reciprocal (such as hyponym / hypernym); a reflexive relation has a precise definition in mathematics (for all a, a R a holds).

Chapter 16, Knowledge processing on an extended WordNet by Sanda M. Harabagiu and Dan I. Moldovan, is discussed here because it should be read here. Table 16.1 gives the best overview of the relations in WordNet, with a formal specification of the entities linked, examples, properties (symmetry, transitivity, reverse of), and statistics. Table 16.2 gives further types of relationships that can be derived from the glosses (definitions) provided in WordNet. Still further relationships can be derived by inferences on relationship chains. The application of this extended WordNet to knowledge processing (for example, marker propagation, testing a text for coherence) is less well explained.

Part II. Extensions, enhancements, and new perspectives on WordNet (p. 129 - 196)

Chapter 5, Automated discovery of WordNet relations, by Marti A. Hearst is a well-written account of using Lexico-Syntactic Pattern Extraction (LSPE) to mine text for hyponym relationships. This is important to reduce the enormous workload in the strictly intellectual compilation of the large lexical databases needed for NLP and retrieval tasks. The following examples illustrate two sample patterns (the implied hyponym relationships should be obvious):

red algae, such as Gelidium
bruises, broken bones, or other injuries

Patterns can be hand-coded and/or identified automatically. The results presented are promising. In the discussion of automatic acquisition from corpora (p. 146, under Related work), only fairly recent work from computational linguistics is cited. Resnik, in Chapter 10, cites work back to 1957.
There is also early work done in the context of information retrieval, for example, Giuliano 1965; for more references on early work see Soergel 1974, Chapter H.

Chapter 6, Representing verb alternations in WordNet, by Karen T. Kohl, Douglas A. Jones, Robert C. Berwick, and Naoyuki Nomura discusses a format for representing syntactic patterns of verbs in WordNet. It is clearly written for linguists and a heavy read for others. It is best to look at the appendix first to get an idea of the ultimate purpose.

Chapter 7, The formalization of WordNet by methods of relational concept analysis, by Uta E. Priss is somewhat of an overkill in formalization. Formal concept analysis might be a useful method, but the starkly abbreviated presentation given here is not enough to really understand it. The paper does reveal a proper analysis of meronymy, once one penetrates the formalism and the notation. The WordNet problems identified on p. 191-195 are quite obvious to the experienced thesaurus builder once the hierarchy is presented in a clear format, and could in any event be detected automatically based on simple componential analysis.

Part III. Applications of WordNet (p. 197 - 406)

Chapter 8, Building semantic concordances, by Shari Landes, Claudia Leacock, and Randee I. Tengi, and Chapter 9, Performance and confidence in a semantic annotation task, by Christiane Fellbaum, Joachim Grabowski, and Shari Landes, both deal with manually tagging each word in a corpus (in the example, 103 passages from the Brown corpus and the complete text of Crane's The red badge of courage) with the appropriate WordNet sense. This is useful for many purposes: detecting new senses, obtaining statistics on sense occurrence, testing the effectiveness of correct disambiguation for retrieval or clustering of documents, and training disambiguation algorithms. The tagged corpus is available with WordNet.

Chapter 10, WordNet and class-based probabilities, by Philip Resnik, presents an interesting and discerning discourse on "how does one work with corpus-based statistical methods in the context of a taxonomy?" or, put differently, how does one exploit the knowledge inherent in a taxonomy to make statistical approaches to language analysis and processing (including retrieval) more effective. However, the real usefulness of the statistical approach does not appear clearly until Section 10.3, where the approach is applied to the study of selectional preferences of verbs (another example of mining semantic and lexical information from text), and thus the reader lacks the motivation to understand the sophisticated statistical modeling. The reader deeply interested in these matters might prefer to turn to the fuller version in Resnik's thesis (access from /~Resnik).

Chapter 11, Combining local context and WordNet similarity for word sense identification, by Claudia Leacock and Martin Chodorow, and Chapter 13, Lexical chains as representations of context for the detection and correction of malapropisms, by Graeme Hirst and David St-Onge, present methods for sense disambiguation. Chapter 11 uses a statistical approach, computing from a sense-tagged corpus (the training corpus) three distributions that measure the associations of a word in a given sense with part-of-speech tags, open-class words, and closed-class words. These distributions are then used to predict the sense of a word in arbitrary text.
The key idea of the paper is very close to the basic idea of Chapter 10: exploit knowledge to improve statistical methods, in this case using knowledge on word sense similarity gleaned from WordNet to extend the usefulness of the training corpus: the distributions for a sense A1 of word A are used to disambiguate word B, one of whose senses is similar to A1. Results of experiments are inconclusive (they lack statistical significance), but the idea seems definitely worth pursuing. Hirst and St-Onge take an entirely different approach. Using semantic relations from WordNet they build what they call lexical chains (semantic chain would be a better term) of words that occur in the text. Initially there are many competing chains, corresponding to different senses of the words occurring in the text, but eventually some chains grow long and others do not, and the word senses in the long chains are selected. (Instead of chains one might use semantic networks.) Various patterns of direct and indirect relationships are defined as allowable chain links, and each pattern has a given strength. In the paper, this method is used to detect and correct malapropisms, defined here as misspellings that are valid words (and therefore not detected by a spell checker), for example, "Much of the data is available toady electronically" or "Lexical relations very in number within the text". (Incidentally, this sense of malapropism is not found in standard dictionaries, including WordNet; in a book on WordNet, one should use language carefully.) The problem of detecting misspellings that are themselves words is the same as the homonym problem: if one accepts, for example, both today and toady as variations of the word today and of the word toady, then [today, toady] becomes a homonym having all the senses of the two words that are misspellings of each other; that is to say, the occurrence of either word in the text may indicate any of the senses of both words. Finding the sense that fits the context selects the form correctly associated with that sense. This is an ingenious application of sense disambiguation methods to spell checking.

Chapter 14, Temporal indexing through lexical chaining, by Reem Al-Halimi and Rick Kazman uses a very similar approach to indexing text, an idea that seems quite good (as opposed to the writing). The texts in question are transcripts from video conferences, but that is tangential and their temporal nature is not considered in the paper. Instead of chains they build lexical trees (again, semantic tree would be a better name, and why not use more general semantic networks?). A text is represented by one or more trees (presumably each tree representing a logical unit of the text). Retrieval then is based on these trees as target objects. A key point in this method, not stressed in the paper, is the word sense disambiguation that comes about in building the semantic trees. The method appears promising, but no retrieval test results are reported.

In Chapter 12, Using WordNet for text retrieval, Ellen M. Voorhees reports on retrieval experiments using WordNet for query term expansion and word sense disambiguation. No conclusions as to the usefulness of these methods should be drawn from the results. The algorithms for query term expansion and for word sense disambiguation performed poorly, partially due to problems in the algorithms themselves and partially due to problems in WordNet, so it is not surprising that retrieval results were poor.
Later work on improved algorithms, including work by Voorhees, is cited.

In Chapter 15, COLOR-X: Using knowledge from WordNet for conceptual modeling [in software engineering], J. F. M. Burg and R. P. van de Riet describe the use of WordNet to clarify word senses in software requirement documents and for detecting semantic relationships between entity types and relationship types in an E-R model. The case they make is not very convincing. They also bring the art of making acronym soup to new heights, misleading the reader in the process (the chapter has nothing to do with colors).

All in all, this is a useful collection of papers and a rich source of ideas.

References

Giuliano, Vincent E. 1965. The interpretation of word associations. In: Stevens, Mary E.; Giuliano, Vincent E.; Heilprin, Lawrence B., eds. Statistical association methods for mechanical documentation. Washington, DC: Government Printing Office; 1965 December. 261 p. (National Bureau of Standards Miscellaneous Publication 269)

Soergel, Dagobert. 1974. Indexing languages and thesauri: Construction and maintenance. Los Angeles, CA: Melville; 1974. 632 p., 72 fig., ca 850 ref. (Wiley Information Science Series)