Ontology-driven Information Retrieval in FF-Poirot
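Word Semantic Similarity Computation Based on HowNet
Our work mainly includes: 1. Studying the syntax of the knowledge description language in HowNet, understanding the relations among the multiple sememes used to describe a word sense, and distinguishing the roles they play in word similarity computation; we adopt a more
Word Semantic Similarity Computation Based on HowNet (1)
Word Similarity Computing Based on How-net
Qun LIU (*), Sujian LI (+)
(1) This research is supported by the National Key Basic Research Program (973), project numbers G1998030507-4 and G1998030510.
(*) Institute of Computational Linguistics, Peking University & Institute of Computing Technology, Chinese Academy of Sciences. E-mail: liuqun@
(+) Institute of Computing Technology, Chinese Academy of Sciences. E-mail: lisujian@
Abstract
Word sense similarity computation is widely used in many fields, such as information retrieval, information extraction, text classification, word sense disambiguation, and example-based machine translation. There are two basic approaches: one based on world knowledge (an ontology) or some classification system (a taxonomy), and one based on statistical context vector space models. Each has its advantages and disadvantages. HowNet is a fairly detailed semantic knowledge dictionary that has attracted wide attention. However, HowNet represents the semantics of a word in a multidimensional form, which makes word similarity hard to compute. In this respect it differs from WordNet and Tongyici Cilin. In WordNet and Tongyici Cilin, all semantic items of the same kind (WordNet synsets or Cilin word groups) form a tree, so the distance between two semantic items is simply the distance between the corresponding nodes of the tree. In HowNet, word semantic similarity computation faces the following problems:
1. The semantic description of each word consists of multiple sememes;
2. The sememes in a word's semantic description are not of equal standing; they stand in complex relations to one another, expressed through a dedicated knowledge description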
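The tree-based distance mentioned above for WordNet and Tongyici Cilin can be illustrated with a minimal sketch. The taxonomy below is a hypothetical toy tree, not data from either resource, and the conversion from distance to similarity (alpha / (distance + alpha)) together with the value of alpha is only an assumed, commonly used form.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}


def path_to_root(node: str) -> List[str]:
    """Return the chain of nodes from `node` up to the root."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a: str, b: str) -> int:
    """Number of edges on the path between a and b in the tree."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a))}
    for j, n in enumerate(path_to_root(b)):
        if n in steps_from_a:  # first common ancestor
            return steps_from_a[n] + j
    raise ValueError("nodes are not in the same tree")


def similarity(a: str, b: str, alpha: float = 1.6) -> float:
    """Assumed conversion: similarity shrinks as tree distance grows."""
    return alpha / (tree_distance(a, b) + alpha)
```

With this toy tree, "dog" and "cat" are two edges apart while "dog" and "tree" are four edges apart, so the first pair receives the higher similarity.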
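National Standard for Knowledge Management (ppt, 86 slides)
Dublin Core; HL7 (metadata describing medical network resources); educational resource metadata; machine-readable cataloging (MARC)
XML
Nature
A standard published by the W3C in February 1998; a simplified subset of SGML; the Extensible Markup Language
Features
Largely solves problems such as HTML's inability to express data content; allows organizations and individuals to define markup suited to their own needs
A DTD can be used to check whether the structure of an XML document is correct
For example, to describe a group of <表> (list) elements, each of which may contain several <项> (item) elements
The DTD should contain the declarations: <!ELEMENT 表 (项)+> <!ELEMENT 项 (#PCDATA)>
A resulting list: <表><项>管乐</项><项>弦乐</项><项>器乐</项></表>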
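As a rough illustration of what such a structure check involves, the sketch below tests a document against the content model of the two declarations by hand. It is an assumption-laden stand-in: Python's standard library parser only checks well-formedness and does not validate against a DTD, so the function `conforms` is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET


def conforms(xml_text: str) -> bool:
    """Hand-written check of: 表 contains one or more 项; 项 contains text only."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not even well-formed XML
    if root.tag != "表" or len(root) == 0:
        return False  # wrong root, or no 项 children at all
    if root.text and root.text.strip():
        return False  # character data directly inside 表 is not allowed
    for child in root:
        if child.tag != "项" or len(child) != 0:
            return False  # only 项 children, and 项 has no sub-elements
        if child.tail and child.tail.strip():
            return False  # no stray text between 项 elements
    return True


GOOD = "<表><项>管乐</项><项>弦乐</项><项>器乐</项></表>"
BAD = "<表><项>管乐</项><项>弦乐</项><项>器乐</项><表>"  # final tag is not a closing tag
```

Here `conforms(GOOD)` holds, while `BAD` fails at the parsing step because its last tag opens a new element instead of closing the list.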
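DTD reference
Mixed element type
<!ELEMENT element-name (#PCDATA | child-element-1 | child-element-2 | …)*> The element may contain text; child elements may optionally be inserted between runs of text, with no restriction on their order or number of occurrences
XML Schema
Shortcomings of DTD
Uses non-XML syntax rules; does not support data types; poor extensibility
Collections
Comparison of XML and HTML
The three elements of a document
Data, structure, and presentation
HTML
Presentation is embedded in the data; the output format has to be kept in mind throughout authoring; creating documents involves much repetitive work; semantic information is hard to extract
XML
The display format is separated from the data content and stored in a style sheet file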
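Summary of the Concept and Applications of Ontology
I. Definition of Ontology
An ontology is a conceptual-model modeling tool that can describe information systems at the semantic and knowledge levels.
An ontology is an explicit, formal, shareable specification of a conceptual model.
This definition has four aspects: conceptualization, explicit, formal, and shared.
Conceptualization: a model obtained by abstracting the relevant concepts of some phenomena in the objective world.
The meaning expressed by a conceptual model is independent of any particular state of the environment.
Explicit: the concepts used, and the constraints on their use, are all explicitly defined.
Formal: the ontology is machine-readable (that is, it can be processed by a computer).
Shared: the ontology embodies commonly accepted knowledge and reflects the set of concepts recognized in the relevant domain; that is, an ontology captures the consensus of a community rather than of an individual.
The goal of an ontology is to capture the knowledge of a domain, provide a common understanding of that knowledge, determine the vocabulary commonly accepted in the domain, and give explicit definitions of these words (terms) and of the relations among them at different levels of formalization.
Supplement 1: In computer science and information science, an ontology is, in theory, "a formal, explicit and detailed specification of a shared conceptualization".
An ontology provides a shared vocabulary, that is, the object types or concepts that exist in a particular domain together with their properties and interrelations. Put differently, an ontology is a special kind of terminology set that is structured and better suited to use in computer systems; or, again, an ontology is a formal representation of a set of concepts in a particular domain and of the relations among them.
II. Modeling Primitives of Ontology
Perez et al. hold that an ontology can be organized as a taxonomy, and they identify five basic modeling primitives.
These primitives are: classes, relations, functions, axioms, and instances.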
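The five primitives can be pictured as fields of a small data structure. The sketch below is only an illustration under stated assumptions: the class `Ontology`, its field layout, and the sample entries ("Person", "Paper", "authorOf", "alice") are all hypothetical and do not come from Perez et al.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Ontology:
    # classes: the concepts of the domain
    classes: Set[str] = field(default_factory=set)
    # relations: (relation name, domain class, range class)
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)
    # functions: relations whose last element is determined by the others
    functions: Dict[str, Callable[..., object]] = field(default_factory=dict)
    # axioms: statements that must always hold, modeled here as predicates
    axioms: List[Callable[["Ontology"], bool]] = field(default_factory=list)
    # instances: individual -> class it belongs to
    instances: Dict[str, str] = field(default_factory=dict)

    def is_consistent(self) -> bool:
        """True when every axiom holds for the current content."""
        return all(axiom(self) for axiom in self.axioms)


onto = Ontology()
onto.classes |= {"Person", "Paper"}
onto.relations.add(("authorOf", "Person", "Paper"))
onto.functions["age"] = lambda birth_year, now: now - birth_year
onto.instances["alice"] = "Person"
# Axiom: every instance must belong to a declared class.
onto.axioms.append(lambda o: all(c in o.classes for c in o.instances.values()))
```

Adding an instance of an undeclared class would make `is_consistent()` return False, which is the kind of constraint an axiom is meant to express.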
Research on a Problem-Oriented Academic Literature Search Engine
Wan Liancheng
Abstract: The usage, query, and retrieval models of academic search engines have not yet been studied in depth. This paper examines the distribution of queries received by academic search engines and proposes a query recognition method. It analyzes academic search queries and divides them into navigational queries and informational queries. A navigational query is defined as one in which the user is looking for a specific academic document. Under this definition, a machine learning method with a new set of features is introduced to identify such queries, and Gradient Boosted Trees (GBT) are used to train a classifier that recognizes navigational queries. The results show a precision of 0.68 at a recall of 0.68, with an F-score of 0.677.
Journal: Electronic Science and Technology (《电子科技》)
Year (volume), issue: 2016 (029) 012
Pages: 4 (P142-144, 147)
Keywords: academic search; navigation; query; machine learning
Author: Wan Liancheng
Affiliation: Periodical Center, Xidian University, Xi'an, Shaanxi 710071
Language of text: Chinese
CLC classification: TP305
Academic search engines have become the starting point for many researchers when drafting a research manuscript or beginning a research proposal.
Cognitive Analysis Method for Mining Innovation Points in Academic Abstracts
Journal of the China Society for Scientific and Technical Information (情报学报), May 2021, 40(5): 489-499
Wen Hao and He Qianru (School of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an 710055)
Abstract: To overcome the difficulties that the diversity and richness of the ways academic abstracts express innovation points pose for knowledge mining algorithms, this paper proposes a cognitive analysis method for mining innovation points in academic abstracts.
The method comprises five cognitive analyses: innovation point reporting in academic abstracts, lexical semantic distribution consistency, predicate verb semantic understanding, pragmatic function classification, and syntactic implication.
The results show that these five analyses form five levels of abstract mining: the information retrieval level, the ontology construction level, the semantic mining level, the pragmatic classification level, and the object implication level.
The method provides a cognitive analysis approach for processing natural language expression patterns with machine learning algorithms, improves the accuracy and coverage of algorithms that classify abstract innovation points, improves the efficiency of mining "problem, method, result" triples from abstracts, and offers theoretical and methodological guidance for building intelligent question answering systems based on a triple knowledge base.
Key words: innovation point mining; cognitive analysis; natural language processing
1 Knowledge question-answering services based on innovation points in abstracts
How to make effective use of massive textual academic resources to provide people with the most direct content-based knowledge question-answering services, rather than information retrieval services alone, has long been a goal of artificial intelligence research in natural language processing.
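Information Technology English, Unit 3
Unit 3 Transition to Modern Information Science
Chapter One: Part 1 Notes to Text; Part 2 Word Study; Part 3 Practice on Text; Part 4 Extensive Reading; Part 5 Notes to Passage; Part 6 Practice on Passage
Part 1 Notes to Text
Transition to Modern Information Science
1) With the 1950's came increasing awareness of the potential of automatic devices for literature searching and information storage and retrieval.
Chinese translation: 随着二十世纪五十年代的来临,人们对用于文献资料搜索、信息储存与检索的自动装置的潜力认识日益增长。
Note: This sentence is fully inverted.
The subject is "awareness"; the prepositional phrase "With the 1950's" is an adverbial modifying the predicate verb "came".
2) As these concepts grew in magnitude and potential, so did the variety of information science interests.
Chinese translation: 由于这些概念的大量增长,潜移默化,对信息科学研究的各种兴趣也亦如此。
Note: The prepositional phrase "in magnitude and potential" serves as an adverbial of manner, rendered in Chinese as 大量地,潜移默化地 (roughly "greatly and imperceptibly"); the main clause that follows is inverted because "so" is placed at its head.
"So" refers back to "grew in magnitude and potential" in the preceding clause.
3) Grateful Med at the National Library of Medicine
Chinese translation: 美国国家医学图书馆数据库
Note: Grateful Med is a link to another web-based query system of the NLM (National Library of Medicine).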
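TF-IDF Ranking Explained
The concept of TF/IDF (term frequency/inverse document frequency) is widely regarded as one of the most important inventions in information retrieval.
I. TF/IDF describes the relevance of a single term to a particular document
TF (Term Frequency): expresses how relevant a term is to a given document.
Formula: the number of times the term occurs in the document divided by the total number of occurrences of all terms in that document.
IDF (Inverse Document Frequency): expresses how much weight a term carries in indicating the topic of a document.
It is obtained by comparing the number of documents that contain the term with the total number of documents in the document set.
The more documents the term appears in, the smaller its weight.
The formula is log(D/Dt), where D is the total number of documents in the document set and Dt is the number of documents that contain the term.
The relevance of a document to a search for the keywords k1, k2, k3 then becomes TF1*IDF1 + TF2*IDF2 + TF3*IDF3.
For example, suppose document1 contains 1000 terms in total, and k1, k2, k3 occur in document1 100, 200, and 50 times respectively.
The numbers of documents containing k1, k2, k3 are 1000, 10000, and 5000 respectively.
The document set contains 10000 documents in total.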
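The two formulas and the example figures given above can be put into a short sketch. The function names are illustrative choices, and the natural logarithm is an assumption about what "log" denotes here.

```python
import math
from typing import Sequence


def tf(term_count: int, total_terms: int) -> float:
    """Term frequency: occurrences of the term / all term occurrences in the document."""
    return term_count / total_terms


def idf(num_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency log(D/Dt), assuming the natural logarithm."""
    return math.log(num_docs / docs_with_term)


def relevance(term_counts: Sequence[int], total_terms: int,
              doc_freqs: Sequence[int], num_docs: int) -> float:
    """Sum of TF*IDF over the query keywords."""
    return sum(tf(c, total_terms) * idf(num_docs, df)
               for c, df in zip(term_counts, doc_freqs))


# Figures from the example: k1, k2, k3 in document1.
score = relevance([100, 200, 50], 1000, [1000, 10000, 5000], 10000)
```

A keyword that occurs in every document of the set (k2 here) contributes nothing to the score, because its IDF is log(1) = 0.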
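TF1 = 100/1000 = 0.1
TF2 = 200/1000 = 0.2
TF3 = 50/1000 = 0.05
IDF1 = log(10000/1000) = log(10) = 2.3
IDF2 = log(10000/10000) = log(1) = 0
IDF3 = log(10000/5000) = log(2) = 0.69
(log here is the natural logarithm.)
The relevance of the keywords k1, k2, k3 to document1 is therefore 0.1*2.3 + 0.2*0 + 0.05*0.69 = 0.2645.
In document1, k1 carries more weight than k3, and the weight of k2 is 0.
Under specific conditions, TF/IDF corresponds to the relative entropy (Kullback-Leibler divergence) of the probability distribution of the keywords.
II. Using TF/IDF to describe the similarity of documents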
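The Aggregation Venues of the Ontology Aggregation Method
The Aggregation Venues of the Ontology Aggregation Method: Exploring the Sources of Knowledge
Introduction
In today's era of information explosion, people face a vast, disorderly body of knowledge with only short spans of attention.
Against this background, the ontology aggregation method, as an approach to knowledge management and integration, has become a useful tool for coping with information overload.
The venues in which ontology aggregation takes place, however, remain a topic of considerable interest.
This article examines these venues from several angles.
I. The digital environment: where the knowledge of a new era converges
With the development of the Internet, the digital environment has become an ideal venue for ontology aggregation.
Cyberspace, being open, shared, and diverse, supplies ontology aggregation with abundant resources and ideas.
1. Internet knowledge bases
The Internet holds extensive information resources, such as Baidu Baike and Wikipedia.
Through ontology aggregation, these online knowledge bases integrate scattered entries into a complete ontology, so that knowledge is centralized and unified.
2. Open collaboration platforms
Open collaboration platforms such as Wikipedia open the creation and integration of knowledge to users worldwide.
Ontology aggregation can integrate the contributions of different users to produce a more comprehensive and accurate knowledge graph.
II. Academia: a venue for scholarly exchange
Academia, known for its depth and breadth, is also a venue for ontology aggregation.
1. Academic papers
Academic papers are one of the important sources for building ontological knowledge.
Through literature reviews and scholarly exchange, ontology aggregation can cite and integrate the findings of earlier researchers to build a more complete and thorough ontology.
2. Academic conferences and discussion groups
Academic conferences and discussion groups give scholars a platform for exchange and cooperation.
By taking part in academic activities, practitioners of ontology aggregation can meet experts in the same field, obtain the latest research results, and integrate those results by means of ontology technology.
III. Social media: a venue for knowledge sharing
Social media, as a platform for spreading information widely, offers ontology aggregation a new venue.
1. Weibo and WeChat official accounts
Weibo and WeChat official accounts are important channels through which people acquire knowledge.
By following experts and opinion leaders in a field, and through reposts, comments, and discussion, ontology aggregation can gather and integrate fragments of knowledge into a more comprehensive and systematic ontology.
2. Q&A communities
Q&A communities such as Zhihu and Quora give people a platform for questions, answers, and discussion.
Ontology aggregation can collect questions and answers from these communities and integrate them, yielding a deeper understanding of problems and their solutions.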
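On the Duality of Ontology in Information Systems Research
About the authors: Wang Zhijin (1947-), male, professor and doctoral supervisor, has published more than 300 papers and 29 books; Jin Xin (1986-), male, master's student in information science (class of 2009), has published 4 papers; Wang Wenshuang (1985-), male, master's student in information science (class of 2009), has published 4 papers.
1 The concept and characteristics of Ontology
an important component of sharing.
The word Ontology first appeared in the 17th century. It was used in philosophy, as a synonym of metaphysics and "first philosophy". In philosophy, Ontology is translated as 本体论 (the theory of being): a systematic explanation or account of objective existence that is concerned with the abstract essence of objective reality, that is, a theory that studies "being". It attends to the causes of the existence of things rather than to the results of existence. Ontology established a path of philosophical inquiry in search of first origins, sufficient reason, ultimate identity, and the principle of highest value [1].
As a concept that originated in philosophy, Ontology was first used outside philosophy in artificial intelligence. It is now widely applied in knowledge engineering, knowledge representation, information retrieval, information summarization, knowledge management, and other fields. Research on ontology abroad is very active, and ontology has even been applied to enterprise integration, natural language translation, medicine, e-commerce, geographic information systems, legal information systems, biological information systems, and so on [2].
In essence, an ontology outlines the basic knowledge system and description language of a domain through a standardized description of concepts, terms, and their interrelations. The goal of an ontology is to capture the knowledge of a domain, provide a common understanding of that knowledge, determine the vocabulary commonly accepted in the domain, and give explicit definitions of these terms and of the relations among them at different levels of formalization.
Ontology has the following characteristics:
(1) Its range of use is very wide. An ontology can be converted and mapped between different modeling methods, languages, paradigms, and tools, and is inheritable and interoperable across different systems.
(2) It is functionally similar to a database to some extent, but it can express far richer knowledge than a database. On the one hand, the language used to define an ontology can express much richer information than a database at both the lexical and the semantic level; on the other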
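English-Chinese Glossary of Terms in Natural Language Processing and Computational Linguistics, Part 3 (Computer English Vocabulary)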
multilingual processing system 多语讯息处理系统multilingual translation 多语翻译multimedia 多媒体multi-media communication 多媒体通讯multiple inheritance 多重继承multistate logic 多态逻辑mutation 语音转换mutual exclusion 互斥mutual information 相互讯息nativist position 语法天生假说natural language 自然语言natural language processing (nlp) 自然语言处理natural language understanding 自然语言理解negation 否定negative sentence 否定句neologism 新词语nested structure 崁套结构network 网络neural network 类神经网络neurolinguistics 神经语言学neutralization 中立化n-gram n-连词n-gram modeling n-连词模型nlp (natural language processing) 自然语言处理node 节点nominalization 名物化nonce 暂用的non-finite 非限定non-finite clause 非限定式子句non-monotonic reasoning 非单调推理normal distribution 常态分布noun 名词noun phrase 名词组np (noun phrase) completeness 名词组完全性object 宾语{语言学}/对象{信息科学}object oriented programming 对象导向程序设计[面向对向的程序设计]official language 官方语言one-place predicate 一元述语on-line dictionary 线上查询词典 [联机词点]onomatopoeia 拟声词onset 节首音ontogeny 个体发生ontology 本体论open set 开放集operand 操作数 [操作对象]optimization 最佳化 [最优化]overgeneralization 过度概化overgeneration 过度衍生paradigmatic relation 聚合关系paralanguage 附语言parallel construction 并列结构parallel corpus 平行语料库parallel distributed processing (pdp) 平行分布处理paraphrase 转述 [释意;意译;同意互训]parole 言语parser 剖析器 [句法剖析程序]parsing 剖析part of speech (pos) 词类particle 语助词part-of relation part-of 关系part-of-speech tagging 词类标注pattern recognition 型样识别p-c (predicate-complement) insertion 述补中插pdp (parallel distributed processing) 平行分布处理perception 知觉perceptron 感觉器 [感知器]perceptual strategy 感知策略performative 行为句periphrasis 用独立词表达perlocutionary 语效性的permutation 移位petri net grammar petri 网语法philology 语文学phone 语音phoneme 音素phonemic analysis 因素分析phonemic stratum 音素层phonetics 语音学phonogram 音标phonology 声韵学 [音位学;广义语音学] phonotactics 音位排列理论phrasal verb 词组动词 [短语动词]phrase 词组 [短语]phrase marker 词组标记 [短语标记]pitch 音调pitch contour 调形变化pivot grammar 枢轴语法pivotal construction 承轴结构plausibility function 可能性函数pm (phrase marker) 词组标记 [短语标记] polysemy 多义性pos-tagging 词类标记postposition 方位词pp (preposition phrase) attachment 介词依附pragmatics 
语用学precedence grammar 优先级语法precision 精确度predicate 述词predicate calculus 述词计算predicate logic 述词逻辑 [谓词逻辑]predicate-argument structure 述词论元结构prefix 前缀premodification 前置修饰preposition 介词prescriptive linguistics 规定语言学 [规范语言学] presentative sentence 引介句presupposition 前提principle of compositionality 语意合成性原理privative 二元对立的probabilistic parser 概率句法剖析程序problem solving 解决问题program 程序programming language 程序设计语言 [程序设计语言] proofreading system 校对系统proper name 专有名词prosody 节律prototype 原型pseudo-cleft sentence 准分裂句psycholinguistics 心理语言学punctuation 标点符号pushdown automata 下推自动机pushdown transducer 下推转换器qualification 后置修饰quantification 量化quantifier 范域词quantitative linguistics 计量语言学question answering system 问答系统queue 队列radical 字根 [词干;词根;部首;偏旁]radix of tuple 元组数基random access 随机存取rationalism 理性论rationalist (position) 理性论立场 [唯理论观点]reading laboratory 阅读实验室real time 实时real time control 实时控制 [实时控制]recursive transition network 递归转移网络reduplication 重叠词 [重复]reference 指涉referent 指称对象referential indices 指针referring expression 指涉词 [指示短语]register 缓存器[寄存器]{信息科学}/调高{语音学}/语言的场合层级{社会语言学}regular language 正规语言 [正则语言]relational database 关系型数据库 [关系数据库]relative clause 关系子句relaxation method 松弛法relevance 相关性restricted logic grammar 受限逻辑语法resumptive pronouns 复指代词retroactive inhibition 逆抑制rewriting rule 重写规则rheme 述位rhetorical structure 修辞结构rhetorics 修辞学robust 强健性robust processing 强健性处理robustness 强健性schema 基朴school grammar 教学语法scope 范域 [作用域;范围]script 脚本search mechanism 检索机制search space 检索空间searching route 检索路径 [搜索路径]second order predicate 二阶述词segmentation 分词segmentation marker 分段标志selectional restriction 选择限制semantic field 语意场semantic frame 语意架构semantic network 语意网络semantic representation 语意表征 [语义表示] semantic representation language 语意表征语言semantic restriction 语意限制semantic structure 语意结构semantics 语意学sememe 意素semiotics 符号学sender 发送者sensorimotor stage 感觉运动期sensory information 感官讯息 [感觉信息]sentence 句子sentence generator 句子产生器 [句子生成程序]sentence pattern 句型separation of homonyms 同音词区分sequence 序列serial order learning 顺序学习serial 
verb construction 连动结构set oriented semantic network 集合导向型语意网络 [面向集合型语意网络]sgml (standard generalized markup language) 结构化通用标记语言shift-reduce parsing 替换简化式剖析short term memory 短程记忆sign 信号signal processing technology 信号处理技术simple word 单纯词situation 情境situation semantics 情境语意学situational type 情境类型social context 社会环境sociolinguistics 社会语言学software engineering 软件工程 [软件工程]sort 排序speaker-independent speech recognition 非特定语者语音识别spectrum 频谱speech 口语speech act assignment 言语行为指定speech continuum 言语连续体speech disorder 语言失序 [言语缺失]speech recognition 语音辨识speech retrieval 语音检索speech situation 言谈情境 [言语情境]speech synthesis 语音合成speech translation system 语音翻译系统speech understanding system 语音理解系统spreading activation model 扩散激发模型standard deviation 标准差standard generalized markup language 标准通用标示语言start-bound complement 接头词state of affairs algebra 事态代数state transition diagram 状态转移图statement kernel 句核static attribute list 静态属性表statistical analysis 统计分析statistical linguistics 统计语言学statistical significance 统计意义stem 词干stimulus-response theory 刺激反应理论stochastic approach to parsing 概率式句法剖析 [句法剖析的随机方法]stop 爆破音stratificational grammar 阶层语法 [层级语法]string 字符串[串;字符串]string manipulation language 字符串操作语言string matching 字符串匹配 [字符串]structural ambiguity 结构歧义structural linguistics 结构语言学structural relation 结构关系structural transfer 结构转换structuralism 结构主义structure 结构structure sharing representation 结构共享表征subcategorization 次类划分 [下位范畴化] subjunctive 假设的sublanguage 子语言subordinate 从属关系subordinate clause 从属子句 [从句;子句] subordination 从属substitution rule 代换规则 [置换规则] substrate 底层语言suffix 后缀superordinate 上位的superstratum 上层语言suppletion 异型[不规则词型变化] suprasegmental 超音段的syllabification 音节划分syllable 音节syllable structure constraint 音节结构限制symbolization and verbalization 符号化与字句化synchronic 同步的synonym 同义词syntactic category 句法类别syntactic constituent 句法成分syntactic rule 语法规律 [句法规则]syntactic semantics 句法语意学syntagm 句段syntagmatic 组合关系 [结构段的;组合的] syntax 句法systemic grammar 系统语法tag 标记target language 目标语言 [目标语言]task sharing 课题分享 [任务共享] tautology 套套逻辑 
[恒真式;重言式;同义反复] taxonomical hierarchy 分类阶层 [分类层次] telescopic compound 套装合并template 模板temporal inference 循序推理 [时序推理] temporal logic 时间逻辑 [时序逻辑] temporal marker 时貌标记tense 时态terminology 术语text 文本text analyzing 文本分析text coherence 文本一致性text generation 文本生成 [篇章生成]text linguistics 文本语言学text planning 文本规划text proofreading 文本校对text retrieval 文本检索text structure 文本结构 [篇章结构]text summarization 文本自动摘要 [篇章摘要] text understanding 文本理解text-to-speech 文本转语音thematic role 题旨角色thematic structure 题旨结构theorem 定理thesaurus 同义词辞典theta role 题旨角色theta-grid 题旨网格token 实类 [标记项]tone 音调tone language 音调语言tone sandhi 连调变换top-down 由上而下 [自顶向下]topic 主题topicalization 主题化 [话题化]trace 痕迹trace theory 痕迹理论training 训练transaction 异动 [处理单位]transcription 转写 [抄写;速记翻译]transducer 转换器transfer 转移transfer approach 转换方法transfer framework 转换框架transformation 变形 [转换]transformational grammar 变形语法 [转换语法] transitional state term set 转移状态项集合transitivity 及物性translation 翻译translation equivalence 翻译等值性translation memory 翻译记忆transparency 透明性tree 树状结构 [树]tree adjoining grammar 树形加接语法 [树连接语法] treebank 树图数据库[语法关系树库]trigram 三连词t-score t-数turing machine 杜林机 [图灵机]turing test 杜林测试 [图灵试验]type 类型type/token node 标记类型/实类节点type-feature structure 类型特征结构typology 类型学ultimate constituent 终端成分unbounded dependency 无界限依存underlying form 基底型式underlying structure 基底结构unification 连并 [合一]unification-based grammar 连并为本的语法 [基于合一的语法] universal grammar 普遍性语法universal instantiation 普遍例式universal quantifier 全称范域词unknown word 未知词 [未定义词]unrestricted grammar 非限制型语法usage flag 使用旗标user interface 使用者界面 [用户界面]valence grammar 结合价语法valence theory 结合价理论valency 结合价variance 变异数 [方差]verb 动词verb phrase 动词组 [动词短语]verb resultative compound 动补复合词verbal association 词语联想verbal phrase 动词组verbal production 言语生成vernacular 本地话v-o construction (verb-object) 动宾结构vocabulary 字汇vocabulary entry 词条vocal track 声道vocative 呼格voice recognition 声音辨识 [语音识别]vowel 元音vowel harmony 元音和谐 [元音和谐]waveform 波形weak verb 弱化动词whorfian hypothesis whorfian 假说word 词word frequency 词频word frequency distribution 
词频分布word order 词序word segmentation 分词word segmentation standard for chinese 中文分词规范word segmentation unit 分词单位 [切词单位]word set 词集working memory 工作记忆 [工作存储区]world knowledge 世界知识writing system 书写系统x-bar theory x标杠理论 ["x"阶理论]zipf's law 利夫规律 [齐普夫定律]。
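Overview of Ontology
Overview of Ontology
Definition of ontology
The concept of ontology originated in philosophy and can be traced back to the ancient Greek philosopher Aristotle (384-322 B.C.), who attempted to classify the things of the world. In philosophy it is defined as "a systematic description of what objectively exists in the world, that is, the theory of being" [1].
The Oxford English Dictionary defines it as "the science or study of being".
When different theorists put forward different proposals for an ontology, or when different fields of knowledge discuss such proposals, the plural form "ontologies" should be used to denote the overall set of ontologies [21].
Information systems and philosophy have long seemed like two different countries, each with its own language and culture.
Their research directions are in fact orthogonal. Today, however, ontology, a branch of philosophy, can serve as a bridge between information systems and philosophy, even though the role of ontology in information systems appears quite different from its role in philosophy [79].
Information systems need to reason over models of the world, so researchers adopted the term "ontology" for the information that describes and represents the world in programs.
An information-systems ontology is a formal language that expresses a particular domain of knowledge. A philosophical ontology explains a particular classification system for some domain of the world that does not depend on any specific language; although it uses the conceptual machinery of language as its means of description, it is neither reducible to nor identical with a language or a formal system.
Like an information-systems ontology, a philosophical ontology does account for the knowledge and conceptual framework of a field of study; its primary aim is faithful description, that is, the pursuit of truth.
Whatever the differences, philosophical ontology can still contribute to conceptual frameworks and to the development of information-systems ontologies. Its greatest contribution is to uncover certain facts about the field of study: the nature, scope, boundaries, and distinctiveness of the domain [79].
In 1991, Gruber and Neches et al. of Stanford University [37] first defined an ontology as "the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary".
In 1993, Gruber [1] adopted a formal definition of a conceptualization as a structure <D,R> [125], where D is a domain and R is the set of relevant relations on D.
An ontology is defined as "a formal, explicit specification of a shared conceptualization", a definition that captures the essential characteristics of an ontology well.
In this definition, "shared" reflects the idea that an ontology captures consensual knowledge, that is, knowledge accepted by a group of people rather than restricted to certain individuals; "conceptualization" refers to an abstract model of some phenomena in the world that identifies the relevant concepts of those phenomena; "explicit" means that the types of all concepts and the constraints among concepts are clearly defined; "formal" means that the ontology should be machine-understandable, and formalization can be of different degrees.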
INFORMATION RETRIEVAL
Patent title: INFORMATION RETRIEVAL
Inventors: CASE, Simon James; VAN KESSEL, Marcus Sebastian
Application number: EP05782824.6
Filing date: 2005-09-15
Publication number: EP1794687A1
Publication date: 2007-06-13
Abstract: Apparatus (145) for assisting a user to add a new node to an ontology stored in an ontological database (120), especially for use in a just-in-time information retrieval system. The apparatus comprises analysing means for analysing one or more documents and/or groups of documents associated, by the user, with the new node to be added to the ontology, to generate a characteristic vector for the or each associated document or group of documents, preferably using a latent semantic indexing method. The apparatus further includes a classifier for performing a classification step using the or each characteristic vector to obtain one or more indications of possibly closely related nodes and thereby to identify the parent node or nodes of at least one or more of the possibly closely related nodes. Finally, the apparatus further includes display control means for controlling a display to present the identified parent node, or at least one of the identified parent nodes where more than one is identified, for possible selection by the user.
Applicant: BRITISH TELECOMMUNICATIONS public limited company
Address: 81 Newgate Street London, Greater London EC1A 7AJ GB
Nationality: GB
Agent: Williamson, Simeon Paul
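The flow described in the abstract (vector for the user's document, closely related nodes, their parents as suggestions) can be sketched roughly as follows. This is not the patented method: plain word-count vectors with cosine similarity are swapped in for latent semantic indexing, a nearest-neighbour ranking stands in for the classifier, and the ontology `NODES` and all names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical ontology: node -> (parent node, text associated with the node).
NODES: Dict[str, Tuple[str, str]] = {
    "jazz": ("music", "saxophone trumpet improvisation swing"),
    "opera": ("music", "aria soprano libretto stage"),
    "football": ("sport", "goal striker pitch league"),
}


def vectorize(text: str) -> Counter:
    """Word-count vector (a simple stand-in for an LSI characteristic vector)."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def suggest_parents(doc_text: str, k: int = 2) -> List[str]:
    """Rank existing nodes by similarity to the document; return parents of the top k."""
    doc = vectorize(doc_text)
    ranked = sorted(NODES, key=lambda n: cosine(doc, vectorize(NODES[n][1])),
                    reverse=True)
    parents: List[str] = []
    for node in ranked[:k]:
        parent = NODES[node][0]
        if parent not in parents:
            parents.append(parent)
    return parents
```

The returned list corresponds to the candidate parent nodes that would be shown to the user for selection.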
Information retrieval: a survey

What is a database?
A database is a collection of similar data records stored in a common file (or collection of files).

Types of databases: examples
Examples: the databases that form the basis for
» catalogues of books or other types of documents
» computerized bibliographies
» address directories
» a full-text newspaper, newsletter, magazine, or journal, and collections of these
» WWW and Internet search engines
» intranet search engines
» ...

Information retrieval and related activities: figure
[Figure labels: Information management; Information retrieval; Text retrieval; Image retrieval; Presentation of information]

Information retrieval and related activities: explanation
• "Text retrieval" can be considered a part of the larger concept "information management".
• "Text retrieval" and "image retrieval" overlap greatly, because image retrieval is in most cases based on text retrieval: images are usually retrieved not by computerized investigation of the images themselves, but by searching the text that accompanies each image.

Information retrieval: the terminology
Several words are used with similar or related meanings:
» database / databank / corpus / collection / catalog / site / archive / file / web / ...
» contents of a database / records / documents / items / (web) pages / ...
» search / query / filter / ...
» thesaurus / controlled vocabulary / dictionary / lexicon / term bank / ontology / ...
» results / selection / retrieved documents / retrieved items / ...

Information retrieval software: a particular type of DBMS
• Software for information storage and retrieval (ISR software)
• Text(-oriented) database management systems (Text-DBMS)
• Text information management systems (TIMS)
• Document retrieval systems
• Document management systems

Information retrieval: via a database to the user
[Figure labels: Information content; Linear file; Inverted file; Search engine; Search interface; User; Database]

Information retrieval: building a database
[Figure labels: Records fed into the database management system; Records derived from the input and stored in the database; Inverted file, index, register of the database; Indexing; Retrieval; User]

Question
The records input in a database system to be indexed do not necessarily appear completely in the output phase; that is, they are not shown completely to the user of the system in the results of a query. Can you illustrate this?

Information retrieval: the basic processes in search systems
[Figure labels: Information problem; Representation; Query; Comparison; Indexed documents; Representation; Text documents; Retrieved, sorted documents; Evaluation and feedback]

Information retrieval systems: many components make up a system
• Any retrieval system is built up of many more or less independent components.
• To increase the quality of the results, these components can be modified more or less independently of each other.

Information retrieval systems: important components
• the information content
• a system to describe formal aspects of information items
• a system to describe the subjects of information items
• concrete descriptions of information items = application of the information description systems used
• information storage and retrieval computer program(s)
• computer system used for retrieval
• type of medium or information carrier used for distribution

Information retrieval systems: the information content
• The information content is the information that is created or gathered by the producer.
• The information content is independent of software and of distribution media.
• The information content is input into the retrieval system using
» a system (rules) to describe the formal aspects
» a system (rules) to describe the contents (classification, thesaurus, ...)

Information retrieval systems: media used for distribution
• Hard copy (for information retrieval systems only in the broad sense)
» Print
» Microfiche
• For computers (for information retrieval systems stricto sensu)
» Magnetic tape
» Floppy disk; optical disk (CD-ROM, Photo-CD, DVD...)
» Online

Information retrieval systems: the computer program
The information retrieval program consists of several modules, including:
• The module that allows the creation of the inverted file(s) = index file(s) = dictionary file(s).
• The search engine, which provides the search features and power that allow the inverted file(s) to be searched.
• The interface between the system and the user, which determines how they (can) interact to search the database (using menus and/or icons and/or templates and/or commands).

What determines the results of a search in a retrieval system?
1. the information retrieval system (= contents + system)
2. the user of the retrieval system and the search strategy applied to the system
[Figure label: Result of a search]

Layered structure of a database
Database (File)
Records
Fields
Characters
+ in many systems: relations / links between records

A simple database architecture: all records together form a database
The 'salami architecture' = 'sliced bread architecture'
» the salami or the bread is a "database"
» each slice of salami or bread is a "database record"
» there are no relations between slices / records
» the retrieval system tries to offer the appropriate slices / records to the user
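The relation between the linear file of records and the inverted file searched by the search engine can be shown with a minimal sketch. The records, the tokenization rule, and the AND-only search are assumptions made for this example, not features of any particular retrieval system.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# The "linear file": hypothetical records, one slice each.
RECORDS: List[str] = [
    "Information retrieval systems",
    "Image retrieval based on text",
    "Database management systems",
]


def build_inverted_file(records: List[str]) -> Dict[str, Set[int]]:
    """Map every word to the set of record numbers in which it occurs."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for rec_no, text in enumerate(records):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(rec_no)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """AND search: record numbers that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


INDEX = build_inverted_file(RECORDS)
```

A query is answered from the inverted file alone; the linear file is consulted only to display the records whose numbers were found.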
Question
The database architecture described here is simple, but which factors nevertheless make retrieval a complex procedure in many real databases with this architecture?

Characteristics / definition of structured text-information
• The text information is structured (files, records, fields, sub-fields, links/relations among records...).
• The length of records and fields can be "long".
• Some fields are multi-valued = they occur more than once = repeated or repeatable fields.

Structure of a bibliographic file
Record No. 1
Title
Author 1: name + first name
Author 2: ...
Source
Descriptor 1
Descriptor 2
...
Record No. 2
[Figure annotations: Sub-fields; Repeated fields]

Text retrieval and language: an overview
Problems/difficulties related to language / terminology occur
• in the case of "multi-linguality" ("cross-language information retrieval"), that is, when more than 1 language is used
» in the contents of the searched database(s) and/or in the subject descriptors of the searched database(s), OR
» in the search terms used in a query
• even when only 1 language is applied throughout the system!

Text retrieval and language: enhancing retrieval
• Retrieval can be enhanced by coping with the problems caused by the use of natural language.
• Contributions to this enhancement of retrieval can be made by
» the database producer
» the computerized retrieval system
» the searcher/user
• (The distinction between these is not very sharp and clear in all cases.)

Task - Assignment
Read about "Language and information retrieval": Chapter 4 in Large, Andrew, Tedd, Lucy A., and Hartley, R.J. Information seeking in the online age: principles and practice. London: Bowker-Saur, 1999, 308 pp.

Task - Assignment
Read about "Information organization": Chapter 5 in Large, Andrew, Tedd, Lucy A., and Hartley, R.J. Information seeking in the online age: principles and practice. London: Bowker-Saur, 1999, 308 pp.

Text retrieval and language: a word is not a concept (a)
Problem: A word or phrase or term is not the same as a concept or subject or topic.
[Figure labels: Word; Word; Concept]
So, to 'cover' a concept in a search and increase the recall of the search, the user of a retrieval system should consider an expansion of the query; that is, the user should also include other words in the query to 'cover' the concept:
» synonyms (such as Latin names of species in biology besides the common names, or scientific names besides common names of substances in chemistry)
» narrower terms, more specific terms (such as particular brand names), including terms with prefixes (for instance: viruses, retroviruses, rotaviruses...)
» spelling variations (such as UK English versus US English); possible variations after transliteration
» singular or plural forms of a noun (when this is used as a search term)
» (relevant) related terms
» various forms of a verb (when this is used in the query)
» broader terms (perhaps)

Text retrieval and language: a word is not a concept (b)
• Method to solve the problem at the time of database production:
» adding to each database record those codes from a classification system or terms from a thesaurus system that are relevant, and providing the user with knowledge about the system used; in some cases, this process is computerized (with intellectual intervention or completely automatic)
» However, this solution is not perfect:
- Addition of terms by humans from a controlled vocabulary / from a thesaurus is not easy and is time-consuming. Consequences: the added value lags behind the availability of the document; the process can delay access to the document; the process is expensive.
- Moreover, in practice, most users of the resulting database do not exploit this method.

Text retrieval and language: a word is not a concept (c)
• Method to solve the problem, provided by the computerized retrieval system:
» offering to the user a partly computerized access to the particular subject description system used by the database producer, and then linking to the database for searching
» computerized, automatic analysis of the 'free text' search terms applied in a query by the user, for transparent 'mapping' to the corresponding particular classification codes, categories, or thesaurus terms used by the database producer
» offering the searching user access to a (general) thesaurus system, even when the database producer has not categorised the database contents; in this way, the user can refine his/her query
» better, and more generally: computerized, automatic expansion of the query terms introduced by the user, based on a general thesaurus (however, not many retrieval systems offer this feature)
» to avoid the problems of possible variations at the end of search terms:
- offering the possibility to the user to truncate a search term explicitly
- computerized, automatic, transparent truncation without explicit user action
» to avoid the problems of possible prefixes and suffixes:
- computerized, automatic, transparent, intelligent morphological analysis of the query terms: 'stemming' of the 'free text' search terms used by the user; however, this does not work perfectly and has not (yet) been implemented in most retrieval systems; for languages that have a richer morphology than English, this can offer an even larger pay-off

Question
Which problems in text retrieval are illustrated by the following sentences?
Time flies like an arrow.
Fruit flies like a banana.

Text retrieval and language: ambiguity of meaning (a)
• Problem: A word or phrase can have more than 1 meaning, because natural languages have evolved spontaneously, not strictly controlled.
• Ambiguity of the meaning = polysemy.
• The meaning can depend on the context.
• The meaning may depend on the region where the term is used.
• This is a problem for retrieval.
• This decreases the precision of many searches.
• An example is the word "pascal", which can have several meanings:
» the philosopher Blaise Pascal,
» the programming language Pascal,
» the physical unit of pressure, and
» the name of many persons
• Another example:
» Turkey, the country
» Turkey, the animal
• Example of sentences:
» The banks of New Zealand flooded our mailboxes with free account proposals.
» The banks of New Zealand flooded with heavy rains account for the economic loss.
Problem: Ambiguity of meaning may be the cause of low precision.
[Figure labels: Word; Relevant concept; Irrelevant concept (NOT wanted)]

Text retrieval and language: ambiguity of meaning (b)
• Method to solve the problem at the time of database production:
» adding to each database record codes from a classification system or terms from a thesaurus system, and providing the user with knowledge about the system used; in some cases, this process is computerized (completely automatic or with intellectual intervention)
• Method to solve the problem, provided by the computerized retrieval system:
» offering to the user a partly computerized access to the subject description system and then linking to the database for searching
» searching normally (without added value), but adding value by categorizing the retrieved items in the presentation phase to assist in the 'disambiguation'; this feature is offered for instance by
- the public access module of the book catalogue of the library automation system VUBIS at VUB, Belgium, when searching items that were assigned a particular keyword

Task - Assignment
Search Clusty or Vivisimo or Wisenut as an example of a system that applies automatic, computerized subject categorization of database records.

Text retrieval and language: ambiguity of meaning (b, continued)
» Natural language processing of the queries: linguistic analysis to determine possible meanings of the query, which includes disambiguation of words in their context: "lexical" analysis = at the level of the word; "semantic" analysis = at the level of the sentence. However, most queries are short and therefore it is difficult to apply semantic analysis for disambiguation.
» Natural language processing of the documents: linguistic analysis to determine possible meanings of a sentence, which includes disambiguation of words in their context: "lexical" analysis = at the level of the word; "semantic" analysis = at the level of the sentence. However, most retrieval systems do not apply this complicated method.

A word is not a concept. A concept is not a word.
[Figure labels: Word1, Word2, Word3; Concept1, Concept2, Concept3] The most simple relation between words and concepts is NOT valid.
[Figure labels: Word1, Word2, Word3; Relevant concept 1, Irrelevant concept 2, Irrelevant concept 3]
• A concept cannot be "covered" by only 1 word or term; this may be the cause of low recall of a search.
• The meaning of many words is ambiguous; this may be the cause of low precision of a search.

Text retrieval and language: relation with recall and precision
Recapitulating the two problems discussed, we can say that
• Expansion of the query allows to increase the recall.
• Disambiguation of the query allows to increase the precision.

Text retrieval and language: evolution of meaning (a)
• Difficulty: The meaning of a word or phrase can change over time.

Text retrieval and
language:evolution of meaning (b)☺•Method to solve the problemat the time of database production:»using a categorization systemand also adapting this continuously to the changing realityand meanings of terms58 ***-Text retrieval and language:phrases composed of words (a)•Problem:Most retrieval systems can search for words,but they do not directly recognize or ‘know’phrases / terms composed of more than 1 word.!59Text retrieval and language:phrases composed of words (b)•Methods to solve the problem,provided by the computerized retrieval system:»the user can and should indicate explicitly that a few words should be considered together by the retrieval system as forming a phrase/term(for instance in many Internet search engines by putting the phrase in quotes like “three word phrase”)***-☺60Text retrieval and language:phrases composed of words (b’)»better:the retrieval system automatically recognizes a phrase/term relying on a term bank that has been created in advance;examples:the Internet search engines AltaVista and Scirus work in this way ***-☺Text retrieval and language:searching more than 1 database (a)•Problem:Searching various databases at the same time,or merging databases for searching,suffers from the problem that these databases may usecategorization systems to make the problem ofterminology and language smaller, but in most cases thesesystems are different and incompatible.!62 **--Text retrieval and language:searching more than 1 database (b)•Method to solve the problem,provided by the computerized retrieval system:»mapping of the search term chosen by the user to thevarious thesaurus terms used by the various databases;only a few retrieval systems try to accomplish this☺Text retrieval and language:relations among concepts (a)•Difficulty:In many cases, when the user combines several concepts in 1 search, the searching user cannot well communicate the intended relations among these concepts to the retrieval system.!64Text retrieval and language:relations 
among concepts (a’)»Example:concept 1 = children/sons/daughters/...concept 2 = parents/fathers/mothers/...concept 3 = beating/violence/...How to find documents on“children beating their parents”while avoiding documents on“parents beating their children ”?**--Examples !65Text retrieval and language:relations among concepts (a’’’)»Example:concept 1 = computersconcept 2 = architectureHow to find documents on“(the application/role/importance of)computers in architecture”,while avoiding documents on“the architecture of computers ”?**--Examples !66Text retrieval and language:relations among concepts (b)•Method to solve the problem,provided by the database producer:»offering facilities to the user for disambiguation,like in the more simple case of singular terms without combinations with other terms**--☺67Text retrieval and language:relations among concepts (b’)•Method to solve the problem,provided by the computerized retrieval system:»natural language analysis ofboththe documentsand the natural language queryto interpret their structure and meaning **--☺68Text retrieval and language:expressing the purpose of a search •Difficulty:Classical queries and retrieval systems work with terms to match the subject, the “aboutness”expressed in the query with the documents,but do not try to express and to understandthe purpose, aim and context of the search.**--!?? 
Question ??Which are some of the problems caused by the use of language in information retrieval?Which are some of the problems caused by the use of language in information retrieval?***-69!70Text retrieval and multi-linguality(1a)•Problem:When the user does not know well the language of a (monolingual) database, searching is not efficient.**--!Text retrieval and multi-linguality(1b)•Methods to solve the problem,at the time of database production:»adding subject descriptors in various languages(for instance in Pascal and Francis made by INIST )»adding abstracts in various languages(for instance the abstracts in English in INSPEC)»translation of the complete contents of the database These processes can be partly computerized,but they are still time consuming and expensive.☺72Text retrieval and multi-linguality(1c)•Method to solve the problem,provided by the computerized retrieval system:»translating the query of the user,by using a general multilingual thesaurus;however, most free text queries are quite short, which makes it difficult to use the context to limit possible ambiguity;disambiguation by user-computer interaction offered by the query interface, can increase the effectiveness here.**--☺Text retrieval and multi-linguality(2a)•Problem:When documents in a database are written in more than 1 language, searching that database in a single language may not be sufficient to retrieve all interesting, relevant documents.!74Text retrieval and multi-linguality(2b)•Method to solve the problem:»extensions of the methods when only 1 language is used in the documents**--☺Text retrieval and multi-linguality(3)•Problem:When more than 1 database is searched at the same time,the mechanisms to solve problems related to language ineach separate database cannot be applied so wellanymore.!76 **--Text retrieval and multi-linguality(4a)•Problem:Of course, the user should ideally be able to understandthe contents of all the retrieved documents, even whenvarious languages are used in 
those documents.!Text retrieval and multi-linguality(4b)•Methods to solve the problem,at the time of database production:»adding abstracts in various languages(for instance the abstracts in English in INSPEC)»translation of the complete contents of the database These processes can be partly computerized, but they are still time consuming and expensive.☺78Text retrieval and multi-linguality(4c)•Methods to solve the problem,provided by the computerized retrieval system:»rapid automated translation—of the titles of retrieved records/documents(for instance offered by the Internet search engine AltaVista )—of the abstracts of retrieved records/documents (for instance offered by the Internet search engine AltaVista )—of the complete retrieved records/documents **--☺☺A good text retrieval system solves some problems due to language •accepts words / terms / phrases in the query of the user •maps the words to corresponding concepts •presents these concepts to the userwho can then select the appropriate, relevant concept (“disambiguation”)•searches for this concept,even in documents written in another language •presents the resulting, retrieved documents in the language preferred by the user 80Natural language processing of the documents AND of the queryComparison and matching of bothEnhanced text retrievalusing natural language processingInformationproblemRepresentationQueryIndexed documents Representation Retrieved, sorted documents Text documents Evaluationandfeedback **--81Text retrieval and language:conclusions•The use of terms and language to retrieve information from databases/collections/corpora causes many problems.•These problems are not recognized or underestimated by many users of search/retrieval systems= The power of retrieval systems is overestimated by many users.•Much research and development is still needed to enhance text retrieval.***-!! 
Task -Assignment !!Recommended reading: Veal, D.C.Progress in documentation: Techniques of document management: a review of text retrieval and related technologies.J. Doc., Vol. 57, No. 2, March 2001, pp. 192-217.Recommended reading: Veal, D.C.Progress in documentation: Techniques of document management: a review of text retrieval and related technologies.J. Doc., Vol. 57, No. 2, March 2001, pp. 192-217.**--82!! Task -Assignment !!Recommended reading: Chowdhury, G. G., and Chowdhury, Sudatta Information retrieval in digital libraries. In: Introduction to digital libraries. London : Facet Publishing, 2003, 354 pp.Recommended reading: Chowdhury, G. G., and Chowdhury, Sudatta Information retrieval in digital libraries. In: Introduction to digital libraries. London : Facet Publishing, 2003, 354 pp.**--83?? Question ??Explain the basic relations/similarities in•speech recognition (speech to text)•translation of a text(text to text)•summarizing texts(text to summary)•text retrieval(query to texts)•cross-language text retrieval (combination)Explain the basic relations/similarities in •speech recognition (speech to text)•translation of a text (text to text)•summarizing texts (text to summary)•text retrieval (query to texts)•cross-language text retrieval (combination )**--8486 ****Hints on how to use informationsources: overview (Part 1)•Know the purpose and motivation for each search.•Do not be lazy: search on your own, before botheringexperts with requests for advice.•Plan your search in advance.•Choose the best source(s) for each search.•Use the available tools for subject searching well.•Try to cope with the language problems;avoid spelling errors in your search query;use spelling variations in your search query。
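Introduction to ontology

Formal Ontology & Informal Ontology

Formal ontology:
• Focuses on the distinction between subtypes and supertypes
• Described in a logical language or a computer language
• An axiomatized ontology is stated through axioms and definitions
• Used in mathematics, physics, and engineering
• A prototype-based ontology distinguishes subtypes by comparison with a typical member, or prototype, of each subclass
• Serves as an index to other classification systems

Ontology environments
• Merging multiple ontologies
• Diagnosing and upgrading ontologies

The Chimaera ontology environment
• Knowledge Systems Laboratory, Stanford University
• Web-based browser environment
• Accepts input in 15 specified formats
• Suggests potential merge candidates
• Supports a taxonomy resolution mode and the discovery of semantic subsumption relations
• Provides a suite of analysis functions, including incompleteness tests, taxonomic analysis, and semantic checks
• Provides a rule language that lets users add their own testing tools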
File: ontology

Version information:

!version: $Revision: 1.48 $
!date: $Date: 2002/01/15 18:25:40 $
!source: $Source: /share/go/cvs/go/doc/GO.doc.html,v $
!
!Gene Ontology
![domain of file]
!editors: Michael Ashburner (FlyBase), Midori Harris (GO), Judith Blake (MGD)
!Leonore Reiser (TAIR), Karen Christie (SGD) and colleagues
!with software by Suzanna Lewis (FlyBase Berkeley).
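Organization models for semantic knowledge

Introduction

In computing, models for organizing semantic knowledge are an important research direction. Semantic knowledge is our understanding and cognition of the world, and it is the foundation computers rely on when they interpret and process text, images, and other data. An organization model for semantic knowledge aims to provide a structured method that lets computers store, retrieve, and use semantic knowledge effectively.

Ways of organizing semantic knowledge

1. Hierarchy (Taxonomy)

A hierarchy organizes semantic knowledge as a tree. Concepts are divided into levels, and each level has its parent concepts and child concepts. A hierarchy makes it easy to browse and understand the relations between concepts.

Example:
- Animal
  - Mammal
    - Dog
    - Cat
  - Bird
    - Pigeon
    - Crow

2. Semantic Network

A semantic network organizes semantic knowledge as a graph. Concepts are represented as nodes, and the relations between concepts are represented as edges. A semantic network makes it easier to understand and reason about the relations between concepts.

Example:
- A is a kind of B
- A is a part of B
- A is a property of B
- A is a destination of B

3. Ontology

An ontology organizes semantic knowledge by defining, classifying, and relating concepts. By defining concepts, properties, and relations, an ontology builds a formal knowledge structure that computers can understand and use.

Example:
- Concept definition: defines the meaning, characteristics, and constraints of a concept.
- Concept classification: divides concepts into categories.
- Property definition: defines the properties and characteristics of a concept.
- Relation definition: defines the relations and connections between concepts.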
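The first two organization models above can be made concrete with a small sketch. This is only an illustration under stated assumptions: the taxonomy reuses the animal example from the text, while the names `PARENT`, `EDGES`, `ancestors`, `add_edge`, and `is_a`, and the `tail part_of dog` edge, are hypothetical and not taken from any particular library or from the source.

```python
from collections import defaultdict

# Taxonomy: each concept points to its parent (a tree, as in the animal example).
PARENT = {
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "pigeon": "bird",
    "crow": "bird",
}


def ancestors(concept):
    """Return the more general concepts of `concept`, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


# Semantic network: labelled edges stored as (source, relation) -> set of targets.
EDGES = defaultdict(set)


def add_edge(source, relation, target):
    """Record one labelled edge of the network."""
    EDGES[(source, relation)].add(target)


# Every taxonomy link is also an "is_a" edge of the network.
for child, parent in PARENT.items():
    add_edge(child, "is_a", parent)

# A non-hierarchical relation (hypothetical example edge).
add_edge("tail", "part_of", "dog")


def is_a(concept, candidate):
    """True if `candidate` is a direct or indirect generalization of `concept`."""
    return candidate in ancestors(concept)
```

For instance, `ancestors("dog")` walks up the tree from `dog` to `mammal` to `animal`, which is the kind of inference a hierarchy supports directly; the network adds relation types, such as `part_of`, that a pure tree cannot express.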
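Applications of semantic knowledge

Organization models for semantic knowledge are widely applied in many fields. Some examples from common fields follow.

1. Natural Language Processing (NLP)

In natural language processing, organization models for semantic knowledge are used for tasks such as semantic parsing, semantic reasoning, and semantic understanding.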
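Implementation of an Ontology-based semantic retrieval mechanism for the literature domain

Key words: semantic Web; ontology; semantic retrieval; semantic reasoning; OWL; SWRL

Yuan Hui (袁辉), Li Yanxiang (李延香)

(1. School of Information Engineering, Shaanxi Polytechnic Institute, Xianyang, Shaanxi 712000;
2. School of Information Engineering, Xianyang Normal University, Xianyang, Shaanxi 712000)

[...] problems existing in the system; taking the characteristics of literature information into account, this paper designs an Ontology-based semantic [...] for the literature domain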
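Ontology and its application in knowledge organization

Author: Sun Xin (孙鑫). Source: Economic Research Guide (《经济研究导刊》), 2011, No. 27.

Abstract: With the continuous development of information technology, the technical means of knowledge organization have become increasingly rich, and Ontology is one such new tool for knowledge organization. This paper briefly introduces the concept, components, and classification of Ontology, distinguishes it from traditional knowledge organization tools, and finally analyzes in detail the application of Ontology in knowledge organization.

Keywords: Ontology; knowledge organization; application
CLC number: TP3. Document code: A. Article ID: 1673-291X(2011)27-0215-03

I. Overview of Ontology

1. The origin and concept of Ontology.

Ontology was originally a philosophical concept, a term used to express philosophical theory, derived from the Greek ontos (being) and logos (doctrine, discourse). In the history of Western philosophy it is interpreted as "the doctrine or discourse about being and its nature and laws". Put another way, as a philosophical concept, Ontology is a particular classification system of the objective world, and this system does not depend on any specific description language.

Over the last ten to twenty years, Ontology has also gradually attracted the attention of the computer science community. Research shows that abstracting or generalizing an application domain of the real world into a set of concepts and the relations among them, that is, building an ontology of that domain, makes it much easier for computers to process and organize the knowledge of the domain. Ontology is therefore now widely applied in knowledge engineering, knowledge representation, information retrieval, information summarization, knowledge management, and other fields.

In knowledge engineering, Neches et al. gave the earliest definition of Ontology: "An ontology defines the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary" (Neches R, 1991). Since then many scholars have given their own definitions of ontology, but no unified understanding has emerged. Gruber, of the Knowledge Systems Laboratory at Stanford University, stated that "An ontology is an explicit specification of a conceptualization" (Gruber, 1993). A few years later, Borst revised this definition further: "An ontology is a explicitly formal specification of shared conceptualization" (Borst, 1997), that is, an ontology is an explicit, formal specification of a shared conceptual model.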
Ontology-driven Information Retrieval in FF-Poirot

Roberto Basili, Marco Cammisa(†), Marco Pennacchiotti
Dept. of Computer Science (DISP), University of Roma, Tor Vergata
Email: surname@info.uniroma2.it

Fabio Massimo Zanzotto
DISCO, University of Milano "Bicocca"
Email: zanzotto@disco.unimib.it

Dario Saracino, Maria Vittoria Marabello(†)
Knowledge Stones S.r.l., Roma
Email: {mmarabello,dsaracino}@ais.it

Abstract: This paper proposes a new approach for supporting domain information retrieval and information extraction on the web, using an original query expansion technique supported by an ad-hoc ontology focused on a specific domain of interest. The system has been built and tested in the framework of the FF-Poirot project, for supporting fine-grain retrieval from the Internet aiming at detecting fraudulent financial sites. In a first stage, using a short list of keywords given by the user, the application mines the web retrieving relevant documents. These documents are then clustered into coherent groups focusing on specific subjects. The ontology model is devoted to representing the most important concepts of the domain of interest and to linking them to the user need as expressed by the keywords. Once clusters of documents are made available after the first stage, the ontology can be used to extract from these clusters the most interesting documents (the most probable fraudulent sites in the framework of the FF-Poirot application). Browsing the ontology and selecting specific concepts, the user starts a query expansion engine that refines the search, creating a new query based on terminological evidence tied in the ontology to the selected concepts. The paper describes the overall software architecture of the application as used in the project, focusing specifically on the query expansion engine and the supporting ontological model adopted. Experimental evidence, as emerged in FF-Poirot, will be used to prove the feasibility and the advantages of the adopted technique.

I. INTRODUCTION

Semantic methods for
Information Retrieval (IR) are inherently limited by the influence of dangerous phenomena of ambiguity and lack of coverage. Semantic Web applications are even more problematic because of the size and heterogeneity of the target data/information. In [1], a linguistically motivated ontology model that integrates domain information with a specific lexical semantic subsystem has been presented. One potential application of the model is IR, as it makes it possible to flexibly bridge the gap between ontological primitives (concepts and n-ary relations) and lexical knowledge, e.g. terminology and verb argument structures. This model has been fully exploited in the FF-Poirot project to support fine-grain retrieval from Web pages aiming to detect financial fraud. The ontology is here used to drive (and increase the precision of) the behaviour of a meta-search engine mining the Web contents.

Specifically, positioned in the context of the Semantic Web, the FF-Poirot project (EU-IST-2001-38248, [2]) aims to build computable knowledge resources (e.g., financial and forensic ontologies) and specific methodologies and systems to support financial and legal expertise in detecting and preventing financial frauds on the Web (VAT processes, securities exchange, investments, banking and insurance services).

In the framework of the project, a key role is played by technologies able to support the expert in searching and mining the web looking for potential fraudulent sites. A completely automatic and unsupervised agent for this kind of web search is in fact too far-reaching today, for two main reasons. On the one hand, knowledge and reasoning required to detect Web fraud rely on heterogeneous and complex decision making that is far beyond language processing capabilities: for example, clues for frauds are often even outside the language sphere (e.g. an untrusted web server). On the other hand, such knowledge is highly dynamic, as fraudulent actors adapt their strategies to the countermoves of the financial institutions. The focus is rather on
effective Web search processes that trigger the detective activities: a supporting NLP system can offer advanced linguistic and ontological capabilities to speed up and refine the IR activity carried out by the legal and financial experts.

In this framework, the IR query expansion system proposed in FF-Poirot uses both specific ontological resources and NLP techniques in order to boost the IR activity with the linguistic knowledge needed to improve both precision and efficiency of the search. The aim is to set the search activity in an intuitive and coherent software environment (Protege [3]), in which the expert is allowed to navigate a financial ontology, and automatically retrieve documents related to one (or more) concepts that are tied to fraudulent events.

In Sec. II a short overview of the FF-Poirot application is presented, while in Sec. III an overview of the model for query expansion is given. Moreover, in Sec. IV and Sec. V the supporting ontological model and the IR query expansion engine are described in depth. A first evaluation of the engine is then presented in Sec. VI. Finally, Sec. VII outlines some conclusions and thoughts on the use of ontology-driven IR in the framework of the Semantic Web technologies and perspectives.

II. THE FF-POIROT APPLICATION

The FF-Poirot application has been developed together with the CONSOB expertise (footnote 1: Consob, the Italian public authority responsible for regulating the Italian securities market, plays the role of user in FF-Poirot) and belongs to the range of tools devoted to monitoring online fraudulent activities by means of Information Retrieval systems, whose performance is empowered by the embedded use of domain-specific knowledge resources. The application is intended to support government agencies (but also services firms and corporations) in the detection and prevention of online frauds (abusive investment solicitation, unauthorized investment services execution) against investors.

The application is basically ontology-driven: the use of
domain-oriented knowledge resources gives the opportunity to show how these instruments allow the implementation of powerful and yet flexible solutions, in principle portable across key application domains in industry and trade.

A search agent has then been developed, providing instruments to perform the following tasks:

• Monitoring Web sites to look for illegal investment solicitation or unauthorized investment service offers. The selection and extraction of potentially fraudulent sites is an ontology-driven Information Retrieval process, where the domain-specific ontology has been created by using information coming from the user context description (i.e. specific material on the financial fraud domain provided by the CONSOB experts); access to the ontology is used to improve the relevancy of the retrieval.
• Ontology-driven search. In the CONSOB Ontology, domain concepts, linguistic knowledge (word senses) and terminology are represented. The latter information (senses and terminological entries) can be used to query the mined Web material. They extend the set of keywords used to search and cluster the large amount of Web pages related to fraudulent financial investments and can be interactively used by the expert both before and after the download. In the before modality the ontology offers the user concept-based views of the documentary knowledge, and enables the naming of the different text clusters derived according to the defined concept hierarchy and word sense network.
• Ontology-driven browsing carried out by a query expansion engine. After the download of documents from the Web the amount of text to be inspected and verified is still challenging for the expert. In this phase the ontological concepts can be used to browse the mined Web material during inspection. Interactively, the expert can look at individual clusters of documents as they are made available by the IR component. Alternatively the user can just navigate individual clusters through the concept hierarchy and look for
specific concepts/abstractions (e.g. "investment bank" or "capital gain", "net worth"). We can call them the target search concepts (tsc). By selecting these tsc the user expresses his interest in focusing only on the documents dealing with those notions. The application supports the expansion of the tsc by means of all their related terms. In synthesis, all the tsc and their generalizations are carefully inspected by the system and the terminological entries connected with any of them are collected. This set of terms is then used by the Web search engine in a query expansion phase: documents internal to a cluster are re-ranked and form a user specific cluster. This re-ranking is helpful after the download, as it allows the system to refocus intelligently on several tsc within a single Web mining session. The ontology thus offers a sort of semantic GUI for Web site inspections.

The focus of this paper is to describe the query expansion engine and the ontological model that drives the search. Indeed, such components can be seen as the basic components of a new scalable and portable solution for domain specific ontology-driven Information Retrieval, strongly based on the principles and standards of the Semantic Web. The next sections thus describe the FF-Poirot query expansion sub-system, focusing both on the ontology and the expansion engine.

III. ONTOLOGY-DRIVEN IR MODEL

The aim of the query expansion engine, as underlined in the previous section, is to automatically refine a user query and to re-rank a cluster of documents already downloaded by the system and built by other modules of the FF-Poirot application. Each cluster represents a Task Category $tc$, and contains all documents $D_{tc}$ related to a specific area of interest for the user (e.g. on-line investments). The engine thus operates on the set of task categories $TC$ and the related set of documents $D_{TC}$.
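The expansion and re-ranking steps just described (collect the terms of each tsc and of its generalizations, then re-rank the documents of a cluster) can be sketched as follows. This is a minimal illustration, not the FF-Poirot implementation: the toy ontology in `BROADER` and `RELATED_TERMS`, the function names, and the scoring rule (count of expansion terms found in a document) are all assumptions introduced here; only the concept name and its Italian term echo examples used in the paper.

```python
# Hypothetical toy ontology: concept -> its more general concept.
BROADER = {
    "new cooperative credit bank": "cooperative credit bank",
    "cooperative credit bank": "bank",
}

# Hypothetical related-terms mapping: concept -> terminological entries.
RELATED_TERMS = {
    "new cooperative credit bank": ["nuova banca di credito cooperativo"],
    "cooperative credit bank": ["banca di credito cooperativo"],
    "bank": ["banca"],
}


def expand(target_concepts):
    """Collect the terms of each target search concept and its generalizations."""
    terms = []
    for concept in target_concepts:
        while concept is not None:
            for term in RELATED_TERMS.get(concept, []):
                if term not in terms:
                    terms.append(term)
            concept = BROADER.get(concept)  # climb to the generalization
    return terms


def rerank(cluster, terms):
    """Sort documents by the number of expansion terms they contain.

    The score is an assumed stand-in for whatever ranking the external
    IR engine applies; ties keep their original cluster order.
    """
    def score(document):
        text = document.lower()
        return sum(1 for term in terms if term in text)

    return sorted(cluster, key=score, reverse=True)
```

A usage sketch: calling `expand(["new cooperative credit bank"])` gathers the terms of the selected concept first and those of its broader concepts afterwards, and `rerank(cluster, terms)` then moves the documents mentioning most of those terms to the top of the cluster. The weighting of individual keywords discussed later in the paper is deliberately left out of this sketch.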
Task Categories are carefully integrated in an ontological model in which domain knowledge can be browsed by the user to select a specific concept, in order to re-rank documents in $D_{TC}$ and show only those interesting documents that satisfy the user's specific information need.

For example the expert could ask for pages related to the concept new cooperative credit bank, as it is a potentially fraudulent activity if present on the web. This sort of conceptual query is thus submitted to the system. The system should then be able to start up an IR search on $D_{TC}$, using the linguistic knowledge (i.e., IR keywords in the form of terms related to "new cooperative credit bank") needed to retrieve the most promisingly fraudulent web sites. At the end of the search, the retrieved sites should then be presented to the expert for a final inspection.

In this framework there are two main advantages:

• The expert is not requested to build any specific query for the IR engine. Indeed, the expert must only browse the ontology looking for the desired concepts. The burden of finding the often complex linguistic and query level expressions of the information is transferred to the system.
• The linguistic knowledge encoded in the system is able to refine the query using all the linguistic material related to the specific concept, which in most cases cannot be foreseen by the expert. Moreover, linguistic material can compose the final IR query in complex weighted Boolean expressions to boost the search.

IV. THE ONTOLOGICAL MODEL

The system architecture needs three main knowledge layers: an ontology (Sec. IV-B), representing generic domain conceptual knowledge, a corresponding domain linguistic knowledge, and a specific set of Task Categories (Sec. IV-A) expressing the user profile that drives the search.

The query expansion system (Sec. V) that builds the query for the external IR re-ranking engine works on the basis of these knowledge layers, by activating an ontology concept (conceptual query). The different layers are examined
to extract the linguistic material to form the query: moreover, each keyword is properly weighted according to the importance it assumes in the conceptual query.

The ontology gives the main contribution in the query building process, as it contains most of the linguistic material. While its semi-automatic building process can be time consuming, it is a one time effort, as domain knowledge is quite static and fixed in time. On the other hand, task categories express a sort of user specific information need. They must then be set up independently for each user: luckily, their building process is faster and less complex, as only little and unstructured knowledge is needed.

A. The task categories

Task categories are used to represent a user profile for the search. Each category represents a specific user information need, and simply consists of a list of keywords. Categories are built by previous stages of the application (see Sec. II): during system set-up, the user is asked to enter the list of keywords from which he would start his search. In a conceptualization phase keywords are then clustered semi-automatically into categories $TC$ using lexical-semantic criteria. A knowledge engineer takes care of supervising the process. The web search engine thus browses the web retrieving documents $D_{tc}$ for each category $tc$. Ranking in each cluster is a side-effect of the adopted search engine. In the final CONSOB application, this initial Web mining phase is triggered by about 110 user keywords (proposed by the experts).

Semantically, task categories thus represent a sort of situational areas of interest, as implicitly expressed by the user, that is, typical domain situations in which the user is interested.
In order to integrate this implicit information need into the domain ontology that will support the query expansion phase, a specific anchoring phase is devoted to linking categories to the ontology. In particular, each category is represented in the ontology by a so called task relation, as will be described in the next section.

For the CONSOB application 10 categories have been designed according to the user seeding information (keywords). Some examples are reported in Table I.

TABLE I
EXAMPLES OF TASK CATEGORIES FOR THE CONSOB APPLICATION

Task Category: FRANCHISING INVESTMENT (INVESTIMENTO IN FRANCHISING)
Keywords: franchising, partner, iniziativa, partner della iniziativa, titoli, titoli azionari, azioni, collocamento, investimento

Task Category: COMPANY INVESTMENT (INVESTIMENTO SOCIETARIO)
Keywords: emissione obbligazionaria, emissione azioni, emissione quote, valore nominale, aumento di capitale sociale, azioni privilegiate, azioni ordinarie, prestito obbligazionario, titoli azionari

Task Category: GAIN INVESTMENT (INVESTIMENTO IN QUOTE)
Keywords: capitale sociale, quota sociale, quota associativa, quota societaria, dividendi, azioni privilegiate, azioni ordinarie, prestito obbligazionario, titoli azionari

Task Category: ON-LINE INVESTMENT (INVESTIMENTO ON-LINE)
Keywords: investimento on-line, investimento on line, investimento in Internet

B. A syntactic-semantic interface ontology

The ontology for the IR expansion system is based on the ontological model proposed in [1]. The aim of the ontology is to model a syntactic-semantic interface between a specific domain knowledge and its linguistic realizations, in the framework of the Semantic Web. The model is in fact formalized in the OWL ontology language. A bridge between the domain conceptual knowledge (called Domain Ontology, DCH) and its linguistic counterpart (called Lexical Knowledge Base, LKB) is also modelled in the ontology.

The DCH is formed by a set of domain Concepts and relations among them, called Semantic Relations. Semantic Relations define the useful (typed) relationships required by a given application. Relations usually define what is often expressed
linguistically in terms of complex verb predicates. A Semantic Relation in our ontological model has a frame-like semantics. The resemblance with the notion of Frame, as used within the FrameNet project [4], is strong: indeed, Semantic Roles here correspond to Frame Elements. In a financial application, for example, a typical Semantic Relation is Selling, and it involves concepts like legal entities (companies and persons), products, money and so on. Major properties of the domain Relations are Semantic Roles, usually employed to characterize the participating concepts (i.e. those that act as slot fillers). Semantic Roles are thus role labels for the Concepts involved in a relation. As a semantic relation $r$ fully determines the specific concepts allowed as its fillers, legal (i.e. allowed) values for the Role slot are ontological concepts, i.e. semantic restrictions on the individuals suitable as Role fillers. In this way, Selectional Restrictions are implemented as type restrictions on Role fillers.

For example, a typical Semantic Role for a Selling relation is Buyer: its slot filler could be the legal entities concept in the DCH. Good and Money are also roles, with type restrictions to products/shares and money respectively. More formally, using a Description Logic formalism the Selling relation can be defined as follows:

$$\text{Selling} \equiv (\exists\,\text{hasBuyer}.\text{Legal Entity}) \sqcap (\exists\,\text{hasDonor}.\text{Legal Entity}) \sqcap (\exists\,\text{hasGood}.(\text{Share} \sqcup \text{Product})) \sqcap (\exists\,\text{hasMoney}.\text{Money})$$

The DCH is then devoted (as in the traditional view of ontologies) to defining properties of individuals, relations and typical tasks involved in the application process.

The LKB constitutes the language component, including lexical semantic information: Terms, Word Senses (inspired by WordNet and making use of a consistent subset of the hyponymy hierarchy, the Wordnet Base Concepts (WNBC) [5]) and linguistic relations (mainly Verbal Predicates), structured according to linguistic methods and principles and modelled independently from the domain knowledge.

DCH and LKB are mapped through specific ontological
sub-hierarchies or assertions. Specifically, Concepts are mapped to Terms through a property called related terms. Terms are usually unambiguous in a specific domain, so each is mapped to a single concept; on the other hand, a concept is linguistically represented by one or more (related) terms. For example, the concept new cooperative credit bank is mapped to its (Italian) terms through the restricted property:

\[ \forall \textit{related terms}.(\textit{nuova banca di credito cooperativo} \sqcup \textit{nuovo banco agricolo mantovano}) \]

Semantic Relations are mapped to Verbal Predicates: as both hierarchies are formalized using the frame formalism, their mapping is more complex and requires a specific sub-hierarchy to link Semantic and Syntactic Roles.

The ontology is built semi-automatically, in an incremental fashion. Starting from a minimal ontology, made available in the early phases of the ontology engineering process, an incremental process interleaves an NL learning phase (to acquire linguistic knowledge) with the ontology engineering task. The NL learning phase is carried out by linguistic systems able to automatically extract terms and verbal predicates from large corpora [6][7]. More details can be found in [1].

Specifically, two main sub-hierarchies of the above-mentioned model are used in the FF-Poirot application: Concepts and Terms. The Concepts hierarchy is used by the expert to build the conceptual query: once the needed concept is selected, all Terms linked to it are retrieved through the concept's related terms property. Terms are then used to automatically build the actual IR query.

The original ontological model presented in [1] has been augmented with a specific type of semantic relation, called task relation. In the FF-Poirot framework, a task relation represents a financial event or situation, strictly related to the fraud detection task, that is of interest to the expert. Each relation is described by appropriate semantic roles. Semantic role values are restricted to specific ontological Concepts through
selectional restrictions (expressed as a disjunction in a for-all OWL restriction). Concepts used as restrictions are called task concepts.

All the presented layers of knowledge are finally tied together. User knowledge (task categories) is linked to domain conceptual knowledge (Concepts) through selectional restrictions. Domain conceptual knowledge is linked in turn to domain linguistic knowledge (Terms) through explicit ontological relations. The resulting semantic description enables a suitable query expansion mechanism, triggered by the activation of task categories and domain concepts. For example, the task category (relation) Investment Solicitation describes the generic situation of financial investment solicitation carried out on the web, using specific semantic roles properly restricted to specific task concepts. The interplay of these different layers constitutes the strength and the richness of the above query expansion system.

V. THE QUERY EXPANSION ENGINE

The aim of the system is to create specific domain IR queries, starting from a conceptual query activated by the expert selecting an ontology Concept. The output should then be a list K of complex keywords (i.e., Terms) to submit to the external IR engine. Moreover, keywords should be properly weighted, to allow the IR engine to give more importance to keywords more closely tied to the conceptual query. Weights range between 0 and 1.

As a driving example, consider an expert looking for a fraudulent activity consisting of an investment solicitation carried out by a fake cooperative bank on the web. The expert could then activate the Concept new cooperative credit bank. The algorithm, summarized in Fig. 1, proceeds as follows.
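The procedure can be sketched in Python as follows. This is only an illustrative reading of the steps walked through in the remainder of this section, not a reproduction of Fig. 1: the Concept class, its field names, build_keywords and the toy hierarchy are hypothetical, and the reweight step assumes that the weight is divided by the number of terms linked to the ancestor (Sec. V-A).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Concept:
    """Hypothetical stand-in for an ontology Concept."""
    name: str
    terms: List[str] = field(default_factory=list)           # related terms
    parents: List["Concept"] = field(default_factory=list)   # multiple inheritance
    task_relations: List[str] = field(default_factory=list)  # non-empty => task concept


def reweight(weight: float, ancestor: Concept) -> float:
    # Assumed reading of Sec. V-A: divide by the number of terms of the ancestor.
    return weight / max(len(ancestor.terms), 1)


def expand_query(concept: Concept, weight: float,
                 keywords: Dict[str, float], activated: Set[str]) -> None:
    # Add the terms of the current concept, keeping the highest weight seen so far.
    for term in concept.terms:
        keywords[term] = max(weight, keywords.get(term, 0.0))
    # A climb path stops when a task concept is met.
    if concept.task_relations:
        activated.update(concept.task_relations)
        return
    # Otherwise follow every climb path upwards with a reduced weight.
    for parent in concept.parents:
        expand_query(parent, reweight(weight, parent), keywords, activated)


def build_keywords(concept: Concept,
                   relation_terms: Dict[str, List[str]]
                   ) -> Tuple[Dict[str, float], Set[str]]:
    keywords: Dict[str, float] = {}
    activated: Set[str] = set()
    expand_query(concept, 1.0, keywords, activated)
    # Terms of the activated task relations are added with weight 1.
    for relation in activated:
        for term in relation_terms.get(relation, []):
            keywords[term] = 1.0
    return keywords, activated


# Toy hierarchy loosely modelled on the driving example (term lists are abridged).
financial_subject = Concept("financial subject", ["soggetto finanziario"],
                            task_relations=["investment solicitation"])
bank = Concept("bank", ["banca", "banco"], parents=[financial_subject])
coop_bank = Concept("cooperative credit bank", ["banca di credito"], parents=[bank])
new_bank = Concept("new credit bank", ["nuova banca"], parents=[bank])
selected = Concept("new cooperative credit bank",
                   ["nuova banca di credito cooperativo",
                    "nuovo banco agricolo mantovano"],
                   parents=[coop_bank, new_bank])
relation_terms = {"investment solicitation": ["il guadagno", "il valore nominale"]}

keywords, activated = build_keywords(selected, relation_terms)
for term, weight in sorted(keywords.items(), key=lambda kv: -kv[1]):
    print(f"{weight:.2f}  {term}")
```

The weights printed for this toy hierarchy depend on the abridged term lists and on the assumed reweight rule, so they are not meant to match the values reported for the real ontology.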
Given the selected Concept c, all its linked Terms T_c (related terms property) are inserted in K_c with weight 1, as they represent the keywords that best express the conceptual query. In the example:

T_c = {nuova banca di credito cooperativo, nuovo banco agricolo mantovano}

A climb in the hierarchy then begins, to examine the ancestors of c. The aim is to add to K_c also the terms related to the ancestors, since they can be useful to further refine the search. As more general ancestors are less significant for the original conceptual query, terms extracted from higher levels of the hierarchy receive increasingly lower weights (the weight function is described in Sec. V-A). As the Concepts hierarchy allows for multiple inheritance, more than one climb path can be followed. Each path stops when a task concept restricting a task relation tr is met. The set of task relations activated by the task concepts is called TR, while the set of corresponding categories is denoted by TC.

Fig. 1. Query expansion algorithm. First call is expand query(c, 1). Reweight function is described in Sec. V-A.

The algorithm thus forms the IR query Q as a Boolean combination of the weighted terms. Terms directly linked to the conceptual query are the most important and receive weight 1. On the other hand, terms related to the activated task categories are also relevant, as they express the specific situation that the user implicitly had in mind when activating the concept. In between are all the terms linked to the climb-up paths. A virtuous mix of user and domain linguistic knowledge is thus reached.

In the example, the concept new cooperative credit bank has multiple inheritance, with two parents: cooperative credit bank (path 1) and new credit bank (path 2). Terms linked to cooperative credit bank and new credit bank are added to K_c with a proper weight. The ancestors of cooperative credit bank in the climb path are: bank ← financial institution ← financial subject. financial subject is a task concept, as it restricts the roles addressee, partners and
speaker of the task relation investment solicitation (ts1). Path 1 thus stops. Path 2, on the other hand, continues to climb its ancestors, which are actually the same as those of path 1, so no new terms are added. All the climb paths are then stopped, and the overall climb finishes. As a result, K_c is formed by 42 weighted terms:

K_c = {
weight 1: nuova banca di credito cooperativo, nuovo banco agricolo mantovano;
weight 0.34: banca di credito, nuova banca;
weight 0.01: titolare di diritto, titolare di conto, titolare di azione, soggetto pubblico, soggetto privato, soggetto, soggetto finanziario;
weight 0.05: istituto finanziario, istituto di credito, istituto bancario, istituto, istituzione finanziaria, istituzione bancaria, istituzione, investitore istituzionale, banco di Sicilia, banco ambrosiano, banca virtuale, banca toscana, banca telefonica, banca popolare, banca nazionale, banca locale, banca italiana, banca generale, banca estera, banco di Roma, banco di rete, banco di marca, banco di gruppo, banca depositaria, banco con prestiti onLine, banca comunitaria, banca commerciale italiana, banca commerciale, banca Antonveneta, banca agricola mantovana, banco
}

Finally, the terms K_ts1 are added to K_c with weight 1:

K_ts1 = {il guadagno, i guadagni, il gratis, il gratuito, il valore nominale, il documento informativo, la registrazione delle azioni, il sottoscrivere azioni}

The resulting expanded query is processed against the set of documents D_TC retrieved from the Web according to the generic keywords related to the involved Task Categories (tc ∈ TC). For each tc, the set of most promising documents D_tc is initially retrieved from the Web, thus maximizing recall. Then, the query expansion algorithm operates on D_TC by selecting and re-ranking items according to the K_c set: a more precise set of documents D_Q is thus finally obtained and presented to the user.

A. Terms weighting function

During the climb of a query path, each term t_i related to
a specific concept c_i is weighted according to the hierarchical distance of c_i from the concept c activated by the user. Specifically, the more generic the ancestor c_i is, the less weight its terms receive. The underlying assumption is that more general concepts are less interesting for the user, as they are less tied to the information need. The degree of generality of c_i is evaluated not only on the basis of its position in the hierarchy, but also on the number of its descendants (a concept with many descendants is assumed to be more generic than a concept with few). The weighting function is thus:

\[ f(c_i) = \frac{f(c_{i-1})}{|T_i|}, \qquad f(c_0) = 1 \]

where T_i is the set of terms t_i linked to c_i, f(c_{i-1}) is the value of the weighting function for the predecessor of c_i in the climb path, and c_0 is c.

VI. USE CASE EVALUATION

The ontology has been developed and implemented using OWL, to allow a high level of interoperability in the context of the Semantic Web. Moreover, in order to effectively support the user during the conceptual query formation phase and the final retrieval, both the ontology and the query engine have been integrated in a common graphical interface, designed as a plug-in of the Protégé ontology management tool. The plug-in shows the ontology to the user, who can easily browse the concept hierarchy and activate the desired conceptual query (Fig. 3). The application then triggers the query expansion engine, which processes the underlying query and starts the re-ranking process on D_TC. Finally, the results of the re-ranking are proposed to the user.

A qualitative evaluation of the query expansion engine can be drawn by looking at its effect on some specific user queries. In particular, in the framework of the whole FF-Poirot application, the engine is expected to improve the accuracy in retrieving possibly fraudulent sites with respect to the system without re-ranking. As a use case, we consider a typical session in which the CONSOB expert is looking for fraudulent sites on gain investment, in which an internet user is
invited to sign an agreement for acquiring quotes in a fake company. Without the expansion engine, the application simply returns the cluster related to the gain investment Task Category tc: the documents of the cluster D_tc are shown by the graphical interface, ranked according to a score derived from the user keywords of tc. As can be seen in Fig. 2, the linguistic knowledge embedded in the application is not specific and effective enough to retrieve interesting sites. As a matter of fact, the first two ranked web pages (www.canottierimestre.it and info.popcremona.it) are not fraudulent sites.

Once the expansion engine is integrated in the application, the quality of the system improves: the expert can express the information need in a more precise and coherent way, by selecting the specific concept of interest. The expert thus selects the concept "agreement module", to focus on all documents related to quotes agreement processes (Fig. 3). The selection event activates the query expansion engine, which augments the linguistic knowledge for the query with the following terms:

modulo di adesione
prospetto
prospetto informativo
prospetto contabile
prospetto di quotazione
prospetto di dettaglio
prospetto riepilogativo

Moreover, the climbing algorithm terminates in the gain investment task category, activating the related document cluster D_tc. The external IR engine thus uses the refined query to search and re-rank the documents in D_tc. As a result, the graphical interface shows the new results (Fig. 4). As expected, the best ranked documents are web sites that embed a possibly fraudulent activity. Specifically, one of them is a web page making risky offers for the acquisition of quotes of untrusted companies.

Several typical uses of the application, such as the one described above, indicate in most cases an improvement in the retrieval of fraudulent sites, revealing the beneficial effect of ontology-based methods for refining domain-specific queries. Nevertheless, an extensive
evaluation, and more generally the set-up of an evaluation model, is still required in order to better verify the impact of the ontology in the general application framework.

VII. CONCLUSION AND PERSPECTIVES

This paper presented a new methodology for supporting Information Retrieval on a specific domain, using a query expansion system based on an original ontological model. The first experimental evidence that emerged during the FF-Poirot project gave encouraging results: as a matter of fact, the accuracy of the system and its simple interface improved the process of retrieving fraudulent sites. Even if the system is to be intended as a prototypical architecture, further improvements could lead to a real and powerful Semantic Web application for mining the web and for coherently organizing and accessing the large amount of information retrieved.

Moreover, the effective use of the ontology for supporting query expansion is an interesting example of how ontology-based techniques can be successfully exploited in the framework of IR and IE applications. Specifically, it emerges that, in order to make the use of an ontology effective in real applications, the represented conceptual knowledge must be strictly tied to the lexical knowledge as it emerges from domain textual material. We believe that only the integration and an explicit model of these two layers can be successful in bridging the gap between ontological knowledge and the world of real applications and resources.

Nevertheless, the development of automatic techniques to link conceptual and lexical knowledge is still a major challenge. As future work, we will thus focus on improving the ontology learning phase, in order to make the whole process of building the knowledge base as automatic as possible. The use of relational knowledge, both at the conceptual and at the lexical level, also has to be better explored. Verb and nominal relations between terms can in fact be exploited to further enrich the expansion system, as they represent the
greater part of the domain knowledge contained in documents.

ACKNOWLEDGMENT

The authors would like to thank the Consob experts working on the FF-Poirot project. Their feedback and support during the whole architecture building process contributed significantly to the success of the system.

REFERENCES

[1] R. Basili, M. Pennacchiotti, and F. M. Zanzotto, "Language learning and ontology engineering: an integrated model for the semantic web," in Proceedings of 2nd International MEANING Workshop, Trento, Italy, 2005.
[2] G. Zhao and R. Verlinden, "Ff poirot ontology development portal," in FF Poirot Deliverable D6.1. Brussel: STAR Lab., 2003.
[3] W. E. Grosso, H. Eriksson, R. W. Fergerson, J. H. Gennari, S. W. Tu, and M. A. Musen, "Knowledge modeling at the millennium (the design and evolution of protege-2000)," 1999.
[4] C. F. Baker, C. J. Fillmore, and J. B. Lowe, "The Berkeley FrameNet project," in Proceedings of the COLING-ACL, Montreal, Canada, 1998.
[5] P. Vossen, EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Dordrecht: Kluwer Academic Publishers, 1998.
[6] R. Basili, M. Pazienza, and F. Zanzotto, "Acquisition of domain conceptual dictionaries via decision tree learning," in Proceedings of the 15th European Conference on Artificial Intelligence (ECAI 2002), Lyon, France, 2002.
[7] R. Basili, A. Moschitti, and M. Pazienza, "Language sensitive text classification," in Proceedings of 6th RIAO Conference (RIAO 2000), Content-Based Multimedia Information Access, Collège de France, Paris, France, 2000.