Semantic-based Composite Document Ranking
Formation of English Science and Technology Terminology
Abbreviation
shortening a phrase or word to create a new term (e.g., "AI" for "artificial intelligence")
Acronym
creating a new word from the initial letters of a phrase (e.g., "NASA" for "National Aeronautics and Space Administration")
Affixation
attaching affixes to the beginning or end of roots; affixes modify the meaning or function of words and can indicate number, tense, aspect, or other grammatical properties. Common affixes in science and technology terminology include prefixes such as "bio-", "chem-", "electro-", and suffixes such as "-ology", "-graph", which help to form specialized terms.
Forming English Science and Technology Terminology
Derivative method
Attachment
adding prefixes or suffixes to base words to create new terms (e.g., "bio-" for biology-related terms, "-ology" for fields of study)
A Survey of Relation Extraction Research
A Survey of Relation Extraction Research
Mu Kedong; Wan Qi (College of Computer Science, Sichuan University, Chengdu 610065)

Abstract: Many applications in natural language understanding, information extraction, and information retrieval require a better understanding of the semantic relations between two entities; this paper summarizes relation extraction research. Relation extraction is divided into two stages of study: traditional relation extraction in specific domains and open-domain relation extraction. Relation extraction systems are then systematically analyzed in three parts: extraction algorithms, evaluation metrics, and future development trends.

Journal: Modern Computer (Professional Edition), 2015(000)002, 4 pages (P18-21)
Keywords: relation extraction; machine learning; information extraction; open relation extraction
Language of the article: Chinese

With the continuous development of big data, massive amounts of information are presented to users in semi-structured or raw text form. How to use natural language processing and data mining techniques to help users obtain valuable information from it is an urgent need of contemporary computing research.
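As a concrete illustration of the traditional, pattern-based end of this spectrum, the sketch below extracts (entity, relation, entity) triples with one hand-written surface pattern. The pattern and the example sentence are invented for illustration; real systems use many such patterns or learn extractors from labeled data.

```python
import re

# A single hand-written pattern for the "is-a" relation:
# "<X> is a <Y>" -> (X, is-a, Y).
PATTERN = re.compile(r"(\w[\w ]*?) is a (\w[\w ]*)")

def extract_relations(sentence):
    """Return (head, relation, tail) triples found in the sentence."""
    return [(m.group(1).strip(), "is-a", m.group(2).strip())
            for m in PATTERN.finditer(sentence)]

triples = extract_relations("Chengdu is a city in Sichuan")
print(triples)  # [('Chengdu', 'is-a', 'city in Sichuan')]
```

Open relation extraction generalizes this idea: instead of fixing the relation ("is-a"), the relation phrase itself is extracted from the text.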
Software Architecture Overview
A software architecture is a collection of structural elements of certain forms, i.e., of components: processing components, data components, and connecting components. Processing components transform the data; data components are the information being processed; and connecting components join the different parts of the architecture together. This definition emphasizes the distinction between processing components, data components, and connecting components, and this distinction is largely preserved in other definitions and approaches.
Page 6 of 41.
Dynamic/static processing connections
The implementation form of a connection affects the design and implementation of the components involved,
e.g. synchronous vs. asynchronous invocation.
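The synchronous/asynchronous distinction can be made concrete with a small sketch (Python asyncio is used here purely as an illustration; the callee names are invented): a synchronous connector blocks the caller until the callee returns, while an asynchronous connector lets the caller keep working and collect the result later.

```python
import asyncio
import time

def sync_call():
    """Synchronous connector: the caller blocks until the callee finishes."""
    time.sleep(0.1)           # callee doing work; caller is stuck here
    return "result"

async def async_call():
    """Asynchronous connector: the callee runs without blocking the caller."""
    await asyncio.sleep(0.1)  # callee doing work concurrently
    return "result"

async def main():
    r1 = sync_call()                          # blocks until done
    task = asyncio.create_task(async_call())  # caller keeps running
    # ... caller may do other work here ...
    r2 = await task                           # collect the result later
    return r1, r2

print(asyncio.run(main()))  # ('result', 'result')
```

The choice of connector therefore shapes the component interface: an asynchronous callee must expose a handle (task, future, callback) for the caller to retrieve the result.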
Dynamic characteristics of components
Run-time scheduling
allocation of run-time environment resources and parallel execution of multiple tasks.
Lifecycle management
creation and destruction of running component instances, including the creation and destruction of other kinds of components for which a component is responsible.
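A minimal sketch of the lifecycle-management idea (all names are invented for illustration): a runtime container creates and destroys running component instances, including instances created on behalf of another component.

```python
class Container:
    """Toy runtime container that manages component instance lifecycles."""

    def __init__(self):
        self.instances = {}   # instance id -> component type
        self.next_id = 0

    def create(self, component_type):
        """Create a running instance of a component type; return its id."""
        self.next_id += 1
        self.instances[self.next_id] = component_type
        return self.next_id

    def destroy(self, instance_id):
        """Withdraw (destroy) a running instance."""
        del self.instances[instance_id]

runtime = Container()
a = runtime.create("OrderProcessor")
b = runtime.create("Logger")   # e.g. created on behalf of OrderProcessor
runtime.destroy(b)             # withdrawn when no longer needed
print(sorted(runtime.instances.values()))  # ['OrderProcessor']
```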
Software architecture as management of the software development process
It makes the constraints on system implementation explicit and supports realizing the system's quality attributes.
It helps avoid errors of direction during feasibility analysis.
It is the basis for drawing up engineering schedules and investment plans, determines the structure of the development organization, and is key to keeping the project on track.
It is a key milestone in the software development process.
Software architecture supports reuse
Product lines
Components (and component libraries)
Software frameworks
Software architecture is the bridge between requirements and code: it provides the blueprint for construction and is also the basis for testing, maintenance, and upgrades.
Combined-Similarity Image Retrieval Based on a Ranking Support Vector Machine
0 Introduction

The main approach of content-based image retrieval (CBIR) is to organize digital image files by visual content; its goal is to retrieve from an image library the set of images whose content is identical or similar to a query example submitted by the user [1-3]. For natural images, content similarity generally means visual similarity; for highly specialized image data, content similarity also involves knowledge of the relevant domain. In medical image retrieval, for example, similarity between images includes retrieving images of the same anatomical region from medical images of many body parts, and also retrieving, from images of many body parts, medical images of the same annotated category (covering imaging orientation, anatomical region, circulatory system, and so on) [4]. The most important goal of CBIR is to narrow the so-called "semantic gap", which arises from the difference between the descriptive information extracted from a visual image and the user's visual perception; the semantic gap is even more pronounced for highly specialized images.

To improve the accuracy of CBIR systems, many techniques have emerged in the field: visual content description, similarity measures, classification and clustering, and search strategies. Visual content is mainly described by feature vectors constructed with image-processing techniques; a similarity score is a value computed by similarity measures over the feature vectors of different images; and machine learning has reached into every stage of CBIR, including the construction of feature vectors, the computation of similarity, and the optimization of retrieval results.

How to apply the similarity measures of different features more effectively to improve retrieval performance is one of the research hotspots of CBIR. Arevalillo-Herráez et al. [6] combine similarities with a "product rule" based on a naive Bayes framework. Wu et al. construct a hybrid evaluation function to combine visual similarity with tag information.
Similarity combination measurement in content-based image retrieval based on rank support vector machine
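The learning of a ranking SVM is beyond a short sketch, but the core idea of combining per-feature similarity measures into one retrieval score can be shown with a weighted combination. The feature names and weights below are invented for illustration; in the paper's setting the weights would be learned (e.g., by a ranking SVM) rather than hand-set.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def combined_similarity(query, image, weights):
    """Weighted combination of per-feature similarities.

    `query` and `image` map a feature name to its vector; `weights`
    maps a feature name to its (here hypothetical) importance.
    """
    return sum(w * cosine(query[name], image[name])
               for name, w in weights.items())

weights = {"color_hist": 0.7, "texture": 0.3}   # hypothetical weights
q = {"color_hist": [1.0, 0.0], "texture": [0.5, 0.5]}
img = {"color_hist": [1.0, 0.0], "texture": [0.5, 0.5]}
print(round(combined_similarity(q, img, weights), 6))  # 1.0
```

Ranking images by this combined score, rather than by any single feature's similarity, is what allows complementary features (color, texture, annotations) to compensate for each other.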
Oracle BPM Suite: an introduction to Oracle Corporation's business process management tools
An Ontological Approach to Oracle BPM
Jean Prater, Ralf Mueller, Bill Beauregard
Oracle Corporation, 500 Oracle Parkway, Redwood City, CA 94065, USA

The Oracle Business Process Management (Oracle BPM) Suite is composed of tools for Business Analysts and Developers for the modeling of Business Processes in BPMN 2.0 (an OMG, Object Management Group, standard), Business Rules, Human Workflow, Complex Events, and many other tools. BPM operates using the common tenets of an underlying Service Oriented Architecture (SOA) runtime infrastructure based on the Service Component Architecture (SCA). Oracle Database Semantic Technologies provides native storage, querying and inferencing that are compliant with W3C standards for semantic (RDF/OWL) data and ontologies, with scalability and security for enterprise-scale semantic applications.

Semantically enabling all artifacts of BPM, from the high-level design of a Business Process Diagram to the deployment and runtime model of a BPM application, promotes continuous process refinement, enables comprehensive impact analysis and prevents unnecessary proliferation of processes and services. This paper presents the Oracle BPM ontology based upon BPMN 2.0, Service Component Architecture (SCA) and the Web Ontology Language (OWL 2). The implementation of this ontology provides a wide range of use cases in the areas of Process Analysis, Governance, Business Intelligence and Systems Management.
It also has the potential to bring together stakeholders across an Enterprise, for a true Agile End-to-End Enterprise Architecture. Example use cases are presented, as well as an outlook on the evolution of the ontology to cover the organizational and social aspects of Business Process Management.

1. Introduction

In the 1968 film 2001: A Space Odyssey, the movie's antagonist, HAL, is a computer that is capable not only of speech, speech recognition, and natural language processing, but also lip reading, apparent art appreciation, interpreting and reproducing emotional behavior, reasoning, and playing chess, all while maintaining the systems on an interplanetary mission. While the solution we present in this paper does not possess all of the capabilities of HAL, the potential benefit of combining semantic technology with Oracle BPM is the ability to define contextual relationships between business processes, together with the tools to use that context, so that "software agents" (programs working on behalf of people) can find the right information or processes and make decisions based on the established contextual relationships.

Organizations can more efficiently and effectively optimize their information technology resources through a service-oriented approach leveraging common business processes and semantics throughout their enterprise. The challenge, however, with applications built on Business Process Management (BPM) and Service Oriented Architecture (SOA) technology is that many are comprised of numerous artifacts spanning a wide range of representation formats. BPMN 2.0, the Service Component Architecture Assembly Model, Web Service definitions (in the form of WSDL), and XSLT transformations, for example, are all based on well-defined but varying type models.
To answer even simple queries on the entire BPM model, a user is left with a multitude of APIs and technologies, making the exercise difficult and highly complicated. Oracle has developed an ontology in OWL that encompasses all the artifacts of a BPM application and is stored in Oracle Database Semantic Technologies, which provides a holistic view of the entire model and a unified and standardized way to query that model using SPARQL.

Oracle is actively involved in the standards process and is leading industry efforts to use ontologies for metadata analysis. Oracle is also investigating the integration of organizational and social aspects of BPM using FOAF (the Friend of a Friend project). BPMN 2.0 task performers can be associated with a FOAF Person, Group or Organization and then used in Social Web activities to enable Business Users to collaborate on BPM models.

1.1 Benefits

The benefits of adding semantic technology to the database and to business process management in the middleware, driven by an underlying ontology, are threefold:
1. It promotes continuous process refinement. A less comprehensive process model can evolve into a complete executable process in the same model.
2. It makes it easy to analyze the impact of adding, modifying or deleting processes and process building blocks on existing processes and web services.
3. It helps prevent unnecessary proliferation of processes and services.
Combining semantic technology and business process management allows business users across organizational boundaries to find, share, and combine information and processes more easily by adding contextual relationships.

1.2 Customer Use Case

The US Department of Defense (DoD) is leading the way in the Federal Government for Architecture-driven Business Operations Transformation. A vital tenet for success is ensuring that business process models are based on a standardized representation, thus enabling the analysis and comparison of end-to-end business processes.
This will lead to the reuse of the most efficient and effective process patterns (style guide), comprised of elements (primitives), throughout the DoD Business Mission Area. A key principle in DoD Business Transformation is its focus on data ontology. The Business Transformation Agency (BTA), under the purview of the Deputy Chief Management Officer (DCMO), has been at the forefront of efforts to develop a common vocabulary and processes in support of business enterprise interoperability through data standardization. The use of primitives and reuse of process patterns will reduce waste in overhead costs, process duplication, and the building and maintaining of enterprise architectures. By aligning the Department of Defense Architecture Framework 2.0 (DoDAF 2.0) with Business Process Modeling Notation 2.0 (BPMN 2.0) and partnering with industry, the BTA is accelerating the adoption of these standards to improve government business process efficiency.

2. The Oracle BPM Ontology

The Oracle BPM ontology encompasses and expands the BPMN 2.0 and SCA ontologies. The Oracle BPM ontology is stored in Oracle Database Semantic Technologies and creates a composite model by establishing relationships between the OWL classes of the BPMN 2.0 ontology and the OWL classes of the SCA runtime ontology. For example, the BPMN 2.0 Process, User Task and Business Rule Task are mapped to components in the composite model. Send, Receive and Service Tasks, as well as Message Events, are mapped to appropriate SCA Services and References, and appropriate connections are created between the composite model artifacts.
Figure 1 illustrates the anatomy of the Business Rule Task "Determine Approval Flow" that is part of a Sales Quote demo delivered with BPM Suite.

Figure 1: Anatomy of a BPMN 2.0 Business Rule Task (visualized using TopBraid Composer)

The diagram shows that the Business Rule Task "Determine Approval Flow" is of BPMN 2.0 type Business Rule Task and is implemented by an SCA Decision Component that is connected to a BPMN Component "RequestQuote". Also of significance is that the Decision Component exposes a Service that refers to a specific XML-Schema, which is also referred to by Data Objects in the BPMN 2.0 process RequestQuote.bpmn.

3. An Ontology for BPMN 2.0

With the release of the OMG BPMN 2.0 standard, a format based on XMI and XML-Schema was introduced for the Diagram Interchange (DI) and the Semantic Model. Based on the BPMN 2.0 Semantic Model, Oracle created an ontology that is comprised of the following:
- OWL classes and properties for all BPMN 2.0 Elements that are relevant for the Business Process Model. (Classes of the BPMN 2.0 meta model that exist for technical reasons only, such as modeling m:n relationships or special containments, are not represented; note also that the work in [2] describes an ontology based on BPMN 1.x, for which no standardized meta model exists.) The OWL classes, whenever possible, follow the conventions in the BPMN 2.0 UML meta model. OWL properties and restrictions are included by adding all of the data and object properties according to the attributes and class associations in the BPMN 2.0 model; Oracle formulated SPARQL queries for envisioned use cases and added further properties and restrictions to the ontology to support those use cases.
- OWL classes and properties for instantiations of a BPMN 2.0 process model. These OWL classes cover the runtime aspects of a BPMN 2.0 process when executed by a process engine. The process engine creates BPMN 2.0 flow element instances when the process is executed. Activity logging information is captured, including timestamps for a flow element instance's activation and completion, as well as the performer of the task.
The implicit (unstated) relationships in the Oracle BPM ontology can be automatically discovered using the native inferencing engine included with Oracle Database Semantic Technologies. The explicit and implicit relationships in the ontology can be queried using Oracle Database Semantic Technologies' support for SPARQL (pattern-matching queries) and/or SPARQL mixed into SQL queries [6]. Example SPARQL queries are shown below.

Select all User Tasks in all Lanes:

    select ?usertask ?lane
    where { ?usertask rdf:type bpmn:UserTask .
            ?usertask bpmn:inLane ?lane }

Select all flow elements with their sequence flow in lane p1:MyLane (a concrete instance of RDF type bpmn:Lane):

    select ?source ?target
    where { ?flow bpmn:sourceFlowElement ?source .
            ?flow bpmn:targetFlowElement ?target .
            ?target bpmn:inLane p1:MyLane }

Select all activities in process p1:MyProcess that satisfy SLA p1:MySLA:

    select ?activity ?activityInstance
    where { ?activity bpmn:inProcess p1:MyProcess .
            ?activityInstance obpm:instanceOf ?activity .
            ?activityInstance obpm:meetSLA p1:MySLA }

A unique capability of BPMN 2.0, as compared to BPEL, for instance, is its ability to promote continuous process refinement. A less comprehensive process model, perhaps created by a business analyst, can evolve into a complete executable process that can be implemented by IT in the same model.
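To show what the "all User Tasks with their Lanes" query computes, here is a toy, stdlib-only triple-pattern sketch. The triples are invented for illustration; a real deployment would use Oracle Database Semantic Technologies or an RDF library rather than this.

```python
# Toy triple store: a list of (subject, predicate, object) triples.
triples = [
    ("ApproveQuote", "rdf:type", "bpmn:UserTask"),
    ("ApproveQuote", "bpmn:inLane", "ManagerLane"),
    ("ComputeTotal", "rdf:type", "bpmn:ServiceTask"),
]

def user_tasks_with_lanes(store):
    """Rough equivalent of:
    SELECT ?usertask ?lane
    WHERE { ?usertask rdf:type bpmn:UserTask .
            ?usertask bpmn:inLane ?lane }"""
    tasks = {s for s, p, o in store
             if p == "rdf:type" and o == "bpmn:UserTask"}
    return [(s, o) for s, p, o in store
            if p == "bpmn:inLane" and s in tasks]

print(user_tasks_with_lanes(triples))  # [('ApproveQuote', 'ManagerLane')]
```

The join on the shared variable (?usertask appearing in both patterns) is what SPARQL's basic graph pattern matching performs; the inferencing engine mentioned above additionally adds *derived* triples before matching.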
The work cited in Validating Process Refinement with Ontologies [4] suggests an ontological approach for the validation of such process refinements.

4. An Ontology for the SCA Composite Model

The SCA composite model ontology represents the SCA assembly model and is comprised of OWL classes for Composite, Component, Service, Reference and Wire, which form the major building blocks of the assembly model. The Oracle BPM ontology has OWL classes for concrete services specified by WSDL and data structures specified by XML-Schema. The transformation of the SCA assembly model to the SCA ontology includes creating finer-grained WSDL and XML-Schema artifacts to capture the dependencies and relationships between concrete WSDL operations and messages and the elements of some XML-Schema and their imported schemata.

The SCA ontology was primarily created for the purpose of Governance and to act as a bridge between the Oracle BPM ontology and an ontology that would represent a concrete runtime infrastructure. This enables the important ability to perform impact analysis to determine, for instance, which BPMN 2.0 data objects and/or data associations are impacted by the modification of an XML-Schema element, or which Web Service depends on this element. This feature helps prevent the proliferation of new types and services, and allows IT to ascertain the impact of an XML-Schema modification.

5. The Technologies

As part of the customer use case referenced in section 1.2 above, we implemented a system that takes a BPM Project comprised of BPMN 2.0 process definitions, SCA assembly model, WSDL service definitions, XML-Schema and other metadata, and creates appropriate semantic data (RDF triples) for the Oracle BPM ontology.
The triples were then loaded into Oracle Database Semantic Technologies [3], and a SPARQL endpoint was used to accept and process queries.

6. Conclusion

The Oracle BPM ontology encompasses and expands the generic ontologies for BPMN 2.0 and the SOA composite model to cover all artifacts of a BPM application, from a potentially underspecified process model in BPMN 2.0 down to the XML-Schema element and type level at runtime, for process analysis, governance and Business Intelligence. The combination of RDF/OWL data storage, inferencing and SPARQL querying, as supported by Oracle Database Semantic Technologies, provides the ability to discover implicit relationships in data and to find implicit and explicit relationships with pattern-matching queries that go beyond classical approaches of XML-Schema, XQuery and SQL.

7. Acknowledgements

We'd like to thank Sudeer Bhoja, Linus Chow, Xavier Lopez, Bhagat Nainani and Zhe Wu for their contributions to the paper and valuable comments.

8. References

[1] Business Process Model and Notation (BPMN) Version 2.0, /spec/BPMN/2.0/
[2] Ghidini C., Rospocher M., Serafini L.: BPMN Ontology, https://dkm.fbk.eu/index.php/BPMN_Ontology
[3] Oracle Database Semantic Technologies, /technetwork/database/options/semantic-tech/
[4] Ren Y., Groener G., Lemcke J., Tirdad R., Friesen A., Yuting Z., Pan J., Staab S.: Validating Process Refinement with Ontologies
[5] Service Component Architecture (SCA)
[6] Kolovski V., Wu Z., Eadon G.: Optimizing Enterprise-Scale OWL 2 RL Reasoning in a Relational Database System, ISWC 2010, pages 436-452
[7] "Use of End-to-End (E2E) Business Models and Ontology in DoD Business Architectures"; Memorandum from the Deputy Chief Management Office; April 4, 2011; Elizabeth A. McGrath, Deputy DCMO.
[8] "Primitives and Style: A Common Vocabulary for BPM across the Enterprise"; Dennis Wisnosky, Chief Architect & CTO ODCMO, and Linus Chow, Oracle; BPM Excellence in Practice 2010; Published by Future Strategies, 2010
Note: a BPMN 2.0 model element is considered underspecified if it is valid but not all attribute values relevant for execution are specified.
Entropy-Weighting Multi-View Fuzzy Clustering with a Low-Rank Constraint
Entropy-Weighting Multi-View Fuzzy C-Means With Low Rank Constraint
Zhang Jia-Xu (1), Wang Jun (1, 2), Zhang Chun-Xiang (1), Lin De-Fu (1), Zhou Ta (3), Wang Shi-Tong (1)

Abstract: Effectively mining both the internal consistency and the diversity of multi-view data is important for developing multi-view fuzzy clustering algorithms. Building on the Co-FKM framework, this paper proposes a novel multi-view fuzzy clustering algorithm called entropy-weighting multi-view fuzzy C-means with a low-rank constraint (LR-MVEWFCM). On the one hand, starting from cross-view consistency, the nuclear norm is introduced as a low-rank constraint on the fuzzy membership matrices of the multiple views. On the other hand, an adaptive view-weight adjustment strategy based on Shannon entropy is introduced so that the algorithm handles inter-view diversity according to the importance of each view. The learning criterion is optimized by the alternating direction method of multipliers (ADMM). Experimental results on both artificial datasets and UCI (University of California Irvine) datasets show the effectiveness of the proposed method.

Keywords: multi-view fuzzy clustering, Shannon entropy, low-rank constraint, nuclear norm, alternating direction method of multipliers (ADMM)
Citation: Zhang Jia-Xu, Wang Jun, Zhang Chun-Xiang, Lin De-Fu, Zhou Ta, Wang Shi-Tong. Entropy-weighting multi-view fuzzy C-means with low rank constraint. Acta Automatica Sinica, 2022, 48(7): 1760-1770. DOI: 10.16383/j.aas.c190350

With the development of diverse information-acquisition technologies, the feature data of an object can be obtained through different channels or from different perspectives, i.e., multi-view data. Multi-view data contains information about the same object from different angles.
For example, web data contains both page content and link information; videos contain both visual and audio information; and image data involves image features such as color histograms and texture features as well as text describing the image content. Multi-view learning fuses multi-view data effectively and avoids the limited information of single-view data [1-4].

Multi-view fuzzy clustering is an effective unsupervised multi-view learning method [5-7]. During multi-view clustering, it introduces fuzzy memberships of each sample to each cluster to describe the uncertainty of the sample belonging to that cluster under each view. Classical work includes the following. Reference [8] took the classical single-view fuzzy C-means (FCM) algorithm as the base model and used the complementary information between views to define a collaborative clustering criterion, proposing the Co-FC (Collaborative fuzzy clustering) algorithm. Following the collaborative idea of [8], reference [9] proposed the Co-FKM (multi-view fuzzy clustering algorithm with collaborative fuzzy K-means) algorithm, which introduces a pairwise-view membership penalty term to construct a new unsupervised multi-view collaborative learning method. Reference [10] borrowed the pairwise-view constraints used by Co-FKM and Co-FC, introduced view weights, and adopted an ensemble strategy to fuse the multi-view fuzzy membership matrices, proposing the WV-Co-FCM (Weighted view collaborative fuzzy C-means) algorithm. Reference [11] reduced inter-view differences by minimizing the Euclidean distance between samples and cluster centers under pairs of views, proposing the Co-K-means (Collaborative multi-view K-means clustering) algorithm on the K-means framework; on this basis, reference [12] proposed the fuzzy-partition-based TW-Co-K-means (Two-level weighted collaborative K-means for multi-view clustering) algorithm, which adds consistency weights to the pairwise-view Euclidean distances in Co-K-means and achieves better multi-view clustering results than Co-K-means. All of the above methods construct regularization terms over pairs of views to mine cross-view consistency and diversity, and lack a holistic treatment of all views.

Consistency and diversity are two important principles in designing multi-view clustering algorithms [10-14]. Consistency means that the clustering results of the individual views should agree as much as possible; when designing multi-view clustering algorithms, collaborative or ensemble mechanisms are often used to build a global partition matrix that yields the final clustering result [14-16]. Diversity means that each view of multi-view data reflects information about the object in a different aspect, and these pieces of information complement one another [10]; such information must be fully fused when designing a multi-view clustering algorithm.

(Manuscript received May 9, 2019; accepted July 17, 2019. Supported by the National Natural Science Foundation of China (61772239) and the Natural Science Foundation of Jiangsu Province (BK20181339). Recommended by Associate Editor Liu Yan-Jun. 1. School of Digital Media, Jiangnan University, Wuxi 214122; 2. School of Communication and Information Engineering, Shanghai University, Shanghai 200444; 3. School of Electronic Information, Jiangsu University of Science and Technology, Zhenjiang 212100.)
Considering both factors, this paper proposes a new entropy-weighting multi-view fuzzy clustering algorithm with a low-rank constraint (LR-MVEWFCM), whose main contributions can be summarized in three points:
1) A low-rank constraint criterion for view consistency under the fuzzy clustering framework. Existing multi-view fuzzy clustering algorithms mostly construct regularization terms from pairwise relations between views and ignore the global consistency information of all views. This paper introduces a low-rank regularization term from the perspective of global view consistency, yielding a new low-rank-constrained multi-view fuzzy clustering algorithm.
2) Consistency and diversity of multi-view clustering are considered simultaneously under the fuzzy clustering framework: in addition to the low-rank constraint, a multi-view Shannon-entropy weighting strategy oriented toward view diversity is used; during iterative optimization, the view weights are adjusted dynamically to emphasize views with better separability, improving clustering performance.
3) The alternating direction method of multipliers (ADMM) [15] is used, for the first time in this fuzzy clustering framework, to optimize and solve the LR-MVEWFCM algorithm.

Notation: let $N$ be the number of samples, $D$ the sample dimension, $K$ the number of views, $C$ the number of clusters, and $m$ the fuzzy index. Let $x_{j,k}$ denote the feature vector of the $j$-th sample under the $k$-th view ($j = 1, \ldots, N$, $k = 1, \ldots, K$); $v_{i,k}$ the $i$-th cluster center under the $k$-th view ($i = 1, \ldots, C$); and $U_k = [\mu_{ij,k}]$ the fuzzy membership matrix of the $k$-th view, where $\mu_{ij,k}$ is the membership of the $j$-th sample to the $i$-th cluster center under the $k$-th view.

Section 1 reviews, as related work, the classical fuzzy C-means clustering model FCM [17] and the multi-view fuzzy clustering model Co-FKM [9]; Section 2 combines low-rank theory with multi-view Shannon entropy theory to develop the new method; Section 3 validates the algorithm on simulated datasets and UCI (University of California Irvine) datasets and gives an experimental analysis; Section 4 concludes.

1 Related Work

1.1 Fuzzy C-means clustering (FCM)

In the single-view setting, let $x_1, \ldots, x_N \in \mathbb{R}^D$ be the samples, $U = [\mu_{i,j}]$ the fuzzy partition matrix, and $V = [v_1, v_2, \ldots, v_C]$ the cluster centers. The FCM objective function can be written as

  $J_{\mathrm{FCM}} = \sum_{i=1}^{C} \sum_{j=1}^{N} \mu_{i,j}^{m} \| x_j - v_i \|^2$, subject to $\sum_{i=1}^{C} \mu_{i,j} = 1$.  (1)

The necessary conditions for $J_{\mathrm{FCM}}$ to attain a local minimum are

  $\mu_{i,j} = \Big( \sum_{c=1}^{C} \big( \| x_j - v_i \| / \| x_j - v_c \| \big)^{2/(m-1)} \Big)^{-1}$,  (2)
  $v_i = \sum_{j=1}^{N} \mu_{i,j}^{m} x_j \big/ \sum_{j=1}^{N} \mu_{i,j}^{m}$.  (3)

Iterating (2) and (3) makes the objective converge to a local minimum, yielding the fuzzy partition matrix $U$ of the samples with respect to the cluster centers.

1.2 Multi-view fuzzy clustering: the Co-FKM model

On the basis of the classical FCM algorithm, reference [9] introduced a cross-view collaborative regularization term to constrain the consistency information between views, proposing the multi-view fuzzy clustering model Co-FKM. Its objective $J_{\mathrm{Co\text{-}FKM}}$ involves a collaborative partition parameter $\eta$ and a view-consistency term $\Delta$; when the views become consistent, $\Delta$ tends to 0. After the memberships $\mu_{ij,k}$ of each view are obtained by iteration, to obtain a global fuzzy membership partition matrix, Co-FKM takes the geometric mean of the per-view memberships, giving the overall partition $\hat{\mu}_{ij}$ of the dataset.

2 Entropy-weighting multi-view fuzzy clustering with a low-rank constraint

To address the shortcomings of current multi-view fuzzy clustering research, this paper proposes the new method LR-MVEWFCM. On the one hand, a low-rank constraint term is introduced into the objective learning criterion of the multi-view fuzzy clustering algorithm to control the consistency of the views globally during clustering; on the other hand, based on Shannon entropy theory, an entropy-weighting mechanism controls the diversity between views. ADMM is used to optimize and solve the model.

Stacking the multi-view memberships $U_1, \ldots, U_K$ into an overall membership matrix $U = [U_1 \cdots U_K]^{\mathrm{T}}$, the rank function of $U$ is convexly relaxed to the nuclear norm; imposing a low-rank constraint on $U$ turns the consistency problem of the multi-view data into the nuclear-norm minimization

  $\min \| U \|_{*}$,  (8)

where $\| \cdot \|_{*}$ denotes the nuclear norm. The optimization of (8) guarantees the low-rank constraint on the global partition matrix. Introducing the low-rank constraint remedies the limitation that most current multi-view clustering algorithms can only build constraints over pairs of views, and better mines the global consistency information contained in multi-view data.

Existing multi-view clustering algorithms usually let every view share the clustering result equally by default [11], but in practice the data of some views are poorly separable because of overlapping spatial distributions.
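Two closed-form pieces of this kind of optimization are simple to sketch in stdlib-only code: an entropy-regularized view-weight update of the generic form $w_k \propto \exp(-J_k/\lambda)$ (so views with smaller per-view loss $J_k$ receive larger weight), and the singular-value soft-thresholding operator $S_{\theta/\rho}(\sigma) = \max(0, \sigma - \theta/\rho)$ used in the nuclear-norm proximal step. The numbers are invented for illustration, and this generic form is a sketch rather than the paper's exact update rule.

```python
import math

def entropy_view_weights(view_losses, lam):
    """Softmin weights: w_k proportional to exp(-J_k / lam), summing to 1.
    A smaller per-view loss J_k yields a larger weight w_k; lam balances
    concentration (small lam) against uniformity (large lam)."""
    scores = [math.exp(-j / lam) for j in view_losses]
    total = sum(scores)
    return [s / total for s in scores]

def soft_threshold(singular_values, tau):
    """S_tau(sigma) = max(0, sigma - tau), applied entrywise to the
    singular values in the nuclear-norm proximal (shrinkage) step."""
    return [max(0.0, s - tau) for s in singular_values]

w = entropy_view_weights([2.0, 1.0, 4.0], lam=1.0)  # view 2 gets most weight
print(round(sum(w), 6), w.index(max(w)))
print(soft_threshold([3.0, 0.5, 0.1], tau=0.5))  # [2.5, 0.0, 0.0]
```

Soft thresholding sets small singular values to zero, which is exactly how the nuclear-norm term drives the stacked membership matrix toward low rank, i.e., toward agreement across views.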
To keep the data of such views from unduly degrading the clustering result, each view is weighted, and a Shannon-entropy regularization term is constructed to adjust the view weights effectively during clustering, so that views with better separability receive weight coefficients that are as large as possible, achieving a better clustering result. Let the view weights satisfy $w_k \geq 0$ and $\sum_{k=1}^{K} w_k = 1$; the Shannon entropy regularization term is

  $H(w) = \sum_{k=1}^{K} w_k \ln w_k$.  (9)

In summary, this paper makes the following modifications: first, the proposed low-rank constraint is imposed on the global fuzzy membership matrix $U = [U_1 \cdots U_K]^{\mathrm{T}}$; second, the loss function is computed with view weights $w = [w_1, \ldots, w_K]$ over the $K$ views, and the Shannon-entropy regularizer on the weights is added. The objective function of LR-MVEWFCM is

  $J_{\mathrm{LR\text{-}MVEWFCM}} = \sum_{k=1}^{K} w_k \sum_{i=1}^{C} \sum_{j=1}^{N} \mu_{ij,k}^{m} \| x_{j,k} - v_{i,k} \|^2 + \lambda \sum_{k=1}^{K} w_k \ln w_k + \theta \| U \|_{*}$,  (10)

subject to $\sum_{i=1}^{C} \mu_{ij,k} = 1$, $\sum_{k=1}^{K} w_k = 1$, and $w_k \geq 0$. The fuzzy index is set to $m = 2$ in this paper.

2.1 An ADMM-based solver

In this section ADMM is used to minimize objective (10) by an alternating-direction iteration strategy. Letting $g(Z) = \theta \| Z \|_{*}$, the minimization of (10) can be rewritten as a constrained optimization problem with the splitting constraint $U = Z$, dual variable $Q$, and penalty parameter $\rho$. Its solution decomposes into the following subproblems:
1) $V$-subproblem: fix $U$ and $w$, and update the centers $v_{i,k}$; minimizing the objective gives a closed-form solution (weighted means, as in FCM).
2) $U$-subproblem: fix $V$, $w$, $Q$ and $Z$, and update $U$; minimizing the objective gives a closed-form solution $U^{(t+1)}$.
3) $w$-subproblem: fix $V$ and $U$, and update $w$.
4) $Z$-subproblem: fix $Q$ and $U$; introducing the soft-thresholding operator gives the solution $Z^{(t+1)} = A \, S_{\theta/\rho}(\Sigma) \, B^{\mathrm{T}}$, where $U^{(t+1)} + Q^{(t)} = A \Sigma B^{\mathrm{T}}$ is a singular value decomposition of $U^{(t+1)} + Q^{(t)}$ and $S_{\theta/\rho}(\Sigma) = \mathrm{diag}(\{\max(0, \sigma_i - \theta/\rho)\})$, $i = 1, 2, \ldots, N$; the proximal operator of the nuclear norm is given by this soft-thresholding operator.
5) $Q$-subproblem: fix $Z$ and $U$, and update the dual variable $Q$.

Through the above iterations, the objective converges to a local extremum, and the fuzzy membership matrices of the different views are obtained. Following the ensemble strategy of [10], the view weights $w = [w_1, \ldots, w_K]$ and the membership matrices $U_k$ are used to construct a global fuzzy partition matrix $\tilde{U} = \sum_k w_k U_k$, where $w_k$ and $U_k$ are the weight coefficient and the fuzzy membership matrix of the $k$-th view, respectively.

The LR-MVEWFCM algorithm is described as follows:
Input: a multi-view sample set with $K$ views, where any view $k$ ($1 \leq k \leq K$) corresponds to the sample set $X_k = \{x_{1,k}, \ldots, x_{N,k}\}$; the number of cluster centers $C$; the iteration threshold $\epsilon$; the maximum number of iterations $T$.
Output: the cluster centers $v_{i,k}$ of each view, the fuzzy partition matrix $\tilde{U}$, and the view weights $w_k$.
Step 1. Randomly initialize $V^{(0)}$; normalize $U^{(0)}$ and $w^{(0)}$; set $t = 0$.
Step 2. Update $v_{i,k}^{(t+1)}$.
Step 3. Update $U^{(t+1)}$.
Step 4. Update $w_k^{(t+1)}$.
Step 5. Update $Z^{(t+1)}$.
Step 6. Update $Q^{(t+1)}$.
Step 7. If $L^{(t+1)} - L^{(t)} < \epsilon$ or $t > T$, terminate; otherwise return to Step 2.
Step 8. From the view weights $w_k$ and per-view memberships $U_k$ obtained in Step 7, compute $\tilde{U}$.

2.2 Discussion

2.2.1 Comparison with low-rank-constrained algorithms

In recent years, machine learning models based on low-rank constraints have been widely studied. Classical work includes the LRR (Low rank representation) model proposed in [16], which convexly relaxes the matrix rank function to the nuclear norm and obtains a low-rank-representation-based affinity matrix by solving a nuclear-norm minimization problem; the low-rank tensor constrained multi-view subspace clustering algorithm (LT-MSC) proposed in [14], which finds subspace representation matrices with low-rank constraints across views; and [18], which further introduces the low-rank constraint into multi-model subspace clustering with good performance. This paper combines the low-rank constraint with the multi-view fuzzy clustering framework, proposing LR-MVEWFCM and using the low-rank constraint to realize consistency across multi-view data; the method can be seen as an important extension of low-rank models to the field of multi-view fuzzy clustering.

2.2.2 Comparison with the multi-view Co-FKM algorithm

Figures 1 and 2 show the workflows of the multi-view Co-FKM algorithm and the proposed LR-MVEWFCM algorithm, respectively.

Figure 1: Co-FKM workflow for the multi-view clustering task (multi-view data -> per-view data for views 1 to K -> pairwise constraints between views -> per-view fuzzy memberships U_1, ..., U_K -> ensemble decision function -> partition matrix U-hat).
The proposed algorithm differs from the classical multi-view Co-FKM algorithm both in the consistency constraint on multi-view information and in the ensemble strategy for the multi-view clustering results. For the consistency constraint, this paper extends Co-FKM's pairwise inter-view constraints to a global all-view consistency constraint; for the ensemble strategy, instead of Co-FKM's simple geometric mean of the membership matrices, the per-view memberships are combined with the view weights to construct an ensemble decision function that respects view diversity.

3 Experiments and analysis

3.1 Experimental setup

Simulated datasets and real UCI datasets are used for experimental validation. Four clustering algorithms are selected for comparison: FCM [17], CombKM [19], Co-FKM [9] and Co-Clustering [20]; the parameter settings are listed in Table 1. Experimental environment: Intel Core i5-7400 CPU at 2.3 GHz, 8 GB of memory, MATLAB 2015b.

Two performance indices are used to evaluate the results of each algorithm:
1) Normalized mutual information (NMI) [10], where $N_{i,j}$ measures the agreement between class $i$ and class $j$, $N_i$ and $N_j$ are the numbers of samples in classes $i$ and $j$, and $N$ is the total number of samples;
2) Rand index (RI) [10], where $f_{00}$ is the number of pairs of points with different class labels assigned to different clusters, $f_{11}$ is the number of pairs with the same class label assigned to the same cluster, and $N$ is the total number of samples.
Both indices take values in $[0, 1]$; the closer to 1, the better the clustering performance. To verify robustness, every performance value reported in the tables is the average of 10 runs.

Table 1 Parameter definitions and settings
- FCM: classical single-view fuzzy clustering; fuzzy index $m = \min(N, D-1) / (\min(N, D-1) - 2)$, where $N$ is the number of samples and $D$ the sample dimension.
- CombKM: combined K-means algorithm; no parameters.
- Co-FKM: multi-view collaboratively partitioned fuzzy clustering; fuzzy index $m$ as above; collaborative learning coefficient $\eta \in (K-1)/K$, where $K$ is the number of views; step size $\rho = 0.01$.
- Co-Clustering: collaborative clustering over samples and feature space; regularization coefficients $\lambda \in \{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$ and $\mu \in \{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$.
- LR-MVEWFCM: the proposed entropy-weighting multi-view fuzzy clustering with a low-rank constraint; view-weight balance factor $\lambda \in \{10^{-5}, 10^{-4}, \ldots, 10^{5}\}$, low-rank regularization coefficient $\theta \in \{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$, fuzzy index $m = 2$.
- MVEWFCM: LR-MVEWFCM with the low-rank regularization coefficient $\theta = 0$; view-weight balance factor $\lambda \in \{10^{-5}, 10^{-4}, \ldots, 10^{5}\}$, fuzzy index $m = 2$.

Figure 2: LR-MVEWFCM workflow for the multi-view clustering task (multi-view data -> per-view data for views 1 to K -> global constraint -> per-view fuzzy memberships U_1, ..., U_K and view weights w_1, ..., w_K -> diversity-aware ensemble decision function -> partition matrix U-tilde with view diversity).

3.2 Simulated-dataset experiments

To evaluate the clustering effect of the proposed algorithm on multi-view data, the method of [10] is used to construct a three-dimensional simulated dataset A $(x, y, z)$. Its generation process is as follows: first, in MATLAB, the normally distributed random function normrnd constructs data subsets A1 $(x, y, z)$, A2 $(x, y, z)$ and A3 $(x, y, z)$, each corresponding to one cluster and each containing 200 samples. The first and second subsets are close in feature $z$, and the second and third subsets are close in feature $x$. The three subsets are then merged into the set A $(x, y, z)$, 600 samples in total; finally, the samples in the dataset are normalized.
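The construction of dataset A can be sketched as follows, stdlib-only: normrnd is replaced by random.gauss, and the cluster means are invented since the paper does not list them, chosen so that clusters 1 and 2 are close in z and clusters 2 and 3 are close in x. The views are then formed by pairing the features two at a time.

```python
import random

random.seed(0)

def make_cluster(mean, n=200, sd=0.1):
    """One Gaussian cluster of n samples with features (x, y, z)."""
    return [tuple(random.gauss(m, sd) for m in mean) for _ in range(n)]

# Hypothetical means: clusters 1 and 2 share z = 0.5; clusters 2 and 3 share x = 1.0.
A1 = make_cluster((0.0, 0.0, 0.5))
A2 = make_cluster((1.0, 1.0, 0.5))
A3 = make_cluster((1.0, 2.0, 1.5))
A = A1 + A2 + A3                    # 600 samples in total

# Pair the features two at a time to obtain the multi-view data.
view1 = [(x, y) for x, y, z in A]   # view 1: (x, y)
view2 = [(y, z) for x, y, z in A]   # view 2: (y, z)
view3 = [(x, z) for x, y, z in A]   # view 3: (x, z)
print(len(A), len(view1[0]))        # 600 2
```

With these means, view 1 separates all three clusters, while views 2 and 3 each leave two clusters partially overlapping, mirroring the separability pattern described in the text.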
The features x, y, z are then combined pairwise in the manner of Table 2 to obtain multi-view data.

Table 2 Feature composition of the simulated dataset
View 1: x, y. View 2: y, z. View 3: x, z.

Figure 3 visualizes the samples under each view: (a) dataset A, (b) view 1, (c) view 2, (d) view 3. Observing Figure 3, the data of view 1 are well separated in space, while the data of views 2 and 3 both overlap to some degree in their spatial distributions, which degrades clustering performance under those views. Combining different views generates several new datasets, as listed in Table 3, together with the mean and variance of LR-MVEWFCM over 10 repeated runs.

Table 3 Performance comparison on the simulated datasets
No. | Views | NMI | RI
1 | view 1 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000
2 | view 2 | 0.7453 ± 0.0075 | 0.8796 ± 0.0081
3 | view 3 | 0.8750 ± 0.0081 | 0.9555 ± 0.0006
4 | views 1, 2 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000
5 | views 1, 3 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000
6 | views 2, 3 | 0.9104 ± 0.0396 | 0.9634 ± 0.0192
7 | views 1, 2, 3 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000

Comparing the performance of LR-MVEWFCM on datasets 1-3, the algorithm achieves its best result on view 1, and its performance on view 3 is better than on view 2, consistent with the spatial separability of each view's data in Figure 3. Furthermore, after the views are combined pairwise into the new datasets 4-6, LR-MVEWFCM always obtains better clustering than on any single view, which shows that mining the consistency in multi-view data through the low-rank constraint can effectively improve clustering performance.

Based on the multi-view dataset 7, the proposed algorithm is further compared with other classical clustering algorithms. From Table 4 it can be seen that, because the simulated data are well separable in some feature spaces, the proposed algorithm as well as Co-Clustering, FCM and the other algorithms all achieve very good clustering results, while CombKM is slightly worse. The reason is that CombKM focuses on mining information between samples but neglects the collaboration between views, whereas the proposed algorithm further mines the global consistency between views through the low-rank constraint and therefore obtains better clustering results than CombKM.

3.3 Real-dataset experiments

Five UCI datasets are used in this section: 1) Iris; 2) Image Segmentation (IS); 3) Balance; 4) Ionosphere; 5) Wine. Since these datasets all contain features of different types, the features can be regrouped to construct corresponding multi-view datasets; Table 5 gives the information after grouping. The multi-view clustering algorithms are run on the multi-view datasets, and FCM is run on the original datasets; the result statistics are given in Tables 6 and 7.

From the NMI and RI values in Tables 6 and 7, the clustering performance of Co-FKM is clearly better than that of the other classical clustering algorithms; compared with Co-FKM, LR-MVEWFCM uses the low-rank regularization term to mine the consistency relations between multi-view data and introduces the adaptive multi-view entropy-weighting strategy to effectively control the diversity between views, and its clustering performance is clearly more accurate and more stable, with better convergence. The results in Tables 6 and 7 also show that on the IS, Balance, Iris, Ionosphere and Wine datasets, its NMI and RI indices both improve by 3 to 5 percentage points, which demonstrates the effectiveness of the proposed algorithm in multi-view clustering.

To further demonstrate the positive role of the low-rank constraint, LR-MVEWFCM and MVEWFCM are run in the same experiments; the performance comparison is shown in Figure 4. From Figure 4 it is easy to see that, on both the simulated dataset and the real UCI datasets, LR-MVEWFCM always achieves better clustering results than MVEWFCM.

Table 4 Performance comparison of the algorithms on simulated dataset 7
Metric | Co-Clustering | CombKM | FCM | Co-FKM | LR-MVEWFCM
NMI-mean | 1.0000 | 0.9305 | 1.0000 | 1.0000 | 1.0000
NMI-std | 0.0000 | 0.1464 | 0.0000 | 0.0000 | 0.0000
RI-mean | 1.0000 | 0.9445 | 1.0000 | 1.0000 | 1.0000
RI-std | 0.0000 | 0.1171 | 0.0000 | 0.0000 | 0.0000
It follows that the low-rank constraint in the LR-MVEWFCM learning criterion can effectively exploit the consistency of multi-view data to improve the clustering performance of the algorithm.

To study the convergence of the proposed algorithm, the same eight datasets are selected for convergence experiments; the evolution of the objective function is shown in Figure 5. From Figure 5 it can be seen that on the real datasets the algorithm stabilizes after only about 15 iterations, which indicates that the algorithm is practical in scenarios with high speed requirements.

Summarizing the above experimental results, when fuzzy clustering analysis is performed on datasets with multi-view characteristics, multi-view fuzzy clustering algorithms usually obtain better clustering results than traditional single-view fuzzy clustering algorithms. In this paper, introducing the low-rank constraint into multi-view fuzzy clustering learning to strengthen the consistency relations between views, and introducing Shannon entropy to adjust the view weights and control the diversity between views, yields better clustering results than the other multi-view clustering algorithms.

3.4 Parameter sensitivity experiments

The LR-MVEWFCM algorithm contains two regularization coefficients: the view-weight balance factor $\lambda$ and the low-rank regularization coefficient $\theta$. Taking the experiment of LR-MVEWFCM on simulated dataset 7 as an example, Figure 6 shows how the performance changes as each coefficient varies from 0 to 1000. When the low-rank regularization coefficient $\theta = 0$, i.e., the term is absent, the performance is worst, which validates the low-rank regularization term added in this paper; as $\theta$ varies, the performance changes relatively little, indicating that on this dataset the algorithm is insensitive to $\theta$ and has a certain robustness. When the Shannon-entropy regularization coefficient $\lambda = 0$, the performance is likewise poor, which justifies introducing this term; as $\lambda$ grows, the performance also tends to improve, indicating that on this dataset this regularization term has a relatively clear effect.

4 Conclusion

Starting from the two aspects of consistency and diversity in multi-view clustering learning, this paper proposes an entropy-weighting multi-view fuzzy clustering algorithm with a low-rank constraint. The algorithm uses a low-rank regularization term to mine the consistency relations between multi-view data, and introduces an adaptive multi-view entropy-weighting strategy to effectively control the diversity between views, thereby improving performance. Experiments on both simulated and real datasets show that the clustering performance of the proposed algorithm is better than that of the other multi-view clustering algorithms. The algorithm also has the advantages of few iterations and fast convergence, making it practical. Since this paper adopts the classical FCM framework and uses Euclidean distance to measure the differences between data objects, the algorithm is not suitable for some high-dimensional data scenarios.
How to design multi-view clustering algorithms for high-dimensional data will be the focus of our future research.

Table 5 Multi-view data constructed based on UCI datasets
No. | Original dataset | View description (features per view) | Samples | Views | Classes
8 | IS | Shape (9), RGB (9) | 2310 | 2 | 7
9 | Iris | Sepal length + Sepal width (2), Petal length + Petal width (2) | 150 | 2 | 3
10 | Balance | left-arm weight + left-arm length (2), right-arm weight + right-arm length (2) | 625 | 2 | 3
11 | Iris | each feature as one view (1 each) | 150 | 4 | 3
12 | Balance | each feature as one view (1 each) | 625 | 4 | 3
13 | Ionosphere | each feature as one view | 351 | 34 | 2
14 | Wine | each feature as one view | 178 | 13 | 3

Table 6 Comparison of the NMI performance of the five clustering methods (mean ± std; P-value against LR-MVEWFCM)
No. | Co-Clustering | CombKM | FCM | Co-FKM | LR-MVEWFCM
8 | 0.5771 ± 0.0023 (P = 0.0019) | 0.5259 ± 0.0551 (P = 0.2056) | 0.5567 ± 0.0184 (P = 0.0044) | 0.5881 ± 0.0109 (P = 3.76×10⁻⁴) | 0.5828 ± 0.0044
9 | 0.7582 ± 7.4015×10⁻¹⁷ (P = 2.03×10⁻²⁴) | 0.7251 ± 0.0698 (P = 2.32×10⁻⁷) | 0.7578 ± 0.0698 (P = 1.93×10⁻²⁴) | 0.8317 ± 0.0064 (P = 8.88×10⁻¹⁶) | 0.9029 ± 0.0057
10 | 0.2455 ± 0.0559 (P = 0.0165) | 0.1562 ± 0.0749 (P = 3.47×10⁻⁵) | 0.1813 ± 0.1172 (P = 0.0061) | 0.2756 ± 0.0309 (P = 0.1037) | 0.3030 ± 0.0402
11 | 0.7582 ± 1.1703×10⁻¹⁶ (P = 2.28×10⁻¹⁶) | 0.7468 ± 0.0079 (P = 5.12×10⁻¹⁶) | 0.7578 ± 1.1703×10⁻¹⁶ (P = 5.04×10⁻¹⁶) | 0.8244 ± 1.1102×10⁻¹⁶ (P = 2.16×10⁻¹⁶) | 0.8768 ± 0.0097
12 | 0.2603 ± 0.0685 (P = 0.3825) | 0.1543 ± 0.0763 (P = 4.61×10⁻⁴) | 0.2264 ± 0.1127 (P = 0.1573) | 0.2283 ± 0.0294 (P = 0.0146) | 0.2863 ± 0.0611
13 | 0.1385 ± 0.0085 (P = 2.51×10⁻⁹) | 0.1349 ± 2.9257×10⁻¹⁷ (P = 2.35×10⁻¹³) | 0.1299 ± 0.0984 (P = 2.60×10⁻¹⁰) | 0.2097 ± 0.0329 (P = 0.0483) | 0.2608 ± 0.0251
14 | 0.4288 ± 1.1703×10⁻¹⁶ (P = 1.26×10⁻⁸) | 0.4215 ± 0.0095 (P = 7.97×10⁻⁹) | 0.4334 ± 5.8514×10⁻¹⁷ (P = 2.39×10⁻⁸) | 0.5295 ± 0.0301 (P = 0.4376) | 0.5413 ± 0.0364

Table 7 Comparison of the RI performance of the five clustering methods (mean ± std; P-value against LR-MVEWFCM)
No. | Co-Clustering | CombKM | FCM | Co-FKM | LR-MVEWFCM
8 | 0.8392 ± 0.0010 (P = 1.3475×10⁻¹⁴) | 0.8112 ± 0.0369 (P = 1.95×10⁻⁷) | 0.8390 ± 0.0115 (P = 0.0032) | 0.8571 ± 0.0019 (P = 0.0048) | 0.8508 ± 0.0013
9 | 0.8797 ± 0.0014 (P = 1.72×10⁻²⁶) | 0.8481 ± 0.0667 (P = 2.56×10⁻⁵) | 0.8859 ± 1.1703×10⁻¹⁶ (P = 6.49×10⁻²⁶) | 0.9358 ± 0.0037 (P = 3.29×10⁻¹⁴) | 0.9665 ± 0.0026
10 | 0.6515 ± 0.0231 (P = 3.13×10⁻⁴) | 0.6059 ± 0.0340 (P = 1.37×10⁻⁶) | 0.6186 ± 0.0624 (P = 0.0016) | 0.6772 ± 0.0227 (P = 0.0761) | 0.6958 ± 0.0215
11 | 0.8797 ± 0.0014 (P = 1.25×10⁻¹⁸) | 0.8755 ± 0.0029 (P = 5.99×10⁻¹²) | 0.8859 ± 0.0243 (P = 2.33×10⁻¹⁸) | 0.9267 ± 2.3406×10⁻¹⁶ (P = 5.19×10⁻¹⁸) | 0.9527 ± 0.0041
12 | 0.6511 ± 0.0279 (P = 0.0156) | 0.6024 ± 0.0322 (P = 2.24×10⁻⁵) | 0.6509 ± 0.0652 (P = 0.1139) | 0.6511 ± 0.0189 (P = 0.008) | 0.6902 ± 0.0370
13 | 0.5877 ± 0.0030 (P = 1.35×10⁻¹²) | 0.5888 ± 0.0292 (P = 2.10×10⁻¹⁴) | 0.5818 ± 1.1703×10⁻¹⁶ (P = 4.6351×10⁻¹³) | 0.6508 ± 0.0147 (P = 0.0358) | 0.6855 ± 0.0115
14 | 0.7187 ± 1.1703×10⁻¹⁶ (P = 3.82×10⁻⁶) | 0.7056 ± 0.0168 (P = 1.69×10⁻⁶) | 0.7099 ± 1.1703×10⁻¹⁶ (P = 8.45×10⁻⁷) | 0.7850 ± 0.0162 (P = 0.5905) | 0.7917 ± 0.0353

Figure 4: The influence of the low-rank constraint on the performance of the algorithm; panels (a) RI and (b) NMI (the X-coordinate is the dataset number and the Y-coordinate is the clustering performance index).

Figure 5: Convergence curves of the LR-MVEWFCM algorithm; panels (a)-(h) plot the objective function value against the iteration number for datasets 7-14.

References
[1] Xu C, Tao D C, Xu C. Multi-view learning with incomplete views. IEEE Transactions on Image Processing, 2015, 24(12): 5812-5825
[2] Brefeld U. Multi-view learning with dependent views. In: Proceedings of the 30th Annual ACM Symposium on Applied Computing, Salamanca, Spain: ACM, 2015. 865-870
[3] Muslea I, Minton S, Knoblock C A. Active learning with multiple views. Journal of Artificial Intelligence Research, 2006, 27(1): 203-233
[4] Zhang C Q, Adeli E, Wu Z W, Li G, Lin W L, Shen D G. Infant brain development prediction with latent partial multi-view representation learning. IEEE Transactions on Medical Imaging, 2018, 38(4): 909-918
[5] Bickel S, Scheffer T. Multi-view clustering. In: Proceedings of the 4th IEEE International Conference on Data Mining (ICDM '04), Brighton, UK: IEEE, 2004. 19-26
[6] Wang Y T, Chen L H. Multi-view fuzzy clustering with minimax optimization for effective clustering of data from multiple sources. Expert Systems with Applications, 2017, 72: 457-466
[7] Wang Jun, Wang Shi-Tong, Deng Zhao-Hong. Survey on challenges in clustering analysis research. Control and Decision, 2012, 27(3): 321-328 (in Chinese)
[8] Pedrycz W. Collaborative fuzzy clustering. Pattern Recognition Letters, 2002, 23(14): 1675-1686
[9] Cleuziou G, Exbrayat M, Martin L, Sublemontier J H. CoFKM: A centralized method for multiple-view clustering. In: Proceedings of the 9th IEEE International Conference on Data Mining, Miami, FL, USA: IEEE, 2009. 752-757
[10] Jiang Y Z, Chung F L, Wang S T, Deng Z H, Wang J, Qian P J. Collaborative fuzzy clustering from multiple weighted views. IEEE Transactions on Cybernetics, 2015, 45(4): 688-701
[11] Bettoumi S, Jlassi C, Arous N. Collaborative multi-view K-means clustering. Soft Computing, 2019, 23(3): 937-945
[12] Zhang G Y, Wang C D, Huang D, Zheng W S, Zhou Y R. TW-Co-K-means: Two-level weighted collaborative K-means for multi-view clustering. Knowledge-Based Systems, 2018, 150: 127-138
[13] Cao X C, Zhang C Q, Fu H Z, Liu S, Zhang H. Diversity-induced multi-view subspace clustering. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA: IEEE, 2015. 586-594
[14] Zhang C Q, Fu H Z, Liu S, Liu G C, Cao X C. Low-rank tensor constrained multiview subspace clustering. In: Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile: IEEE, 2015. 1582-1590
[15] Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011, 3(1): 1-122
[16] Liu G C, Lin Z C, Yan S C, Sun J, Yu Y, Ma Y. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence
5 Convergence curve of LR-MVEWFCM algorithm图 6 模拟数据集7上参数敏感性分析Fig. 6 Sensitivity analysis of parameters on simulated dataset 77 期张嘉旭等: 基于低秩约束的熵加权多视角模糊聚类算法1769。
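Tables 6 and 7 above compare the five clustering methods by NMI and RI. Both indices can be computed from two label assignments with only the standard library; the sketch below is an illustrative implementation (not the paper's code) of the plain Rand index and of NMI normalized by the geometric mean of the entropies.

```python
import math
from collections import Counter

def rand_index(labels_a, labels_b):
    # Fraction of point pairs on which the two clusterings agree
    # (same-cluster in both, or split in both).
    n = len(labels_a)
    agree = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b
            total += 1
    return agree / total

def nmi(labels_a, labels_b):
    # Normalized mutual information: I(A;B) / sqrt(H(A) * H(B)).
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    mi = sum((nab / n) * math.log(nab * n / (ca[a] * cb[b]))
             for (a, b), nab in cab.items())
    ha = -sum((c / n) * math.log(c / n) for c in ca.values())
    hb = -sum((c / n) * math.log(c / n) for c in cb.values())
    return mi / math.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0

true = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(round(rand_index(true, pred), 3))  # 0.667
print(round(nmi(true, pred), 3))         # 0.479
```

A perfect match gives 1.0 for both indices, which is why higher entries in Tables 6 and 7 indicate better agreement with the ground-truth partition.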
Basic Principles of the TextRank Algorithm: Overview and Explanation
1. Introduction

1.1 Overview

In the era of information explosion, people encounter large volumes of text every day: news reports, social-media comments, academic papers, and so on. Extracting the key information from this mass of text has become increasingly important. Keyword extraction and text summarization are two fundamental natural-language-processing tasks that help users quickly understand and browse textual content.

TextRank is an unsupervised, graph-based algorithm. It computes the importance of words or sentences by analyzing the relations between words in a text, and ranks them by that importance. The algorithm was first proposed by Mihalcea et al. in 2004 and is widely used in natural language processing.

1.2 Structure of this article

This article introduces the basic principles of the TextRank algorithm and explains in detail how it is applied to two tasks: keyword extraction and text summarization. We then walk through the implementation in three main steps: data preprocessing, construction of the word graph, and computation of node importance scores. In the fourth part we analyze the strengths and weaknesses of TextRank and discuss possible improvements. Finally, in the conclusion and outlook, we summarize the algorithm's main findings and contributions and consider its potential in future research directions and application scenarios.

1.3 Purpose

The purpose of this article is to examine in depth the application of TextRank in natural language processing. By explaining the algorithm's principles and implementation in detail, we hope readers will gain a complete picture of TextRank and a deeper appreciation of its effectiveness for tasks such as keyword extraction and text summarization. By analyzing its strengths and weaknesses and discussing possible improvements, we also hope to offer researchers in this area directions for further study. Ultimately, we hope this article stimulates thinking about natural-language-processing techniques and promotes development and innovation in related fields.

2. Basic principles of the TextRank algorithm

2.1 Keyword extraction

Keyword extraction is an important application of TextRank: it automatically extracts keywords from a text. TextRank computes the importance of candidate keywords from the co-occurrence relations of words or phrases in the text. First, the text is segmented into a set of words or phrases. Then an undirected weighted graph is built to represent the co-occurrence relations between these words or phrases.
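The procedure just described — segment the text, link words that co-occur within a window, then rank nodes with a PageRank-style iteration — can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the window size, damping factor, and iteration count are arbitrary choices.

```python
from collections import defaultdict

def textrank_keywords(words, window=2, d=0.85, iters=50):
    # Build an undirected co-occurrence graph: words appearing within
    # `window` positions of each other are linked.
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for v in words[i + 1:i + 1 + window]:
            if v != w:
                graph[w].add(v)
                graph[v].add(w)
    # PageRank-style update: a node's score is fed by its neighbors,
    # each neighbor's contribution divided by that neighbor's degree.
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[v] / len(graph[v]) for v in graph[w])
                 for w in graph}
    return sorted(score, key=score.get, reverse=True)

tokens = ["cat", "sat", "mat", "cat", "dog", "cat", "log"]
print(textrank_keywords(tokens)[0])  # "cat": the most connected word wins
```

In practice the token list would come from a word segmenter with stop words removed, and edges can carry co-occurrence counts as weights rather than being unweighted as here.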
Graduate Specialized Vocabulary (English–Chinese GIS glossary)
2-dimensional space3D mapabstractaccess dataAccessibilityaccuracyacquisitionad-hocadjacencyadventaerial photographsAge of dataagglomerationaggregateairborneAlbers Equal-Area Conic projection (ALBER alignalphabeticalphanumericalphanumericalalternativealternativealtitudeameliorateanalogue mapsancillaryANDannotationanomalousapexapproachappropriatearcarc snap tolerancearealAreal coverageARPA abbr.Advanced Research Projects Agen arrangementarrayartificial intelligenceArtificial Neural Networks (ANN) aspatialaspectassembleassociated attributeattributeattribute dataautocorrelationautomated scanningazimuthazimuthalbar chartbiasbinary encodingblock codingBoolean algebrabottombottom leftboundbreak linebufferbuilt-incamouflagecardinalcartesian coordinate system cartographycatchmentcellcensuscentroidcentroid-to-centroidCGI (Common Gateway Interface) chain codingchainscharged couple devices (ccd) children (node)choropleth mapclass librariesclassesclustercodecohesivelycoilcollinearcolumncompactcompasscompass bearingcomplete spatial randomness (CSR) componentcompositecomposite keysconcavityconcentricconceptual modelconceptuallyconduitConformalconformal projectionconic projectionconnectivityconservativeconsortiumcontainmentcontiguitycontinuouscontourcontour layercontrol pointsconventionconvertcorecorrelogramcorrespondencecorridorCostcost density fieldcost-benefit analysis (CBA)cost-effectivecouplingcovariancecoveragecoveragecriteriacriteriacriterioncross-hairscrosshatchcross-sectioncumbersomecustomizationcutcylindrical projectiondangledangle lengthdangling nodedash lineDATdata base management systems (DBMS) data combinationdata conversiondata definition language (DDL)data dictionarydata independencedata integritydata itemdata maintenancedata manipulationData manipulation and query language data miningdata modeldata representationdata tabledata typedatabasedateDBAdebris flowdebugdecadedecibeldecision analysisdecision makingdecomposededicateddeductiveDelaunay criterionDelaunay 
triangulationdelete(erase)delineatedemarcationdemographicdemonstratedenominatorDensity of observationderivativedetectabledevisediagonaldictatedigital elevation model (DEM)digital terrain model (DTM) digitizedigitizedigitizerdigitizing errorsdigitizing tablediscrepancydiscretediscretedisparitydispersiondisruptiondissecteddisseminatedissolvedistance decay functionDistributed Computingdividedomaindot chartdraftdragdrum scannersdummy nodedynamic modelingeasy-to-useecologyelicitingeliminateellipsoidellipticityelongationencapsulationencloseencodeentity relationship modelingentity tableentryenvisageepsilonequal area projectionequidistant projectionerraticerror detection & correctionError Maperror varianceessenceet al.EuclideanEuclidean 2-spaceexpected frequencies of occurrences explicitexponentialextendexternal and internal boundaries external tablefacetfacilityfacility managementfashionFAT (file allocation table)faultyfeaturefeaturefeedbackfidelityfieldfield investigationfield sports enthusiastfields modelfigurefile structurefillingfinenessfixed zoom infixed zoom outflat-bed scannerflexibilityforefrontframe-by framefreefrom nodefrom scratchfulfillfunction callsfuzzyFuzzy set theorygantrygenericgeocodinggeocomputationgeodesygeographic entitygeographic processgeographic referencegeographic spacegeographic/spatial information geographical featuresgeometricgeometric primitive geoprocessinggeoreferencegeo-relational geosciences geospatialgeo-spatial analysis geo-statisticalGiven that GNOMONIC projection grain tolerance graticulegrey scalegridhand-drawnhand-heldhandicaphandlehand-written header recordheftyheterogeneity heterogeneous heuristichierarchical hierarchicalhill shading homogeneoushosthouseholdshuehumichurdlehydrographyhyper-linkedi.e.Ideal Point Method identicalidentifiable identification identifyilluminateimageimpedanceimpedanceimplementimplementimplicationimplicitin excess of…in respect ofin terms ofin-betweeninbuiltinconsistencyincorporationindigenousinformation 
integration infrastructureinherentinheritanceinlandinstanceinstantiationintegerintegrateinteractioninteractiveinteractiveinternet protocol suite Internet interoperabilityinterpolateinterpolationinterrogateintersectintersectionIntersectionInterval Estimation Method intuitiveintuitiveinvariantinventoryinvertedirreconcilableirreversibleis adjacent tois completely withinis contained iniso-iso-linesisopleth mapiterativejunctionkeyframekrigingKriginglaglanduse categorylatitudelatitude coordinatelavalayerlayersleaseleast-cost path analysisleftlegendlegendlegendlength-metriclie inlightweightlikewiselimitationLine modelline segmentsLineage (=history)lineamentlinearline-followinglitho-unitlocal and wide area network logarithmiclogicallogicallongitudelongitude coordinatemacro languagemacro-like languagemacrosmainstreammanagerialmanual digitizingmany-to-one relationMap scalemarshalmaskmatricesmatrixmeasured frequencies of occurrences measurementmedialMercatorMercator projectionmergemergemeridiansmetadatameta-datametadatamethodologymetric spaceminimum cost pathmirrormis-representmixed pixelmodelingmodularmonochromaticmonolithicmonopolymorphologicalmosaicmovemoving averagemuiticriteria decision making (MCDM) multispectralmutually exclusivemyopicnadirnatureneatlynecessitatenestednetworknetwork analysisnetwork database structurenetwork modelnodenodenode snap tolerancenon-numerical (character)non-spatialnon-spatial dataNormal formsnorth arrowNOTnovicenumber of significant digit numeric charactersnumericalnumericalobject-based modelobjectiveobject-orientedobject-oriented databaseobstacleomni- a.on the basis ofOnline Analytical Processing (OLAP) on-screen digitizingoperandoperatoroptimization algorithmORorderorganizational schemeoriginorthogonalORTHOGRAPHIC projectionortho-imageout ofoutcomeoutgrowthoutsetovaloverdueoverheadoverlapoverlayoverlay operationovershootovershootspackagepairwisepanpanelparadigmparent (node)patchpath findingpatternpatternpattern 
recognitionperceptionperspectivepertain phenomenological photogrammetric photogrammetryphysical relationships pie chartpilotpitpixelplanarplanar Euclidean space planar projection platformplotterplotterplottingplug-inpocketpoint entitiespointerpoint-modepointspolar coordinates polishingpolygonpolylinepolymorphism precautionsprecisionpre-designed predeterminepreferences pregeographic space Primary and Foreign keys primary keyprocess-orientedprofileprogramming tools projectionprojectionproprietaryprototypeproximalProximitypseudo nodepseudo-bufferpuckpuckpuckPythagorasquadquadrantquadtreequadtree tessellationqualifyqualitativequantitativequantitativequantizequasi-metricradar imageradii bufferrangelandrank order aggregation method ranking methodrasterRaster data modelraster scannerRaster Spatial Data Modelrating methodrational database structureready-madeready-to-runreal-timerecordrecreationrectangular coordinates rectificationredundantreference gridreflexivereflexive nearest neighbors (RNN) regimeregisterregular patternrelationrelationalrelational algebra operators relational databaseRelational joinsrelational model relevancereliefreliefremarkremote sensingremote sensingremote sensingremotely-sensed repositoryreproducible resemblanceresembleresemplingreshaperesideresizeresolutionresolutionrespondentretrievalretrievalretrievalretrieveridgerightrobustrootRoot Mean Square (RMS) rotateroundaboutroundingrowrow and column number run-length codingrun-length encoded saddle pointsalientsamplesanitarysatellite imagesscalablescalescanscannerscannerscannerscarcescarcityscenarioschemascriptscrubsecurityselectselectionself-descriptiveself-documentedsemanticsemanticsemi-automatedsemi-major axessemi-metricsemi-minor axessemivariancesemi-variogram modelsemi-varogramsensorsequencesetshiftsillsimultaneous equations simultaneouslysinusoidalskeletonslide-show-stylesliverslope angleslope aspectslope convexitysnapsnapsocio-demographic socioeconomicspagettiSpatial Autocorrelation Function 
spatial correlationspatial dataspatial data model for GIS spatial databaseSpatial Decision Support Systems spatial dependencespatial entityspatial modelspatial relationshipspatial relationshipsspatial statisticsspatial-temporalspecificspectralspherical spacespheroidsplined textsplitstakeholdersstand alonestandard errorstandard operationsstate-of-the-artstaticSTEREOGRAPHIC projection STEREOGRAPHIC projection stereoplotterstorage spacestovepipestratifiedstream-modestrideStructured Query Language(SQL) strung outsubdivisionsubroutinesubtractionsuitesupercedesuperimposesurrogatesurveysurveysurveying field data susceptiblesymbolsymbolsymmetrytaggingtailoredtake into account of … tangencytapetastefullyTelnettentativeterminologyterraceterritorytessellatedtextureThe Equidistant Conic projection (EQUIDIS The Lambert Conic Conformal projection (L thematicthematic mapthemeThiessen mapthird-partythresholdthroughputthrust faulttictiertiletime-consumingto nodetolerancetonetopographic maptopographytopologicaltopological dimensiontopological objectstopological structuretopologically structured data set topologytopologytrade offtrade-offTransaction Processing Systems (TPS) transformationtransposetremendousTriangulated Irregular Network (TIN) trimtrue-direction projectiontupleunbiasednessuncertaintyunchartedundershootsunionunionupupdateupper- mosturban renewaluser-friendlyutilityutility functionvaguevalidityvarianceVariogramvectorvector spatial data model vendorverbalversusvertexvetorizationviablevice versavice versaview of databaseview-onlyvirtualvirtual realityvisibility analysisvisualvisualizationvitalVoronoi Tesselationvrticeswatershedweedweed toleranceweighted summation method whilstwithin a distance ofXORzoom inzoom out三维地图摘要,提取,抽象访问数据可获取性准确,准确度 (与真值的接近程度)获得,获得物,取得特别邻接性出现,到来航片数据年龄聚集聚集,集合空运的, (源自)航空的,空中的艾伯特等面积圆锥投影匹配,调准,校直字母的字母数字的字母数字混合编制的替换方案替代的海拔,高度改善,改良,改进模拟地图,这里指纸质地图辅助的和注解不规则的,异常的顶点方法适合于…弧段弧捕捉容限来自一个地区的、 面状的面状覆盖范围(美国国防部)高级研究计划署排列,布置数组,阵列人工智能人工神经网络非空间的方面, 方向, 方位, 
相位,面貌采集,获取关联属性属性属性数据自动扫描方位角,方位,地平经度方位角的条状图偏差二进制编码分块编码布尔代数下左下角给…划界断裂线缓冲区分析内置的伪装主要的,重要的,基本的笛卡儿坐标系制图、制图学流域,集水区像元,单元人口普查质心质心到质心的公共网关接口链式编码链电荷耦合器件子节点地区分布图类库类群编码内聚地线圈在同一直线上的列压缩、压紧罗盘, 圆规, 范围 v.包围方位角完全空间随机性组成部分复合的、混合的复合码凹度,凹陷同心的概念模型概念上地管道,导管,沟渠,泉水,喷泉保形(保角)的等角投影圆锥投影连通性保守的,守旧的社团,协会,联盟包含关系相邻性连续的轮廓,等高线,等值线等高线层控制点习俗,惯例,公约,协定转换核心相关图符合,对应走廊, 通路费用花费密度域,路径权值成本效益分析有成本效益的,划算的结合协方差面层,图层覆盖,覆盖范围标准,要求标准,判据,条件标准,判据,条件十字丝以交叉线作出阴影截面麻烦的用户定制剪切圆柱投影悬挂悬挂长度悬挂的节点点划线数据文件的扩展名数据库管理系统数据合并数据变换数据定义语言数据字典与数据的无关数据的完整性数据项数据维护数据操作数据操作和查询语言数据挖掘数据模型数据表示法数据表数据类型数据库日期数据库管理员泥石流调试十年,十,十年期分贝决策分析决策,判定分解专用的推论的,演绎的狄拉尼准则狄拉尼三角形删除描绘划分人口统计学的说明分母,命名者观测密度引出的,派生的可察觉的发明,想出对角线的,斜的要求数字高程模型数字地形模型数字化数字化数字化仪数字化误差数字化板,数字化桌差异,矛盾不连续的,离散的不连续的,离散的不一致性分散,离差中断,分裂,瓦解,破坏切开的,分割的发散,发布分解距离衰减函数分布式计算分割域点状图草稿,起草拖拽滚筒式扫描仪伪节点动态建模容易使用的生态学导出消除椭球椭圆率伸长包装,封装围绕编码实体关系建模实体表进入,登记想像,设想,正视,面对希腊文的第五个字母ε等积投影等距投影不稳定的误差检查和修正误差图误差离散,误差方差本质,本体,精华以及其他人,等人欧几里得的,欧几里得几何学的欧几里得二维空间期望发生频率明显的指数的延伸内外边界外部表格(多面体的)面工具设备管理样子,方式文件分配表有过失的,不完善的(地理)要素,特征要素反馈诚实,逼真度,重现精度字段现场调查户外运动发烧友场模型外形, 数字,文件结构填充精细度以固定比例放大以固定比例缩小平板式扫描仪弹性,适应性,机动性,挠性最前沿逐帧无…的起始节点从底层完成,实现函数调用模糊的模糊集合论构台,桶架, 跨轨信号架通用的地理编码地理计算大地测量地理实体地理(数据处理)过程地理参考地理空间地理信息,空间信息地理要素几何的,几何学的几何图元地理(数据)处理过程地理坐标参考地理关系的地球科学地理空间的地学空间分析地质统计学的假设心射切面投影颗粒容差地图网格灰度栅格,格网手绘的手持的障碍,难点处置、处理手写的头记录重的,强健的异质性异构的启发式的层次层次的山坡(体)阴影图均匀的、均质的主机家庭色调腐植的困难,阻碍水文地理学超链接的即,换言之,也就是理想点法相同的可识别的、标识识别阐明图像,影像全电阻,阻抗阻抗实现,履行履行,实现牵连,暗示隐含的超过…关于根据…在中间的嵌入的,内藏的不一致性,矛盾性结合,组成公司(或社团)内在的,本土的信息集成基础设施固有的继承,遗传, 遗产内陆的实例,例子实例,个例化整数综合,结合相互作用交互式的交互式的协议组互操作性内插插值询问相交交集、逻辑的乘交区间估值法直觉的直觉的不变量存储,存量反向的,倒转的,倒置的互相对立的不能撤回的,不能取消的相邻完全包含于包含于相等的,相同的线族等值线图迭代的接合,汇接点主帧克里金内插法克里金法标签,标记间隙,迟滞量土地利用类别纬度 (B)纬度坐标熔岩,火山岩图层图层出租,租用最佳路径分析左图例图例图例长度量测在于小型的同样地限制,限度,局限线模型线段谱系,来源容貌,线性构造线性的,长度的,直线的线跟踪的岩性单元局域和广域网对数的逻辑的逻辑的经度 (L)经度坐标宏语言类宏语言宏主流管理人的, 管理的手工数字化多对一的关系地图比例尺排列,集合掩膜matrix 的复数矩阵实测发生频率量测中间的合并墨卡托墨卡托投影法合并合并,融合子午线元数据元数据,也可写为 metadata元数据方法学,方法论度量空间最佳路径镜像错误表示混合像素建模模块化的单色的,单频整体的垄断, 专利权, 专卖形态学镶嵌, 
镶嵌体移动移动平均数多准则决策分析多谱线的,多谱段的相互排斥的短视,没有远见的最低点,天底,深渊,最底点本性,性质整洁地成为必要嵌套的、巢状的网络网络分析网状数据库结构网络模型节点节点节点捕捉容限非数值的(字符)非空间的非空间数据范式指北针非新手,初学者有效位数数字字符数值的数值的基于对象的模型客观的,目标的面向对象的模型面向对象的数据库阻碍全能的,全部的以…为基础在线分析处理屏幕数字化运算对象,操作数算子,算符,操作人员优化算法或次,次序组织方案原点,起源,由来直角的,直交的正射投影正射影像缺少结果长出,派出,结果,副产物开头 ,开端卵形的,椭圆形的迟到的管理费用重叠,叠加叠加叠置运算超出过头线软件包成对(双)地,两个两个地平移面,板范例、父节点补钉,碎片,斑点路径搜索图案式样,图案, 模式模式识别感觉,概念,理解力透视图从属, 有关, 适合现象学的,现象的摄影测量的摄影测量物理关系饼图导航洼坑象素平面的平面欧几里得空间平面投影平台绘图仪绘图仪绘图插件便携式,袖珍式,小型的点实体指针点方式点数,分数极坐标抛光多边形多义线,折线多形性,多态现象预防措施精确, 精度(多次测量结果之间的敛散程度) 预定义的,预设计的预定、预先偏好先地理空间主外键主码面向处理的纵剖面、轮廓编程工具投影投影所有权,业主原型,典型最接近的,近侧的接近性假的, 伪的伪节点缓冲区查询(数字化仪)鼠标数字化鼠标鼠标毕达哥拉斯方庭,四方院子象限,四分仪四叉树四叉树方格限定,使合格定性的量的定量的、数量的使量子化准量测雷达影像以固定半径建立缓冲区牧场,放牧地等级次序集合法等级评定法栅格栅格数据模型栅格扫描仪栅格空间数据模型分数评定法关系数据结构现成的随需随运行的实时记录娱乐平面坐标纠正多余的,过剩的, 冗余的参考网格自反的自反最近邻体制,状态,方式配准规则模式关系关系关系代数运算符关系数据库关系连接中肯,关联,适宜,适当地势起伏,减轻地势的起伏评论,谈论,谈到遥感遥感遥感遥感的知识库可再产生的相似,相似性,相貌相似类似,像重取样调整形状居住, 驻扎调整大小分辨率分辨率回答者,提取检索检索检索高压脊右稳健的根部均方根旋转迂回的舍入的、凑整的行行和列的编号游程长度编码行程编码鞍点显著的,突出的,跳跃的,凸出的样品, 标本, 样本卫生状况卫星影像可升级的比例尺扫描扫描仪扫描仪扫描仪缺乏,不足情节模式脚本,过程(文件)灌木安全, 安全性选择选择自定义的自编程的语义的,语义学的语义的,语义学的半自动化长半轴半量测短半轴半方差半变差模型半变差图传感器次序集合、集、组改变, 移动基石,岩床联立方程同时地正弦的骨骼,骨架滑动显示模式裂片坡度坡向坡的凸凹性咬合捕捉社会人口统计学的社会经济学的意大利面条自相关函数空间相互关系空间数据GIS的空间数据模型 空间数据库空间决策支持系统空间依赖性空间实体空间模型空间关系空间关系空间统计时空的具体的,特殊的光谱的球空间球状体,回转椭圆体曲线排列文字分割股票持有者单机标准误差,均方差标准操作最新的静态的极射赤面投影极射赤面投影立体测图仪存储空间火炉的烟囱形成阶层的流方式步幅,进展,进步结构化查询语言被串起的细分,再分子程序相减组, 套件,程序组,代替,取代叠加,叠印代理,代用品,代理人测量测量,测量学野外测量数据免受...... 影响的(地图)符号符号,记号对称性给...... 贴上标签剪裁讲究的考虑…接触,相切胶带、带子风流地,高雅地远程登录试验性的术语台地,露台领域,领地,地区棋盘格的,镶嵌的花样的纹理等距圆锥投影兰伯特保形圆锥射影专题的专题图主题,图层泰森图第三方的阈值生产量,生产能力,吞吐量逆冲断层地理控制点等级,一排,一层,平铺费时间的终止节点允许(误差)、容差、容限、限差色调地形图地形学拓扑的拓扑维数拓扑对象拓扑结构建立了拓扑结构的数据集拓扑关系拓扑交替换位,交替使用,卖掉交换,协定,交易事务处理系统变换,转换转置,颠倒顺序巨大的不规则三角网修整真方向投影元组不偏性不确定性海图上未标明的,未知的欠头线合并并集、逻辑的和上升级最上面的城市改造用户友好的效用, 实用,公用事业效用函数含糊的效力,正确,有效性方差,变差变量(变化记录)图矢量矢量空间数据模型经销商言语的, 动词的对,与…相对顶点 (单数)矢量化可实行的,可行的反之亦然反之亦然数据库的表示只读的虚拟的虚拟现实通视性分析视觉的可视化,使看得见的重大的沃伦网格顶点(复数)分水岭杂草,野草 v.除草,铲除清除容限度加权求和法同时在 ...... 距离内异或放大缩小。
Semantic Orientation Computation of Words Based on HowNet
Semantic Orientation Computation of Words Based on HowNet
Authors: ZHU Yan-lan, MIN Jin, ZHOU Ya-qian, HUANG Xuan-jing, WU Li-de
Affiliation: Department of Computer Science and Engineering, Fudan University, Shanghai 200433
Journal: Journal of Chinese Information Processing, 2006, 20(1). Cited 119 times.

References (9):
1. Vasileios Hatzivassiloglou, Kathleen R. McKeown. Predicting the semantic orientation of adjectives. 1997.
2. Peter Turney, Michael Littman. Measuring praise and criticism: Inference of semantic orientation from association. 2003(04).
3. Peter Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. 2002.
4. Bo Pang, Lillian Lee, Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. 2002.
5. Bo Pang, Lillian Lee. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. 2005.
6. K. Dave, S. Lawrence, D. M. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. 2003.
7. Bing Liu, Minqing Hu, Junsheng Cheng. Opinion observer: analyzing and comparing opinions on the Web. 2005.
8. HowNet. HowNet's Home Page.
9. Liu Qun, Li Sujian. Word similarity computation based on HowNet. 2002.
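The approach this line of work builds on — scoring a word's polarity by its similarity to positive and negative seed words — can be illustrated with a toy sketch. The `SEMEMES` table below is a hypothetical stand-in for HowNet's sememe definitions, and Jaccard overlap is a crude proxy for HowNet's similarity measure; only the seed-comparison scheme itself reflects the method.

```python
# Hypothetical sememe sets standing in for HowNet word definitions.
SEMEMES = {
    "excellent": {"good", "degree"},
    "awful": {"bad", "degree"},
    "fine": {"good"},
    "terrible": {"bad", "fear"},
}

def similarity(w1, w2):
    # Jaccard overlap of sememe sets: a crude stand-in for
    # HowNet's sememe-based word similarity.
    a, b = SEMEMES[w1], SEMEMES[w2]
    return len(a & b) / len(a | b)

def orientation(word, pos_seeds=("excellent",), neg_seeds=("awful",)):
    # Positive result => positive polarity; negative => negative polarity.
    pos = sum(similarity(word, s) for s in pos_seeds) / len(pos_seeds)
    neg = sum(similarity(word, s) for s in neg_seeds) / len(neg_seeds)
    return pos - neg

print(orientation("fine") > 0)      # True
print(orientation("terrible") < 0)  # True
```

With a real HowNet resource, `similarity` would be replaced by a sememe-tree similarity and the seed sets would contain many benchmark words, but the decision rule stays the same.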
Progress in Preparation and Application of Magnetic Polymer Microspheres
Chemical Industry and Engineering Progress (化工进展), 2017, Vol. 36, No. 8, p. 2971
CLC number: O614; Document code: A; Article ID: 1000-6613(2017)08-2971-07; DOI: 10.16085/j.issn.1000-6613.2016-2197

Progress in preparation and application of magnetic polymer microspheres
WANG Yechen, QUAN Weilei, ZHANG Jinmin, SHEN Junhai, LI Liangchao (Key Laboratory of the Ministry of Education for Advanced Catalysis Materials, Department of Chemistry, Zhejiang Normal University, Jinhua 321004, Zhejiang, China)

Abstract: Magnetic polymer microspheres are composites of magnetic particles and polymer. They have received considerable attention because of their particular structure and superior properties, and have found extensive application as drug carriers and in biological engineering, industrial catalysis, and many other fields. This paper summarizes the structural types and preparation methods of magnetic polymer microspheres and discusses their advantages and disadvantages. Their applications in biological medicine, industrial catalysis, and electromagnetic-wave absorption and shielding, together with the latest progress of recent years, are reviewed. Furthermore, the current problems and future prospects of magnetic polymer microspheres in biomedical research are summarized, including: ① developing advanced preparation methods; ② improving biocompatibility and specificity toward diseased cells through surface modification of the magnetic particles and design of the polymer groups; ③ achieving strong magnetic targeting, high drug utilization, and controllable slow release in the target area by studying the interface interaction and mechanism between drug and carrier; ④ studying how the drug carrier changes during delivery, how the human-body environment affects drug delivery, and the mechanism of action on diseased cells.

Key words: polymer; microspheres; magnetism; composite; synthesis methods

First author: WANG Yechen (born 1996), female, undergraduate student.
SEMANTIC PAGE SEGMENTATION OF VECTOR GRAPHICS DOCUMENTS
Patent title: SEMANTIC PAGE SEGMENTATION OF VECTOR GRAPHICS DOCUMENTS
Inventors: Xiao Yang, Paul Asente, Mehmet Ersin Yumer
Application No.: US15656269, filed 2017-07-21
Publication No.: US20190026550A1, published 2019-01-24
Abstract: Disclosed systems and methods categorize text regions of an electronic document into document object types based on a combination of semantic information and appearance information from the electronic document. A page segmentation application executing on a computing device accesses textual feature representations that represent text portions in a vector space, where a set of pixels from the page is mapped to a textual feature representation. The page segmentation application generates a visual feature representation, which corresponds to the appearance of a document portion including the set of pixels, by applying a neural network to the page of the electronic document. It then generates an output page segmentation of the electronic document by applying the neural network to the textual feature representation and the visual feature representation.
Applicant: Adobe Systems Incorporated, San Jose, CA, US
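The patent's core idea — combine a per-pixel textual feature map with a visual feature map and classify each pixel into a document-object type — can be sketched with NumPy. Everything here (the shapes, channel counts, and the random, untrained linear classifier) is illustrative and hypothetical, not the patented network.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 4  # tiny page for illustration

# Hypothetical per-pixel feature maps: a text-embedding map (e.g. from a
# word-embedding lookup over the page) and a visual map (e.g. CNN activations).
text_feats = rng.normal(size=(H, W, 8))
visual_feats = rng.normal(size=(H, W, 16))

# Concatenate along the channel axis and apply a per-pixel linear
# classifier over document-object classes (random weights, untrained).
combined = np.concatenate([text_feats, visual_feats], axis=-1)  # (H, W, 24)
n_classes = 5
weights = rng.normal(size=(24, n_classes))
logits = combined @ weights            # (H, W, n_classes)
segmentation = logits.argmax(axis=-1)  # (H, W) class id per pixel

print(combined.shape, segmentation.shape)
```

In the actual system both feature maps and the classifier would be produced by a trained neural network; the sketch only shows how the two modalities are fused into one per-pixel decision.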
References on OCR Technology
关于ocr技术的参考文献关于OCR(光学字符识别)技术的参考文献有很多,我将从不同角度为你介绍一些相关的文献。
首先,如果你对OCR技术的基本原理和发展历史感兴趣,可以参考以下文献:T. M. Breuel的《The OCR Revolution: A Perspective》。
D. Doermann的《The History of OCR》。
S. N. Srihari的《Historical Perspectives on Document Image Analysis and Recognition》。
其次,如果你想了解OCR技术在特定领域的应用,可以参考以下文献:M. Blumenstein和Y. Kong的《Handwriting Recognition: Technologies and Applications》。
K. Roy的《Optical Character Recognition for Indic Scripts》。
S. Marinai和A. Gori的《Handwriting Recognition and Document Analysis: A Comprehensive Reference》。
另外,如果你对OCR技术的最新研究和发展趋势感兴趣,可以参考以下文献:Y. LeCun、L. Bottou、Y. Bengio和P. Haffner的《Gradient-Based Learning Applied to Document Recognition》。
A. Graves、M. Liwicki、S. Fernández和R. Bertolami的《Handwriting Recognition with Large Multilayer Networks》。
D. Karatzas、F. Shafait、S. Uchida、M. Iwamura和L. G.i Bigorda的《ICDAR 2013 Robust Reading Competition》。
A Short-Text Similarity Computation Method with Multiple-Test Weighted Fusion

Text similarity computation has attracted great attention [2]. Current methods fall into three classes. The first is string-based methods, such as the N-gram [3] and Jaccard [4] algorithms, which compute similarity from the number of words two texts share. The second is corpus-based methods, such as VSM [6] and LSA [7], which ignore word order, syntactic structure, and related cues. The third is deep-learning methods, such as DSSM [8], a deep-learning semantic matching model, and Word2vec [9] and GloVe [10], which compute similarity from word vectors produced by neural networks; reference [11] builds on a CNN and introduces a multi-attention mechanism.

Building on an analysis of the traditional similarity methods, this paper computes similarity with deep-learning-based methods, screens the similarity values against a threshold, and fuses an improved Damerau-Levenshtein distance algorithm, a semantic-similarity algorithm that accounts for word frequency, and an algorithm based on …

With 0 ≤ j ≤ n, Eq. (1) computes the edit distance between two strings. Because the raw edit distance is not itself a criterion for judging similarity, this paper proposes DLR (Damerau-Levenshtein-Ratio), which converts the edit distance between two texts into a ratio; Eq. (2) computes the DLR as the similarity between the two texts.

Funding: China Postdoctoral Science Foundation (2017M613216); Natural Science Foundation of Shaanxi Province (2017JM6059); Key R&D Program of Shaanxi Province (2019ZDLNY07); Shaanxi Province …
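A minimal sketch of the restricted Damerau-Levenshtein distance and its ratio form follows. The `dlr` function converts the edit distance into a similarity in [0, 1] by normalizing with the longer string's length, in the spirit of the DLR of Eq. (2); the paper's exact normalization may differ.

```python
def damerau_levenshtein(s, t):
    # Edit distance allowing insertion, deletion, substitution, and
    # adjacent transposition (restricted Damerau-Levenshtein).
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def dlr(s, t):
    # Ratio form: 1.0 means identical, 0.0 means maximally different.
    if not s and not t:
        return 1.0
    return 1.0 - damerau_levenshtein(s, t) / max(len(s), len(t))

print(damerau_levenshtein("abcd", "acbd"))  # 1 (a single transposition)
print(round(dlr("abcd", "acbd"), 2))        # 0.75
```

Treating "bc" → "cb" as one operation rather than two is what distinguishes Damerau-Levenshtein from plain Levenshtein distance, which matters for typo-heavy short texts.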
A Keyword Extraction Method Based on Word Vectors and TextRank
ZHOU Jinzhang, CUI Xiaohui
Journal: Application Research of Computers (计算机应用研究)
Year, volume (issue): 2019, 36(4), pp. 1051-1054

Abstract: This paper studies the effect of semantic differences between words on the TextRank algorithm and proposes a keyword extraction method based on word vectors and TextRank. FastText is used to represent the document collection as word vectors. Drawing on the idea of latent topic distributions and exploiting the semantic differences between words, a transition probability matrix for TextRank is constructed; the word graph is then computed iteratively and keywords are extracted. Experimental results show that the extraction performance clearly improves on traditional methods, and demonstrate that word vectors are a simple and effective way to improve the TextRank algorithm.

Keywords: extraction; semantic difference; TextRank; word vector; latent topic distribution
Affiliation: School of Cyber Science and Engineering, Wuhan University, Wuhan 430072
Language: Chinese
CLC: TP391; TP301.6

Related literature:
1. Automatic keyword extraction from Chinese text based on word span. XIE Jin.
2. A Chinese keyword extraction method based on synonyms. WANG Yongliang; GUO Qiao; CAO Qimin.
3. Keywords — the highlight of college-entrance-exam questions: how to extract keywords accurately. WEN Jianhua.
4. A keyword extraction algorithm based on compound words and synonym sets. JIANG Changjin; PENG Hong; CHEN Jianchao; MA Qianli.
An Improved Lucene Ranking Algorithm Based on Segmentation Distance
Xu Maojun; Wang Hong
Journal of Shandong Normal University (Natural Science), 2016, 31(1), pp. 66-72
Abstract: The ranking algorithm is the core component of the full-text search engine Lucene. Lucene's built-in ranking considers only the frequency of query terms in a document and ignores their distance features in the document. To address this deficiency, a sentence similarity calculation model based on segmentation distance is proposed to improve Lucene's scoring mechanism. First, the query string and the document are preprocessed. Then, by marking the positions of "keywords" and "query terms" in the document, the segmentation distance between query terms and keywords is computed, from which a similarity score between the query string and the whole document is derived. Finally, the proposed model is fused into Lucene's default similarity scoring, and the result is evaluated with indicators such as MAP and P@n.
Authors' affiliation: School of Information Science and Engineering, Shandong Normal University, Jinan 250014; Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan 250014 (both authors). Language: Chinese. CLC classification: TP301.
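The distance-based scoring the abstract describes might look roughly like the sketch below. All names are illustrative: the paper's actual model, and how it is fused with Lucene's scorer, is not specified in this excerpt, so the inverse-distance weighting here is an assumption.

```python
def distance_similarity(doc_tokens, query_terms, keywords):
    """Score a document by how close query terms appear to the document's
    keywords, so that raw term frequency alone does not dominate.
    A hypothetical sketch of the segmentation-distance idea."""
    positions = {}
    for idx, tok in enumerate(doc_tokens):
        positions.setdefault(tok, []).append(idx)
    score = 0.0
    for q in query_terms:
        for k in keywords:
            for pq in positions.get(q, []):
                for pk in positions.get(k, []):
                    if pq != pk:
                        score += 1.0 / abs(pq - pk)  # closer pairs weigh more
    return score
```

A score like this could then be combined with Lucene's frequency-based score, for example as a weighted sum, which mirrors the "fusion" step in the abstract.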
Key Points in Lexicology

Unit One: What is Lexicology?
Lexicology: a branch of linguistics dealing with the vocabulary of a language and the properties of words as the main units of language.
Word: the basic unit of speech and a minimal free form that has a given sound, meaning, and grammatical function. The relationship between sound and meaning is conventional, because people of the same speech community have agreed to use that cluster of sounds for such a referent.
Classification: words may fall into the basic word stock vs. non-basic words (by frequency of use); content words vs. functional words (by notion); and native vs. borrowed words (by origin).
Basic words are stable and indispensable. Characteristics: (1) all-national character; (2) stability; (3) productivity; (4) polysemy; (5) collocability.
Non-basic words: terminology (术语), slang (俚语), jargon (行话), argot (隐语), dialectal words (方言), archaisms (古词), neologisms (新词).
Functional words: prepositions, conjunctions, auxiliaries, articles, and the like; they have no notion of their own.
Content (notional) words constitute the main body of the English vocabulary: nouns, verbs, adjectives, adverbs, and numerals.
Native words: Anglo-Saxon words; small in number, the core of the language, neutral in style, frequent in use.
Borrowed words: words taken over from other languages, e.g. chaos, dogma, drama, pneumonia (Greek); hymn, pope, martyr, monk, anthem, shrine, creed (Old English period); cradle, bald, slogan, flannel, down (Celtic); balcony, corridor, attack, cannon, opera (Italian).
Vocabulary: all the words in a language taken together; all the items in a dictionary.

Exercises:
1) Which of the following is not true? (Answer: a)
a. A word is the smallest form of a language.
b. A word is a sound unity.
c. A word has a given meaning.
d. A word can be used freely in a sentence.
2) The differences between sound and form are due to (Answer: d)
a. the fact that English has more phonemes than letters
b. stabilization of spelling by printing
c. the influence of the work of scribes
d. innovations made by linguists
3) Complete the following sentences:
a. There is no intrinsic relationship between sound and meaning; the connection between them is arbitrary and conventional.
b. Content words are changing all the time, whereas functional words are more stable. Functional words enjoy a higher frequency in use than content words.

Unit Two: Word Formation
The expansion of vocabulary in modern English depends chiefly on word-formation.
Component Retrieval Combining Facet and Specification Descriptions
Chen Duying; Liu Shaotao
Journal of Huaqiao University (Natural Science), 2012, 33(5), pp. 513-517
Abstract: An analysis of retrieval methods based on facet classification and on component specification syntax shows that faceted classification and retrieval focus on a component's static characteristics and do not consider its behavioral description, while the specification syntax description emphasizes component behavior, with the specification describing the syntax of each function in the component independently. A software component facet retrieval algorithm based on string matching is designed; and since facet retrieval does not support the requirement of assembly, a specification syntax matching step based on the component specification is inserted into the retrieval process, providing more accurate component assembly information and improving reuse efficiency.
Authors' affiliation: College of Computer Science and Technology, Huaqiao University, Xiamen 361021, Fujian (both authors). Language: Chinese. CLC classification: TP311.13.
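A minimal sketch of the two-stage retrieval described above: facet string matching first, then a specification check. The data layout (a `dict` of facet term lists per component) and the predicate standing in for specification syntax matching are assumptions for illustration, not the paper's data model.

```python
def facet_match(component, query_facets):
    # Fraction of query facet terms matched by the component's facet
    # description, via simple case-insensitive string matching.
    hits = 0
    for facet, term in query_facets.items():
        if term.lower() in (v.lower() for v in component.get(facet, [])):
            hits += 1
    return hits / len(query_facets) if query_facets else 0.0

def retrieve(components, query_facets, spec_predicate, threshold=0.5):
    # Stage 1: keep components whose facets match the query well enough.
    candidates = [c for c in components
                  if facet_match(c, query_facets) >= threshold]
    # Stage 2: filter by a specification check (here an arbitrary
    # predicate standing in for specification syntax matching).
    return [c for c in candidates if spec_predicate(c)]
```

The second stage is what supplies the assembly-oriented information the abstract says pure facet retrieval lacks.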
A Lucene Similarity Scoring Algorithm Incorporating Term Position Features
BAI Peifa (1), WANG Chengliang (1,2), XU Ling (2)
1. College of Computer Science, Chongqing University, Chongqing 400030, China
2. College of Software Engineering, Chongqing University, Chongqing 400030, China
Citation: BAI Peifa, WANG Chengliang, XU Ling. Scoring algorithm of similarity based on terms' position feature combination for Lucene. Computer Engineering and Applications, 2014, 50(2): 129-132.
Abstract: The similarity scoring algorithm is one of the core parts of Lucene. After analysing and researching Lucene's default similarity scoring algorithm, this paper proposes an improved algorithm aimed at the deficiency that the default algorithm considers only the frequencies, not the positions, of query term occurrences. The improved algorithm combines the feature of term position relationships with Lucene's default similarity scoring. Experiments on the TREC dataset show that the improved algorithm increases the evaluation metrics MAP and P@n to a certain extent.
Keywords: Lucene; similarity; full-text search
... a subproject, an open-source full-text retrieval engine toolkit implemented in Java. Thanks to its open-source nature, excellent index structure, high performance, scalability, portability, and ease of use, Lucene is widely used to build concrete full-text retrieval and Web applications and is integrated into many software systems, for example the search facility of IBM's open-source Eclipse. Nevertheless, Lucene has shortcomings. One of them is that retrieval results are often unsatisfactory, which is attributable to Lucene's internal similarity scoring algorithm. Research on improving the quality of Lucene document retrieval has therefore never stopped. Reference [2] ... [text truncated]
An Automatic Term Recognition Method Based on Rank Aggregation

[Table residue: a feature comparison of ATR algorithms (word features, domain measures, cohesion, preprocessing, background corpus; yes/no entries); the individual entries are not recoverable from the extracted text.]

Common ATR algorithms that exploit different statistical information include TF-IDF, C-value, Weirdness, and GlossEx. Most of these algorithms focus on some statistical property of a candidate term in the domain corpus (or a background corpus), use it to judge the possibility that the candidate is a genuine term, and thereby produce a ranking of candidate terms. However, different ... [text truncated]

In the preprocessing stage, linguistic tools such as a part-of-speech tagger and a noun-phrase chunker are used to process the text corpus and extract the set of candidate terms; term-variant recognition can then use a term's root form to obtain its related variants. The second step, statistics-based term recognition, in short, uses statistical information to assign each candidate term a weight and outputs the candidates with the highest weights.

English abstract (reconstructed from garbled text): Existing ATR algorithms each use statistical characteristics of only certain aspects of candidate terms, so they have notable limitations. This paper introduces local Kemeny optimization to deal with the ATR issue and presents a new rank aggregation method. Experimental results show that the method significantly improves the accuracy of automatic term recognition.
Keywords: rank aggregation; automatic term recognition; text mining; information extraction
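As a concrete illustration of aggregating the rankings produced by different ATR statistics, here is a Borda-count sketch. The paper's own aggregation method cannot be recovered from the garbled text, so this generic scheme is a stand-in, not the authors' algorithm.

```python
def borda_aggregate(rankings):
    """Borda-count rank aggregation: each input ranking awards a candidate
    (len(ranking) - position) points; candidates are re-ranked by the
    total across all rankings."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, term in enumerate(ranking):
            scores[term] = scores.get(term, 0) + (n - pos)
    return sorted(scores, key=lambda t: -scores[t])
```

For instance, three candidate-term rankings produced by TF-IDF, C-value, and Weirdness could each be passed in as a list, and the output would be a single consensus ordering.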
Semantic-based Composite Document Ranking
Chunchen Liu, Jianqiang Li
NEC Laboratories China, Beijing, China
{liu_chunchen, li_jianqiang}@

Abstract: Traditional information retrieval techniques mainly employ statistics of words in document text and/or the link structures of document sets to rank, and have been used successfully in global web search. However, they produce unsatisfactory results for enterprise search (ES), because ES is very different from Web search. This paper proposes a novel ranking approach fitting the ES environment. With the support of an ontology describing prior knowledge about the target domain, we first mine semantic information (concepts and the relations between them) from queries (documents) to understand query intentions (document contents) and exploit them for evaluating query-document relevance; then semantic linkages between documents are built and consumed for evaluating document importance; finally, the above two evaluations are integrated to produce the final ranking list. Experiments show that our approach results in significant improvements over existing solutions.

Keywords: document ranking; semantic information; enterprise search; enterprise information retrieval

I. INTRODUCTION

Enterprise search (ES) has grown rapidly in recent years with the explosive growth of enterprise data. However, when the traditional rank models are employed in the ES setting, it soon becomes clear that they perform poorly, for the following reasons. (1) The highly professional and purposive search tasks supported by ES impose a higher accuracy requirement than Web search does, but the traditional keyword-based relevance computing techniques have intrinsic deficiencies in precision and recall [1]. (2) Some researchers incorporate semantic information as an effective tool for relevance-computing optimization [2-4].
However, they mainly utilize the ontology concepts recognized from the query (document) for relevance scoring, where the semantic relations defined in the ontology are only used for semantic query expansion or semantic similarity calculation; the role of relations in helping to understand the query intention (document content) and to match them correctly is not fully exploited. For example, given the query "Tom Hanks movie", several interpretations are reasonable, such as "movie directed by Tom Hanks", "movie that Tom Hanks acts in", or "movie produced by Tom Hanks"; which interpretation is the one in the user's mind? With the help of relations, we can match documents with queries more properly at the semantic level. (3) Unlike the Web, where a set of hyperlinks relates pages together, there are no such linkage relationships between documents in the ES environment, which makes the existing hyperlink-based importance computing solutions [5-6] inapplicable.

This paper proposes a semantic-based composite rank model (SCRD) meant to exploit a full-fledged domain knowledge base to support document ranking. With respect to other strategies, our model not only uses concepts, but also mines the implicit relations between them from queries (documents) and employs them for query (document) understanding and relevance computing. In addition, with hyperlinks missing, SCRD develops a novel method to compute document importance, which is used in conjunction with document relevance to further improve the accuracy of the rank results. Experiment results show that SCRD achieves better rank accuracy than existing models.

II. SEMANTIC-BASED COMPOSITE DOCUMENT RANKING

[Figure 1. The SCRD rank process.]

Our proposed approach (Fig. 1) comprises three stages: (1) query-document relevance computing; (2) document importance computing; (3) composite document ranking.
In the first stage, the relevance of each document in the candidate set to the query is evaluated by comparing the semantic concepts and relations they contain. Next, the linkages between target documents and reference documents are built and consumed for scoring document importance. In the last stage, the two factors acquired above are combined for ranking.

(Published in: 2012 IEEE Sixth International Conference on Semantic Computing.)

A. Query-document Relevance Computing

[Figure 2. The score_R computing process.]

Evaluating the relevance between a query and a document by analyzing the semantic relations they contain, in other words computing the relation-based relevance score score_R for each candidate document, is the core of our approach. Fig. 2 depicts how score_R is acquired in five steps.

1) Query Concept Recognition
This step adopts our technique introduced in [7] to map query keywords to concepts in the domain ontology. It first maps each keyword to a set of candidate concepts; then, for each query concept combination, a penalty value is calculated by analyzing the semantic graphs that represent the combination; finally, the top-1 combination with the minimum penalty value is chosen, and its contained concepts act as the query concepts.

2) Query Intention Understanding
This step constructs a set of query semantic graphs, S_q = {(G_q, η_q) | 0 ≤ η_q ≤ 1}, based on the query concept set C_query and the domain ontology, to imitate all possible query intentions. A query semantic graph G_q is a weighted connected graph: each concept C_i ∈ C_query is mapped to a node of G_q, and the edge between nodes C_i and C_j represents a semantic path linking C_i with C_j in the ontology, with a parameter recording the concepts appearing on the path and a weight ω_ij denoting the importance of the edge (its value is the path length). η_q denotes the possibility that the intention described by G_q is the one in the user's mind.
The method to acquire S_q is as follows.

(1) Minimum query semantic graph construction. We extract the minimum query semantic graph from the ontology, i.e., a sub-graph with the minimum number of edges that links all query concept nodes together (due to space limitations, the detailed algorithm is not introduced here). The number of edges in this sub-graph is denoted W_g-min.

(2) Acquiring S_q based on W_g-min. First, we construct a weighted graph G_all(V_all, E_all). Each query concept is a node in V_all. Each semantic path path(C_i, C_j) linking C_i with C_j in the ontology (C_i, C_j ∈ V_all) is mapped to an edge linking C_i and C_j in E_all, with the edge weight ω set to length(path(C_i, C_j)) and the record parameter set to the concept set on the path. When finding semantic paths between C_i and C_j, no concept belonging to V_all \ {C_i, C_j} may appear on the path. Note that when two paths are inverses of each other, only one is added to E_all; in addition, the taxonomy relation in the ontology is treated as unidirectional when constructing G_all. Second, we extract S_q from G_all(V_all, E_all). A spanning tree of G_all(V_all, E_all) whose sum of edge weights W_g satisfies W_g-min ≤ W_g ≤ W_g-min + d is a query semantic graph, with η_q = 1 − W_g / Σ_{i=1}^{n} W_{g_i} (n = |S_q|); all the spanning trees found in this way comprise S_q. We set d = 2 in our implementation.

3) Document Concept Recognition
This step adopts the approach of [8], which not only recognizes but also disambiguates concepts, to annotate document concepts.

4) Document Content Understanding
This step constructs a set of document semantic graphs, S_d = {(G_d, η_d) | 0 ≤ η_d ≤ 1}, for each document, to imitate all possible query intentions covered by it. Let C_doc be the set of concepts identified from a document. A document semantic graph G_d(V_d, E_d) is a weighted connected graph.
Each concept C_i ∈ C_doc ∩ C_query is a node in V_d. Each edge e(C_i, C_j) ∈ E_d represents a semantic path linking C_i with C_j in the ontology, with a weight λ_ij denoting the importance of the edge and a parameter recording the concepts on the path. To ensure run-time efficiency, we construct S_d offline, before the ranking process. The detailed method is as follows.

(1) Constructing a document semantic graph set S_sub for each C_sub (C_sub ⊆ C_doc, 2 ≤ |C_sub| ≤ d; d is set to 5 in our implementation). The way to build S_sub is nearly the same as that used to build S_q for C_query. The only difference is that when finding semantic paths between concepts, all concepts on the semantic paths must belong to C_doc.

(2) Acquiring S_d based on {(C_sub, S_sub)}. When a query comes, we first compare the query concept set C_query with all the document concept subsets C_sub. The C_sub that satisfies C_sub = C_doc ∩ C_query is chosen. If C_sub == C_query, its corresponding S_sub is chosen as S_d directly; otherwise, S_sub is modified by resetting the graph weight of each graph it contains (denoted G_d): first find all the super-graphs of G_d in S_q, denoted S' = {G_q(V_q, E_q) | G_q(V_q, E_q) ∈ S_q and G_d is a sub-graph of G_q}, and then change the graph weight of G_d from η_d to min_{G_q ∈ S'} η_q × (|V_d| − 1)/(|V_q| − 1).

5) score_R Computing
After acquiring the query semantic graph set S_q = {(G_q, η_q) | 0 ≤ η_q ≤ 1} and the document semantic graph set S_d = {(G_d, η_d) | 0 ≤ η_d ≤ 1}, score_R can be computed using Formula (1):

score_R = (Σ_{G_d ∈ S_d} η_d) / (Σ_{G_q ∈ S_q} η_q)    (1)

When score_R is obtained, we apply the vector-space model [9] on top of ontology concepts to calculate the concept-based relevance score score_C. The combined relevance score of a document with a query is then measured as r = δ·score_R + (1 − δ)·score_C (0 ≤ δ ≤ 1).

B.
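The graph weights and Formula (1) can be sketched as below. Because the printed formulas are garbled in this copy, the exact forms used here (η_q = 1 − W_g/ΣW_gi for the spanning-tree weights, and a ratio of weight sums for score_R) are reconstructions and should be treated as assumptions.

```python
def graph_possibilities(graph_weights):
    """eta for each candidate query semantic graph: lighter graphs
    (shorter linking paths) get a higher possibility. Reconstructed
    normalisation, an assumption."""
    total = sum(graph_weights)
    return [1.0 - w / total for w in graph_weights]

def score_r(doc_graph_etas, query_graph_etas):
    """Relation-based relevance: the document semantic graphs' weights
    summed and normalised by the query semantic graphs' weights, so a
    document covering more of the possible query intentions scores
    higher. Reconstructed form of Formula (1), an assumption."""
    denom = sum(query_graph_etas)
    return sum(doc_graph_etas) / denom if denom else 0.0
```

Under this reading, a document whose semantic graphs match every query intention reaches score_R close to 1, while a document covering none of them scores 0.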
Document Importance Computing

As the traditional importance computing algorithms are not applicable in the ES environment, we propose a novel approach that evaluates document importance with the help of an external document set acting as an implicit knowledge base. Its realization consists of two main steps. (1) Linkage building: class-instance linkages between target documents and external documents are built by comparing their content, aiming to find to what extent a target document is supported by external documents (a target document is a class; an external document can be labeled as an instance of a class). For realization, we adopt our fully automatic text categorization approach FACT [10]. (2) Linkage consuming: this step uses the number of instance documents in the external document database to compute the importance of each target document. (a) For each target document d_i, we count the number of instantiated external documents n(d_i). (b) Assuming target document d_j has the highest number of instance documents, max, its importance is H; the importance score of every other target document d_i is then measured as score_I(i) = H × n(d_i)/max, where H ∈ [0,1] is the maximum value a target can have.

C. Composite Document Ranking

Both the query-document relevance and the document importance are used to produce the final ranking list, with the idea that when documents have approximately equal relevance to the query, the more important a document is, the higher it should be placed. The ranking process is as follows. (1) Adopting the query-document relevance score as the only feature, the target documents are clustered into k groups. (2) For each group, the average of the relevance scores of the contained documents, denoted S1, is computed; the sequence of the groups is decided by S1. (3) The document importance score is used to re-rank the documents within each group. k is a factor balancing the query-document relevance and the document importance.
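The importance score and the group-then-re-rank idea can be sketched as follows. The importance formula follows the text (score_I(i) = H × n(d_i)/max); the grouping step replaces the paper's k-means clustering with a simple relevance-tolerance grouping, which is an illustrative simplification rather than the authors' implementation.

```python
def importance_scores(instance_counts, H=1.0):
    # score_I(i) = H * n(d_i) / max: a target document supported by more
    # external (instance) documents is considered more important.
    m = max(instance_counts.values())
    return {d: H * n / m for d, n in instance_counts.items()}

def composite_rank(relevance, importance, tol=0.05):
    # Documents whose relevance scores are within `tol` of the group
    # leader form one group (stand-in for k-means); groups are ordered
    # by relevance, and ties inside a group are broken by importance.
    docs = sorted(relevance, key=lambda d: -relevance[d])
    groups, current = [], [docs[0]]
    for d in docs[1:]:
        if relevance[current[0]] - relevance[d] <= tol:
            current.append(d)
        else:
            groups.append(current)
            current = [d]
    groups.append(current)
    ranked = []
    for g in groups:
        ranked.extend(sorted(g, key=lambda d: -importance[d]))
    return ranked
```

The tolerance here plays the role that k plays in the paper: a larger tolerance (smaller k) lets importance reorder more documents, a smaller one keeps the ranking closer to pure relevance.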
We adopt the improved k-means algorithm [11] to cluster the target documents.

III. EXPERIMENT AND EVALUATION

An English dataset (the Internet Movie Database, IMDB) and a Chinese dataset (a vehicle repair dataset, VR) are chosen to test our model. We acquire IMDB via the publicly available plain-text exports and parse it into an IMDB ontology serving as the semantic knowledge base, which contains 12 classes, 16 types of relations, 7,568,079 class instances, and 18,944,925 relation instances. The homepages of 100,000 movies are downloaded randomly as the target document set. About 2,000,000 movie reviews are also downloaded to serve as the external document set. VR is our internal dataset acquired from the Toyota company. The vehicle repair ontology serves as its semantic knowledge base, which contains 2 classes, 3 types of relations, 27,963 class instances, and 45,724 relation instances. The 6,109 symptom records in VR comprise the target document set, and the 424,370 actual repair cases comprise the external document set. To evaluate document relevance, we perform a user study in which 12 participants rate documents on a two-point Likert scale.

To evaluate the performance of SCRD, we adopt the widely used precision and recall metrics. In addition, a rank sequence number comparison is also carried out. We perform an F-measure evaluation on the two test datasets to select the optimal parameter δ from {0, 0.05, 0.1, 0.15, ..., 1}. δ = 0.6 (δ = 0.55) produces the highest F value on IMDB (VR). To make sure that SCRD is not biased, we set their averaged value, 0.575, as the final parameter value.

A. Precision and Recall Comparison

Two versions of SCRD are tested. Version1 considers only document relevance for ranking, while Version2 employs our complete composite rank method, in which document importance is also incorporated.
The concept-based ranking approach [2] is utilized as the baseline; it has been shown to improve rank accuracy considerably compared with keyword-based search engines.

Fig. 3 illustrates the recall and precision comparison results. When testing on IMDB, we run the rank models on two target document sets separately: the homepage set (Fig. 3(a)) and the review set (Fig. 3(b)). Each document in the former set is well organized and well written, containing all-around, complete information about a movie. Documents in the latter set, however, are written by cinephiles recounting their viewing experiences and involve only part of the movie information.

[Figure 3. Recall and precision comparison: (a) results on the homepage set in IMDB; (b) results on the review set in IMDB; (c) results on VR.]

As Fig. 3(a) shows, our relevance-only ranking method (Version1) outperforms the baseline on precision by about 20% on average and by 28% at most (at recall 0.9), which illustrates that the semantic relations between concepts are an important factor, not to be ignored, in matching queries with documents more accurately. However, the role of document importance in improving precision is less obvious: Version2 outperforms the relevance-only approach by only about 1.2% on average. Analyzing Fig. 3(b), we find that when searching the review set, the baseline works much worse than on the homepage set (a fall of 9.9% on average), demonstrating that it is sensitive to missing information; our approach is more stable, with precision reduced by only 3.8% on average. In other words, our relevance computing method is more robust than the baseline. Fig. 3(c) shows the comparison on VR. As we can see, the performance of Version1 on VR is much worse than on IMDB, with only a 2.6% improvement in precision.
That is because the semantic relations between concepts in the VR knowledge base are very simple: in most cases only one relation links two concepts together, so score_R contributes little to distinguishing documents. In conclusion, our relevance-only rank method is more robust and generates more accurate results than the baseline does, especially when the relations between concepts are complex.

B. Rank Sequence Number Comparison

TABLE I. Rank sequence comparison on IMDB (columns: position 1-20; the expert, baseline, ver1, and ver2 rankings; and the deviations d_exp,base, d_exp,ver1, d_exp,ver2; the per-position entries are not recoverable from the extracted text). Avg. error: baseline 6.1, ver1 2.6, ver2 1.9.

TABLE II. Rank sequence comparison on VR (same columns as Table I). Avg. error: baseline 3.7, ver1 3.3, ver2 1.4.

We choose one sample query randomly from each dataset and compare the rank sequences of the test models with those of experts in detail. The top-20 documents from the ranking of Version2 are chosen as the target documents to be re-ranked by the experts and the other models. Tables I and II show the results. In our experiments on both datasets, the rank sequence acquired by SCRD (Version2) is closer to the experts' than the baseline's is. Concretely, according to the expert in Table I, the average ranking error for the baseline is 6.1, but it is 1.9 for SCRD. According to the expert in Table II, the average ranking error for the baseline is 3.7, but it is reduced to 1.4 by SCRD.
In addition, from the comparison of Version1 and Version2, we can see that although document importance has an unobvious effect on precision and recall, it plays a crucial role in putting documents in a more proper sequence, reducing the average errors effectively (by 1.9 on VR and 0.7 on IMDB). This validates that our importance calculation and utilization method is reasonable.

C. Time Complexity

The most time-consuming parts of our method are the query concept recognition, query intention understanding, and composite document ranking processes, whose performance directly affects the effectiveness of SCRD (Table III). The computer we use is a Windows-based PC with a Pentium(R) D CPU at 2.80 GHz and 1 GB of RAM. In general, the promising results (36.2 ms at most) over nearly 0.2 million documents demonstrate the feasibility of SCRD.

TABLE III. Experimental results of time complexity on both IMDB and VR (the values of the query-concept-recognition column are not recoverable from the extracted text):
query keywords | query intent understanding (ms) | composite document rank (ms) | total delay of SCRD (ms)
1 | 0.1 | 8.9 | 9.7
2 | 1.2 | 9.5 | 11.5
3 | 3.1 | 11.3 | 15.1
4 | 8.5 | 11.1 | 20.3
5 | 23.2 | 12.2 | 36.2

IV. CONCLUSION

This paper proposes SCRD for document ranking in the ES environment. With the support of a domain ontology, the semantic information implied in a query (document), which is crucial for query intention (document content) understanding, is extracted and consumed for document relevance computing. Document importance is measured with the help of external document sets through a semantic linkage-based approach, and is then employed in conjunction with document relevance to acquire the final ranking list.

REFERENCES
[1] J. Lee, J. Min and C.W. Chung, "An effective semantic search technique using ontology," Proc. of the 18th Int'l Conf. on World Wide Web, pp. 1057-1058, 2009.
[2] P. Castells, M. Fernández and D. Vallet, "An adaptation of the vector-space model for ontology-based information retrieval," IEEE Trans. on Knowledge and Data Engineering, vol. 19, no. 2, pp. 261-272, 2007.
[3] T. Hao and Z. Lu, "Categorizing and ranking search engine's results by semantic similarity," Proc. of the 2nd Int'l Conf. on Ubiquitous Information Management and Communication, pp. 284-288, 2008.
[4] M. Daoud and M. Boughanem, "A personalized graph-based document ranking model using a semantic profile," Proc. of the 18th Int'l Conf. on User Modeling, Adaptation and Personalization, pp. 171-182, 2010.
[5] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Proc. of the 7th Int'l Conf. on World Wide Web, pp. 107-117, 1998.
[6] J. Kleinberg, "Authoritative sources in a hyperlinked environment," Proc. of the ACM-SIAM Symposium on Discrete Algorithms, pp. 668-677, 1998.
[7] J.Q. Li, C.C. Liu, Y. Zhao and B. Liu, "A practical system for semantic information retrieval," unpublished.
[8] F. Brauer, M. Huber and G. Hackenbroich, "Graph-based concept identification and disambiguation for enterprise search," Proc. of the 19th Int'l Conf. on World Wide Web, pp. 171-180, 2010.
[9] G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
[10] J.Q. Li, Y. Zhao and B. Liu, "Fully automatic text categorization by exploiting WordNet," Proc. of the 5th Asia Information Retrieval Symposium, pp. 1-12, 2009.
[11] Z. Wang, G.Q. Liu and E.H. Chen, "A k-means algorithm based on optimized initial center points," Journal of Pattern Recognition and Artificial Intelligence, vol. 22, no. 2, pp. 299-304, 2009.