Thesis on a Full-Text Retrieval System


Abstract

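Chinese full-text retrieval is one of the faster-growing areas of the information industry, and the indexer is the core of any Chinese retrieval system. This thesis describes the different algorithmic models used to build an indexer, compares the related techniques, and analyzes their respective strengths, weaknesses, and implementation difficulties. It then proposes a data structure and a new algorithmic model for implementing the index in Chinese full-text retrieval.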

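The thesis first surveys the techniques involved in index construction for Chinese full-text retrieval, chiefly index file data structures, the choice of indexing unit, and index compression algorithms.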

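Building on that survey, the thesis implements a complete indexing system using a single-character inverted-list file format and variable-byte coding compression. The system provides three functions: text preprocessing, index creation, and index updating. The text preprocessing stage separates Chinese text, foreign-language text, and special characters, and removes stop words.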
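The preprocessing and single-character indexing steps can be illustrated with a minimal sketch. This is not the thesis's own code: the names `STOPWORDS`, `tokenize`, and `build_index` are hypothetical, the stop-word list is invented for illustration, and treating only ASCII letters and digits as "foreign" text (lowercased) is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical stop-word set; the thesis does not list its stop words.
STOPWORDS = {"的", "了", "the", "a", "of"}


def is_cjk(ch):
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def tokenize(text):
    """Split text into single Chinese characters and whole ASCII words.

    Punctuation and other special characters act as separators and are dropped.
    """
    tokens, word = [], []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            word.append(ch.lower())  # assumption: foreign words are case-folded
            continue
        if word:  # a non-word character ends the current foreign word
            tokens.append("".join(word))
            word = []
        if is_cjk(ch):
            tokens.append(ch)  # the indexing unit is a single character
    if word:
        tokens.append("".join(word))
    return tokens


def build_index(docs):
    """Map each non-stop-word token to a list of (doc_id, position) postings.

    Positions count every token, including stop words, so that the distance
    between two indexed tokens still reflects the original text.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, tok in enumerate(tokenize(text)):
            if tok not in STOPWORDS:
                index[tok].append((doc_id, pos))
    return index


docs = {1: "全文检索 the Index!", 2: "索引的压缩"}
index = build_index(docs)
```

Keeping positions in the postings is one way to support phrase queries over single-character units, since a multi-character word can be matched by looking for adjacent positions.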

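For index creation, the thesis first presents an algorithm based on the traditional inverted list, the sort-merge index creation algorithm, which needs temporary space 10 times the size of the source text. To address this excessive temporary space requirement, the thesis proposes a new index creation scheme that uses a hierarchical inverted-list organization and a hybrid of chained and sequential storage. The scheme needs no additional temporary space and also makes index creation more efficient. During index creation the system compresses the index with variable-byte coding; experiments show that this reduces the index file size by 20-30%.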
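For readers unfamiliar with variable-byte coding, the sketch below shows one common convention: 7 payload bits per byte, with the high bit set on the last byte of each number. Storing gaps between sorted document IDs rather than the IDs themselves is an assumption made here for illustration; the abstract does not say which values the thesis encodes, and the function names are hypothetical.

```python
def vb_encode_number(n):
    """Encode one non-negative integer as a variable-byte code."""
    if n < 0:
        raise ValueError("variable-byte coding needs a non-negative integer")
    out = [(n & 0x7F) | 0x80]  # lowest 7 bits, flagged as the final byte
    n >>= 7
    while n:
        out.append(n & 0x7F)  # higher 7-bit groups, no terminator flag
        n >>= 7
    return bytes(reversed(out))  # most significant group first


def vb_encode(numbers):
    """Concatenate the codes of several integers."""
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data):
    """Decode a byte string produced by vb_encode back into integers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 0x80:  # continuation byte
            n = (n << 7) | byte
        else:  # final byte of the current number
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
    return numbers


def compress_postings(doc_ids):
    """Encode an ascending list of document IDs as variable-byte gaps."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return vb_encode(gaps)


def decompress_postings(data):
    """Rebuild document IDs by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vb_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


encoded = compress_postings([824, 829, 215406])
restored = decompress_postings(encoded)
```

Small numbers fit in a single byte, so the scheme pays off when most encoded values are small. The saving achieved in practice depends on the data, and this sketch makes no attempt to reproduce the figure reported in the abstract.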

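For index updating, the thesis proposes three quasi-dynamic update strategies for sequential storage and one dynamic update algorithm for the chained storage format. The update algorithm for the chained storage structure adopted by the system has O(n) complexity.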
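The abstract does not spell out the chained storage layout, so the following is only a sketch of the general idea under stated assumptions: each term's postings live in fixed-capacity blocks linked into a chain, and a tail pointer lets a new posting be appended without moving existing data. The class names and the block capacity are hypothetical. Under these assumptions, adding a document costs time proportional to the number of tokens it contains, which is one plausible reading of the O(n) bound.

```python
class Block:
    """Fixed-capacity chunk of postings, linked to the next chunk."""

    def __init__(self, capacity):
        self.postings = []
        self.capacity = capacity
        self.next = None


class ChainedPostingList:
    """Posting list stored as a chain of blocks with a tail pointer."""

    def __init__(self, block_capacity=4):
        self.block_capacity = block_capacity
        self.head = self.tail = Block(block_capacity)

    def append(self, doc_id):
        # When the last block is full, link a fresh one; nothing is copied.
        if len(self.tail.postings) == self.tail.capacity:
            new_block = Block(self.block_capacity)
            self.tail.next = new_block
            self.tail = new_block
        self.tail.postings.append(doc_id)

    def __iter__(self):
        block = self.head
        while block is not None:
            yield from block.postings
            block = block.next


class ChainedIndex:
    """Term -> chained posting list, updated one document at a time."""

    def __init__(self, block_capacity=4):
        self.block_capacity = block_capacity
        self.lists = {}

    def add_document(self, doc_id, tokens):
        # dict.fromkeys removes repeated tokens while keeping their order.
        for tok in dict.fromkeys(tokens):
            plist = self.lists.get(tok)
            if plist is None:
                plist = self.lists[tok] = ChainedPostingList(self.block_capacity)
            plist.append(doc_id)

    def lookup(self, tok):
        return list(self.lists.get(tok, ()))


idx = ChainedIndex(block_capacity=2)
idx.add_document(1, ["索", "引", "索"])
idx.add_document(2, ["索"])
idx.add_document(3, ["索", "表"])
```

The trade-off against sequential storage is the usual one: appends are cheap because existing postings never move, while reading a long list has to follow links between blocks.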

Keywords: Chinese full-text retrieval; indexer; inverted list; index compression

ABSTRACT

Chinese full-text retrieval is one of the fastest-developing fields in the information industry, and the core of a Chinese retrieval system is the indexer. This paper describes several algorithms for constructing an indexer, compares the related techniques, and analyzes the advantages, disadvantages, and implementation difficulties of each. Finally, it proposes a data structure and a new algorithmic model for the index in a Chinese full-text retrieval system.

The paper first summarizes the techniques related to index construction in Chinese full-text retrieval, mainly the data structure of index files, the selection of the indexing unit, and index compression algorithms.

On this basis, the paper implements the entire indexing system using these techniques, namely character-based inverted lists and the variable-byte coding compression algorithm. The system has three functions: text preprocessing, index creation, and index updating.

The text preprocessing part separates Chinese characters, foreign-language text, and special characters, and deletes stop words.

The index creation part first presents an index creation algorithm based on traditional inverted lists, the sort-merge method. This algorithm needs temporary space 10 times the size of the source text. To solve the problem of this oversized temporary space, the paper proposes a new index creation scheme whose organizational structure is a hierarchical inverted list and whose storage is a mix of chained and sequential layouts. It needs no extra temporary space and also improves the efficiency of index creation. During index creation, the index is compressed with variable-byte coding; experiments indicate that this compression reduces the size of the index file by 20-30%.
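The sort-merge method can be sketched as follows, as a textbook-style illustration rather than the paper's implementation. The function names and `run_size` are hypothetical, and the sorted runs are kept in memory here as stand-ins for the temporary files a real indexer would write to disk, which is where the temporary space requirement comes from.

```python
import heapq
from itertools import groupby


def make_runs(pairs, run_size):
    """Cut the (term, doc_id) stream into sorted runs of at most run_size pairs."""
    runs, current = [], []
    for pair in pairs:
        current.append(pair)
        if len(current) == run_size:
            runs.append(sorted(current))  # a real system would write this to disk
            current = []
    if current:
        runs.append(sorted(current))
    return runs


def merge_runs(runs):
    """Merge the sorted runs and group the result into posting lists."""
    index = {}
    merged = heapq.merge(*runs)  # lazy k-way merge of already sorted runs
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        doc_ids = []
        for _, doc_id in group:
            # Skip repeats of the same term within one document.
            if not doc_ids or doc_ids[-1] != doc_id:
                doc_ids.append(doc_id)
        index[term] = doc_ids
    return index


def sort_merge_index(docs, run_size=4):
    """Build term -> sorted doc_id list from {doc_id: [term, ...]}."""
    pairs = ((term, doc_id) for doc_id, terms in docs.items() for term in terms)
    return merge_runs(make_runs(pairs, run_size))


index = sort_merge_index({1: ["b", "a", "b"], 2: ["a", "c"], 3: ["b"]})
```

Every (term, document) pair is materialized in a run before the final index exists, so the intermediate data can be considerably larger than the source text.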

For index updating, the paper proposes three quasi-dynamic index update strategies based on sequential storage and one dynamic index update algorithm based on chained storage. The experiment indicates that the complexity of the index update algorithm based on chained storage is O(n).

KEYWORDS: Chinese full-text retrieval; indexer; inverted lists; index compression
