Software Engineering Graduation Thesis: Building a Search Engine Based on Lucene and Heritrix


A Campus Network Search Engine Based on Lucene

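A good web crawler should be flexible and robust, and easy for administrators to operate and manage. Flexibility means that the crawler should be applicable to as many different … as possible.

… search engines play a vital role. In contemporary society the division of labor varies, and so does the range of information that people pay attention to …

(2) Introduction to the open-source Lucene retrieval framework
Information processing module. Lucene is based on a file indexing mechanism and can index only text files. The information processing module comprises three steps: reading page content, parsing page content, and building the index.

The index usually consists of two parts: terms (keywords) and their occurrences. Each term (keyword) in the index is followed by a list (a position table) that records every position at which the word occurs across all documents.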
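The index structure mentioned in this excerpt, where every term is followed by a position list, can be sketched in a few lines. This is only an illustrative sketch and not Lucene's on-disk format: the function name `build_positional_index` and the whitespace tokenization are assumptions made for the example (Chinese text would first need word segmentation).

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; docs maps doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: tokens are separated by whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {
    1: "lucene builds an inverted index",
    2: "an index maps each term to the documents containing the term",
}
index = build_positional_index(docs)
print(index["index"])  # {1: [4], 2: [1]}
print(index["term"])   # {2: [4, 10]}
```

Looking a term up in `index` returns, per document, the word offsets at which it occurs, which is the information a position table is meant to hold.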
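II. Search Engines
A search engine is a network tool that provides users with information retrieval. The technology developed gradually as the amount of information on the Internet grew rapidly. In the early stage of the Internet, websites were relatively few and information was easy to find. But with the explosive development of Internet technology, more and more information appeared online in all kinds of forms, users found it hard to locate what they needed, and specialized search sites emerged to meet the public's retrieval needs. Today, Google's …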

Design and Implementation Manual for a Lucene-Based Search Engine System

2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016)

Research and Implementation of a Search Engine Based on Lucene

Wan Pu, Wang Lisha
Physics Institute, Zhaotong University, Zhaotong, Yunnan, PR China

Keywords: search engine; Lucene; web spider; Chinese word segmentation

Abstract. Building on an in-depth study of the basic principles and architecture of search engines, and on secondary development of the Lucene package, this paper designs a complete search engine framework and implements its core modules. The system compensates for some deficiencies of the existing Lucene framework and improves search accuracy, so it has practical and commercial value.

Introduction

With the continuous expansion of network coverage and the development of network technology, network information resources have spread and multiplied rapidly. They come from all walks of life, span different disciplines, regions, fields and languages, and exist as text, images, audio, video, databases and other forms. With hundreds of millions of items online, finding the information one needs has become an important research topic in Internet technology, and search engines arose to help users do so.

A search engine is a tool that helps Internet users query information. It collects and discovers information on the Internet according to a certain strategy, then interprets, extracts, organizes and processes that information, and so provides users with retrieval and navigation services. Search engines greatly help users access network information resources quickly, accurately and efficiently.
A search engine is a web-based tool developed to meet people's need to search for network information. It serves as a navigator for Internet information queries and as a bridge between users and network information.

Principle analysis of search engine

The basic principle of a search engine is to start from known resources, use their summaries and links to determine the new information points to be searched, let the engine's traversal program visit those points, and finally index, classify and organize the documents found there into the index database [1]. Logically, this recursive traversal can bring all reachable information into the index database. When a user enters keywords for the content to be found, the search program reads the information already traversed and stored in the index database, matches it against the keywords, and returns the corresponding or related information in an organized form.

A. Search engine system workflow

A search engine that meets users' needs generally consists of information collection, information preprocessing and retrieval service.

Fig. 1 Search engine workflow

Information collection: the page collector takes its input from the URL database, resolves the address of the web server in each URL, establishes a connection, sends requests and receives data. It stores the retrieved page data in the original page library, extracts link information into the page structure library, and puts the URLs still to be crawled into the URL library, so that the process iterates until the URL library is empty [2].

Information preprocessing: after collection, the page information has been saved in a specific format.
The first step in this part is therefore to index the original pages, which gives the search engine its page snapshot function. Next, the pages in the index page library are segmented, turning each page into a set of words. Finally, the mapping from pages to index words is inverted to form the inverted file, and the distinct index words found in the pages are gathered into a vocabulary. In addition, the structural information among pages is analyzed to judge the importance of the information and to establish page meta-information.

Retrieval service: the data delivered to the service stage comprise the index page library, the inverted file and the page meta-information. The query agent accepts the user's query phrase, segments it, retrieves from the index vocabulary and inverted file the documents containing the phrase, together with history logs and other information, then computes the importance of the result set and finally sorts and returns it to the user [3].

With these components a search engine system can be built. The traversal program follows link addresses across the Internet, the pages it finds are processed and integrated, and the resulting index data are stored in the index database. When a user types keywords or phrases related to the information sought, the search engine looks up the matching pages or data in the index database, ranks the results with a priority algorithm, and presents them through the user interface.

B. Key technology of search engine system

A typical search engine generally consists of three modules: the network spider, the indexer and the searcher.
The web spider first takes a URL from the queue of URLs to be visited, fetches the corresponding page from the Web and analyzes it, extracts all the URL links it contains and adds them to the to-visit queue, and moves the fetched URL to the visited queue [4]. This procedure repeats continuously, and all collected pages are saved for further processing. Initially the queue holds only seed URLs, which serve as the starting points of the traversal. Large, popular websites are usually chosen as seeds because such pages tend to link to many other pages. Web spiders read pages over HTTP and automatically follow the hyperlinks in HTML documents. The network can be treated as a directed graph in which each page is a node and each hyperlink is a directed edge. So you can take …

Fig. 2 Web spider work principle

The indexer's function is to analyze the information gathered by the spider, extract index entries that represent the documents, and generate the index table of the document library. The index table generally takes some form of inverted list, that is, it maps index entries to the corresponding documents. It may also record the positions at which index entries occur in a document, so that adjacency or proximity among index entries can be computed. The indexer can use a centralized or a distributed indexing algorithm [5]. The effectiveness of a search engine depends largely on the quality of its index. When a user submits a query, the engine does not retrieve data from the Web in real time; what it searches is page data collected in advance. Fast access to the collected pages requires some indexing mechanism. A page can be represented by a series of keywords that, for retrieval purposes, describe its content; once the keywords are found, the page can be found.
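The spider loop described in this section (seed URLs, a to-visit queue, a visited set, pages as nodes of a directed graph) can be sketched as a breadth-first traversal. This is a minimal illustration and not the paper's implementation: the function `crawl`, the `fetch_links` callback and the toy `WEB` graph are assumptions standing in for real HTTP fetching and HTML link extraction.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first traversal of the link graph; returns URLs in visit order."""
    to_visit = deque(seeds)   # queue of URLs still to be visited
    visited = set()           # URLs that have already been fetched
    order = []
    while to_visit and len(order) < max_pages:
        url = to_visit.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # fetch_links stands in for "download the page and extract its links".
        for link in fetch_links(url):
            if link not in visited:
                to_visit.append(link)
    return order


# Toy directed graph: each page maps to the pages it links to.
WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}
print(crawl(["a"], lambda url: WEB.get(url, [])))  # ['a', 'b', 'c', 'd']
```

The `max_pages` limit is an added safeguard for the example; the text itself describes iterating until the URL queue is empty.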
Conversely, if the page index is built on keywords, the relevant pages can be retrieved quickly. Specifically, keywords are stored in the index file; each keyword has a pointer list in which every pointer refers to a page related to that keyword, and all the pointer lists together constitute the inverted file.

The searcher's function is to find documents quickly in the index database according to the user's query, evaluate the relevance of documents to the query, sort the results for output, and implement some form of user relevance feedback. Commonly used information retrieval models are the set-theoretic, algebraic, probabilistic and hybrid models. The searcher is the module that interacts directly with the user. Its interface can be implemented in several ways, of which the Web is the most common. Through it the searcher receives a user query, performs word segmentation on it to obtain the query keywords, then fetches the page data matching those keywords and returns them to the user after sorting [6].

Search engine based on Lucene

Lucene is a Java-based full-text retrieval toolkit. It is not a complete search application; rather, it provides indexing and search capabilities for applications. Lucene is an open-source Apache project (originally part of the Jakarta family) and the most popular open-source Java full-text retrieval package, and the search functions of many applications are already built on it [7]. Lucene can index any data of text type, so once your data are converted into text, Lucene can index and search them. For example, HTML or PDF documents must first be converted into text and then handed to Lucene for indexing; the resulting index files are kept on disk or in memory, and the index is then queried according to the criteria the user enters.
Because it does not prescribe the format of the documents to be indexed, Lucene is applicable to almost all search applications.

C. Technical analysis of Lucene

The Lucene architecture is strongly object-oriented. It first defines a platform-independent index file format, then designs the core components of the system as abstract classes whose platform-specific parts are concrete implementations of those classes; platform-related parts such as file storage are likewise encapsulated as classes. The result of this object-oriented design is a search engine system with low coupling, high efficiency and easy secondary development. The Lucene system structure is shown in Fig. 3.

In the Lucene file format, data types are defined in terms of bytes, which is the main reason the index file format is platform-independent. A Lucene index consists of one or more segments, and each segment is composed of a number of documents. A Document object can be regarded as a virtual document, for example a web page, an e-mail message or a text file, from which large amounts of data can then be retrieved. A Document contains one or more fields with different field names; a field represents the document itself or some metadata related to it. Each field corresponds to a piece of data that may be queried or retrieved from the index during a search. A field consists of a field name and a value. A Term is the basic unit of search; like a field, it comprises a pair of strings, the field name and the value. The conceptual structure of Lucene index files is shown in Fig. 4.

Segments allow new documents to be added to the index quickly: documents are written to a newly created segment, which is merged with existing segments only periodically. This improves efficiency because it minimizes modification of the physically stored index files.
One of Lucene's advantages is its support for incremental indexing: after a new document is added to the index, its contents can be searched immediately. This makes Lucene suitable for environments that process large amounts of information, where rebuilding the whole index would be inefficient. Mapping the concept onto the physical structure, an index is a directory (folder) whose contents are all the files it contains. These files are stored in groups according to the segment they belong to; files in the same group share a file name and differ in extension. In addition there are three files without extensions, named segments, deletable and lock, which respectively record all the segments, record the deleted files, and control the synchronization of reading and writing. Each segment comprises a set of files with different extensions, whose common file name is the segment name recorded in the segments file.

Because the Lucene system structure is object-oriented, developers do not need to know the internal structure and implementation of Lucene; they simply call the application interfaces Lucene provides and can extend them with the functionality they need. In indexing, Lucene differs from most search engines: when building an index it creates a new index file and then, according to the update strategy, merges the new file with the existing index files, which greatly improves indexing efficiency. Lucene also supports batch indexing and index optimization, and its incremental indexing works in small batches, which is a clear advantage when indexing large amounts of data.
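The segment idea described here (write each new document to a fresh segment, search across all segments immediately, merge only periodically) can be illustrated with a small in-memory sketch. It is a deliberate simplification under stated assumptions: the class `SegmentedIndex`, its `merge_factor` parameter and the "merge everything once N segments exist" rule are invented for the example and do not reproduce Lucene's actual merge policy or file layout.

```python
class SegmentedIndex:
    """Toy incremental index: one small segment per added document."""

    def __init__(self, merge_factor=3):
        self.merge_factor = merge_factor  # hypothetical merge threshold
        self.segments = []                # each segment: term -> set of doc ids

    def add_document(self, doc_id, text):
        # A new document only creates a new segment; old segments are untouched.
        segment = {term: {doc_id} for term in set(text.lower().split())}
        self.segments.append(segment)
        if len(self.segments) >= self.merge_factor:
            self._merge()

    def _merge(self):
        # Periodic merge: fold all segments into a single larger one.
        merged = {}
        for segment in self.segments:
            for term, doc_ids in segment.items():
                merged.setdefault(term, set()).update(doc_ids)
        self.segments = [merged]

    def search(self, term):
        # A query consults every segment, so new documents are visible at once.
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term.lower(), set())
        return sorted(hits)


idx = SegmentedIndex(merge_factor=3)
idx.add_document(1, "lucene index")
idx.add_document(2, "incremental index")
hits_before_merge = idx.search("index")   # both documents, from two segments
idx.add_document(3, "segment merge")      # third segment triggers a merge
hits_after_merge = idx.search("index")    # same answer from the merged segment
```

The point of the design is that adding a document never rewrites existing index data; the cost of reorganizing the index is paid only at merge time.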
Lucene accepts index input through a common data structure, so it adapts flexibly to a variety of data sources such as databases, office documents, PDF documents and HTML documents; indexing a new source only requires a suitable parser that converts the source into that data structure. Although Lucene has powerful search and indexing capabilities, it is not a complete search engine: it cannot collect pages from the Internet, and its ranking leaves room for improvement [8]. The ordering of search results matters greatly, because users usually pay attention only to the first page of results. Placing the pages that are most valuable to users at the top is therefore an important topic in search engine research.

D. Search engine based on Lucene

A search engine mainly consists of a collecting system, an indexing system and a retrieval system, while the user interface displays search results to users. The web spider fetches pages from the network according to a certain strategy and recursively downloads the pages linked from those it has crawled. The indexing system uses the analysis system to segment the pages the spider has collected into words and obtain the corresponding index entries; for each type of document it uses the corresponding parser to extract the text, then builds the index files and stores them in the index database.
Users enter search keywords through the user interface. The retrieval system analyzes the input and submits it to the word segmentation system, matches the resulting keywords against the indexed words, ranks the pages with the same or similar keywords by a specific algorithm, and returns the results to the user interface.

The indexing mechanism of a Lucene-based system needs parsing capability. Lucene itself can parse txt and html files, but because Internet files come in many formats, the corresponding parsing packages must be added to handle other document types. A Lucene analyzer consists of two parts: a word segmentation component, called the Tokenizer, and a filter, called the TokenFilter. An analyzer often comprises one tokenizer and several filters, the filters being used mainly to process the segmented words. When the index is built, what is written to the index and can be retrieved by users are the terms, that is, the text after the analyzer's segmentation and related processing. The tokenizer returns a raw segmented term through its next() method, and a filter returns a filtered term through the same method but performs no segmentation itself. Because the filter constructor receives a TokenStream instance, two arrangements are possible: a filter can wrap other filters to form a nested pipeline, and a filter can wrap a tokenizer to filter the words it produces. This nesting forms the core structure of the Lucene analyzer.

Retrieval is the last link in a search engine, and it is judged mainly by response speed and result ranking.
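The Tokenizer/TokenFilter nesting described in this section can be imitated with Python generators: a tokenizer produces raw tokens, and each filter wraps another token stream. The sketch below only mirrors the structure and is not Lucene's Java API; the function names, the letters-and-digits token rule and the small stop-word set are all assumptions made for the example.

```python
import re


def tokenizer(text):
    """Tokenizer stage: yield raw tokens (runs of ASCII letters or digits)."""
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        yield match.group()


def lowercase_filter(tokens):
    """Filter stage: wraps any token stream and lowercases each token."""
    for token in tokens:
        yield token.lower()


def stop_filter(tokens, stop_words=frozenset({"a", "an", "the", "of", "is"})):
    """Filter stage: wraps any token stream and drops stop words."""
    for token in tokens:
        if token not in stop_words:
            yield token


def analyze(text):
    """Analyzer: one tokenizer wrapped by nested filters, innermost first."""
    return list(stop_filter(lowercase_filter(tokenizer(text))))


print(analyze("The Analyzer is a chain of Filters"))
# ['analyzer', 'chain', 'filters']
```

Because every filter accepts a token stream and returns one, filters can be stacked in any number, which is the nesting property the text attributes to the analyzer.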
When a user enters a search keyword, the word segmentation system analyzes and segments it, the similarity between the query and the term vectors in the index database is computed, and the successfully matched results are returned to the user. The retrieval part of the system consists of a Lucene query analysis system and a search result clustering analysis system. The former interprets the user's keywords: it segments the query with the reverse maximum matching algorithm, filters stop words from the segmented result where necessary, resolves ambiguous fields by segmentation probability, and thus obtains the actual semantic words from which the search terms are built. The Lucene search system then runs the query and submits the results to the clustering analysis system, which analyzes and processes them to find highly related pages and generate the result pages automatically. Finally, the analysis system detects similar documents among the Lucene search results.

Conclusion

With the rapid development of the Internet, the amount of information is growing exponentially, yet the ultimate goal remains to let users access information easily. This task falls to the search engine, and returning the needed information and high-quality web content to users places higher requirements on it. Because the Lucene scoring algorithm does not adequately reflect a page's position within its website, this paper designed an improved scheme for the index and retrieval modules. It combines the document's base score, its position in the website and its own characteristics to improve the accuracy of result ranking and thereby the accuracy of search.

References

[1] Monz C. Proceedings of the 25th European Conference on Information Retrieval Research [C]. Berlin/Heidelberg: Springer, 2003: 571-579.
[2] Nicholas Lester, Justin Zobel, Hugh E. Williams. Efficient Online Index Maintenance for Contiguous Inverted Lists [J]. Inf. Process. Manage., 2006, 42(4): 916-933.
[3] George Samaras, Odysseas Papapetrou. Distributed Location Aware Web Crawling [J]. In: Proceedings of the 13th International World Wide Web Conference. New York, USA: ACM Press, 2004: 468-469.
[4] Hai Zhao, Changning Huang. Effective Tag Set Selection in Chinese Word Segmentation via Conditional Random Field Modeling [C]. In: Proceedings of PACLIC 20. Wuhan, November 1-3, 2006: 84-94.
[5] Arvind Arasu, Jasmine Novak, Andrew Tomkins, John Tomlin. PageRank Computation and the Structure of the Web: Experiments and Algorithms. In: Proceedings of the 11th International World Wide Web Conference, 2002.
[6] Giuseppe Antonio Di Lucca, Anna Rita Fasolino, Porfirio Tramontana. Reverse Engineering Web Applications: the WARE Approach [J]. Journal of Software Maintenance and Evolution: Research and Practice, 2004, 11(3): 15.
[7] Giuseppe Pirro, Domenico Talia. An Approach to Ontology Mapping Based on the Lucene Search Engine Library [C]. Proceedings of the 18th International Conference on Database and Expert Systems Applications, 2007, 9: 156-158.
[8] Laurence Hirsch, Robin Hirsch, Masoud Saeedi. Evolving Lucene Search Queries for Text Classification [C]. Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, 2007, 6(12): 166.

Research and Application of a Full-Text Search Engine Based on Lucene and Heritrix

[English abstract, author names and affiliation: garbled in extraction. The legible fragments indicate that the paper discusses the searching mechanism of Lucene and the framework of Heritrix and develops a full-text search application based on Lucene.]
Key words: Lucene; full-text search engine; Heritrix
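1 Overview
With information on the Internet growing geometrically, search engines have become users' first choice for browsing network information. Traditional general-purpose search engines (Google, Yahoo and, in China, Baidu) are tools that help users find information, and they have become the entry point through which most Internet users access the network. These general-purpose engines nevertheless have shortcomings: the volume of information they return is large, their search depth is insufficient, and their queries are not very accurate. Vertical search engines emerged to solve these problems. A vertical search engine is a specialized search engine for a particular field or industry. It extends the general search engine and provides information services that match the behavior of professional users. Its characteristics are "specialized, precise and deep", with a strong industry flavor; compared with the unordered mass of information in general search engines, a vertical search engine is more specific and more in-depth. This paper describes the basic principles and usage of the open-source Lucene and Heritrix technologies, proposes a scheme for integrating Lucene and Heritrix so that they merge fully with the J2EE platform, and implements a vertical search engine system for mobile phone products.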

Research and Implementation of a Small Topic-Specific Search Engine Based on Lucene and Heritrix

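In recent years the Internet has continued to grow at high speed, and the information on it has become ever more varied and cluttered. It is increasingly impractical for users to locate information on their own, and the demand for information search keeps growing. The results that general-purpose search engines return are often mixed with much unnecessary information, so users have begun to look for engines that search specialized content more accurately. Research on topic-specific search engine technology is therefore necessary.

This thesis analyzes the main modules of a search engine and the basic steps of implementing one, and introduces the background knowledge needed to build a search engine. It divides the construction of a topic-specific search engine into two main processing modules: data collection and processing, and data search.

Drawing on the source code and architecture of Heritrix, the thesis studies and implements the data collection module, including URL parsing and assignment and the multithreading mechanism. It analyzes why Heritrix falls short when crawling topic-specific content and proposes concrete improvements. These solve several problems, including restricting URL extraction to pages with topic content and the failure of the crawler's multithreading mechanism when collecting from a single website. A method of preprocessing the collected data with regular expressions is also given.

Based on an analysis of the source code of the Lucene information retrieval toolkit, the thesis implements the data search module. To meet the needs of topic-specific search, it customizes a mechanism for further sorting and filtering the returned results. Because Lucene's support for Chinese is insufficient, some optimizations for Chinese are added to the keyword segmentation of query strings. The analysis and implementation take the mechanisms of the specific programming language into account and point out matters to watch when implementing in that language.

Finally, the thesis demonstrates a topic search engine that collects and searches the articles in the prose category of a particular website. The main functions of the engine are tested and verified, and the search results are further checked against other search principles. The final results show that the expected search results were obtained accurately, and that multithreading was fully exploited in the collection phase to increase collection speed.

The work also has some shortcomings. For example, search is not implemented with a distributed mechanism, and the user interface is not optimized and is not friendly enough. Future work will consider using Solr and DWR to build a friendly interactive user interface.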
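The abstract mentions restricting URL extraction to topic pages and preprocessing collected data with regular expressions, without giving the patterns. The following is a hedged sketch of what such steps might look like: the URL layout in `TOPIC_URL` (a `/prose/` directory with numeric page names) is purely hypothetical, and regex-based tag stripping is a rough approximation rather than a full HTML parser.

```python
import re

# Hypothetical URL layout for the topic pages of interest; not from the thesis.
TOPIC_URL = re.compile(r"^https?://[^/]+/prose/\d+\.html$")
SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def is_topic_url(url):
    """Return True if the URL matches the assumed topic-page pattern."""
    return TOPIC_URL.match(url) is not None


def html_to_text(html):
    """Strip script/style blocks and tags, then collapse whitespace."""
    html = SCRIPT_OR_STYLE.sub(" ", html)
    text = TAG.sub(" ", html)
    return WHITESPACE.sub(" ", text).strip()


page = "<html><style>p{color:red}</style><p>Hello <b>world</b></p><script>var x=1;</script></html>"
print(is_topic_url("http://example.com/prose/123.html"))  # True
print(is_topic_url("http://example.com/news/123.html"))   # False
print(html_to_text(page))                                  # Hello world
```

Filtering URLs before they enter the crawl queue keeps off-topic pages out of the collection, and the text-extraction step produces the plain text that an indexer such as Lucene expects.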

Design and Implementation of a Full-Text Search Engine Based on Lucene

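… effectiveness.

2 Analysis of the Lucene system structure
… query results. Figure 1 shows the structure of the Lucene system.

Figure 1. Structure of the Lucene system

2.1 Analyzer
The analyzer is mainly used for word segmentation. After a document passes through the Analyzer, only the useful parts remain in the output and the rest is removed. The analyzer exposes an abstract interface, so language analysis (the Analyzer) can be customized. Lucene provides two general-purpose analyzers by default, SimpleAnalyzer and StandardAnalyzer, and neither supports Chinese by default, so adding segmentation rules for Chinese requires modifying these two analyzers …

2.2 org.apache.lucene.index
The index package is the core of the whole system. It mainly provides the read and write interfaces to the index store; through it one can create a store, add and delete records, and read records. The essence of full-text retrieval is to build an index entry for every segmented word, so that a query only has to traverse the index rather than the whole text, which greatly improves retrieval efficiency. The quality of index creation directly determines the quality of the whole system. Lucene's index tree is of high quality and efficiency. The main classes in this package are I…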

Design and Implementation of a Search Engine Based on Lucene

Computer Engineering (计算机工程), Vol. 37, No. 16, August 2011
Section: Software Technology and Databases
Document code: A
DOI: 10.3969/j.issn.1000-3428.2011.16.…

[Abstract, partly recovered] … Overall, the system adopts the Model-View-Controller design pattern based on the Struts 2.1 framework. The data collection module captures data with a finite-state automaton based on regular expressions, the index module applies the inverted index method, and the word segmentation algorithm is dictionary-based forward maximum matching for Chinese. Experimental results show that the scheme achieves a high resource retrieval rate while also ensuring …

[Key words] File Transfer Protocol (FTP) search engine; Lucene framework; Model View Control (MVC); inverted index
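The dictionary-based forward maximum matching segmentation named in this abstract can be sketched as follows. The function name, the tiny `DICTIONARY` and the `max_word_len` limit are assumptions for illustration; a real system would load a much larger lexicon. At each position the algorithm tries the longest candidate first and falls back to a single character when nothing in the dictionary matches.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Segment text left to right, always taking the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try candidates from longest to shortest, starting at position i.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical mini-dictionary for the example.
DICTIONARY = {"搜索", "搜索引擎", "引擎", "全文", "检索", "全文检索"}
print(forward_max_match("全文检索搜索引擎", DICTIONARY))  # ['全文检索', '搜索引擎']
print(forward_max_match("基于搜索引擎", DICTIONARY))      # ['基', '于', '搜索引擎']
```

The greedy longest-first choice is what makes the method fast and simple, at the cost of occasional wrong splits on ambiguous strings.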

Information Retrieval Paper: A Lucene-Based Experiment (University Paper)

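Lucene-Based Experiment Report

Introduction to information retrieval systems
An information retrieval system is a tool that uses information retrieval techniques, such as full-text retrieval, to help users find specific information. It represents, stores and organizes information correctly and also provides access to it. Information here is a very broad notion: an article, a text, a web page, an e-mail, a photo, or even a collection of virtual information. The whole process comprises building the text database, indexing, and retrieval.

The information retrieval process:

1 Build a text database
Before the search function is developed, an information retrieval system needs some preparation. First, a text database must be built to store all the information users can retrieve. On this basis the text model of the retrieval system is determined. A text model is an information format recognized by the system and has properties such as low redundancy. The contents of the text database may, of course, keep changing while the system runs.

2 Build the index
Once the text model exists, an index should be created over the text in the database. An index greatly speeds up retrieval. There are several ways to build one, depending on the size of the system. Large-scale systems (search engines such as Baidu and Google) build an inverted index.

3 Search
After the text has been indexed, it can be searched. A search request is usually submitted by the user; the request is analyzed, looked up in the index, and the results are returned.

Lucene
As a system accumulates more and more information, fishing the one needle you want out of that sea becomes very important. Full-text retrieval is the usual solution to this problem, and Lucene is a tool for implementing it; any application can gain full-text retrieval by embedding Lucene.

Lucene is an open-source full-text retrieval toolkit, a subproject of the Apache Software Foundation's Jakarta project. It is not a complete full-text search engine but a framework for one, providing a complete query engine and indexing engine and part of a text analysis engine (for two Western languages, English and German). Its aim is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it.

How Lucene works
The services Lucene provides consist of two parts: one in and one out.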

Design and Implementation of a Full-Text Searching System Based on Lucene

Authors: ZHANG Pan (1), NIE Gang (2)
(1. College of Information Engineering, Wuhan University of Science & Technology Branch, Wuhan 430073, China; 2. College of Computer Science, Wuhan University of Science & Engineer, Wuhan 430073, China)
Source: Computer Knowledge and Technology (《电脑知识与技术》), 2010, No. 01

Abstract: Lucene is a high-performance, scalable full-text information retrieval library written purely in Java, and it can easily be integrated into applications to add indexing and searching capabilities. This paper analyzes the indexing mechanism of Lucene, discusses the framework of Heritrix, and finally studies the application of Lucene-based full-text retrieval in depth through a practical example.

Key words: Lucene; full-text search; Heritrix
CLC number: TP393.07  Document code: A  Article ID: 1009-3044(2010)01-9-03

The level of Internet search use reflects a population's information-processing ability. A study a few years ago found that users in the United States were about half a year ahead of users in Europe in Internet use, judged mainly by who used more keywords per search on average.

A Lucene-Based Full-Text Search Engine for Chinese and English Documents

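Master's thesis, University of Electronic Science and Technology of China

3.1.2 FTP Spider
For the FTP spider, the system uses C as the development language for the sake of runtime efficiency. Because searching different FTP sites is a completely independent process, the system uses a multithreaded or multiprocess model with blocking I/O. In choosing between threads and processes, two considerations were weighed:

1. Under heavy load, creating multiple threads does not necessarily yield higher efficiency; on the contrary, it increases the CPU's workload, and even on a multi-CPU system the number of threads should not exceed the number of CPUs. Calling fork() causes heavy CPU usage, but only briefly. So when the workload is large and the hardware is capable, creating multiple processes can finish the task in the shortest time; when memory is small and real-time requirements are modest, multithreading is preferable [52].
2. In modern Linux, a context switch between different processes takes only 15% longer than a context switch between threads of the same process. What that extra time buys is a programming model that is better understood and more robust.

For these two reasons the system adopts the multiprocess model.

3.1.3 Word segmentation system
The word segmentation system uses ICTCLAS from the Chinese Academy of Sciences. ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) is an open-source Chinese lexical analysis system based on a multi-layer hidden Markov model, developed by the Institute of Computing Technology of the Chinese Academy of Sciences after years of research. Its functions are Chinese word segmentation, part-of-speech tagging and unknown-word recognition. Its segmentation accuracy reaches 97.58% (according to the most recent evaluation by the 973 expert group); unknown-word recognition based on role tagging achieves recall above 90%, with recall for Chinese personal names close to 98%; and segmentation with part-of-speech tagging runs at 543.5 KB/s.

Among existing segmenters, commercial systems aside, ICTCLAS is the technical leader among open-source systems. In Chinese information retrieval, segmentation quality is an important measure of a search engine and one of the things users experience directly. Because its own segmentation techniques were not yet mature, the system borrows this mature segmenter.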

Graduation Thesis: Building a Search Engine Based on Lucene and Heritrix

Undergraduate Graduation Project (Thesis)
The Construction of Search Engine Based on Lucene and Heritrix

School (Department): Computer Science and Engineering
Major: Software Engineering
Student name:
Student ID:
Supervisor:
Reviewer:
Completion date:

Abstract
With the Internet booming, the information available online is as vast as the sea. While enjoying the convenience the Internet brings, people face the problem of finding the information they need accurately and quickly in this mass of content, and Internet search engines arose in response.

Based on an in-depth study of the principles, composition, data structures and workflow of search engines, this thesis describes the analysis and implementation of the three core components of a search engine: the web spider (crawler), the page indexer and the searcher. The spider component uses the Heritrix web crawler, which is based on recursion and archiving. The indexing component uses the open-source Lucene engine architecture to design and implement a reusable, extensible subsystem for building and managing indexes. The search component, supported by Ajax, provides a flexible and concise user interface.

The system can crawl web pages, build and manage indexes, write logs and search information, and has some application prospects.

Key Words: Search Engine; Chinese Word Segmentation; Index

Contents
Abstract (Chinese)
Abstract (English)
1 Introduction
  1.1 Project background
  1.2 Current state of development in China and abroad
2 Development platform and related technologies
  2.1 Development platform
  2.2 Development technologies
    2.2.1 Introduction to the Heritrix web crawler
    2.2.2 Introduction to Lucene
    2.2.3 Introduction to Ajax
3 System analysis and design
  3.1 Requirements analysis
    3.1.1 System architecture analysis
    3.1.2 Use case model
    3.1.3 Domain model
  3.2 Outline design
  3.3 Detailed design
    3.3.1 Index building subsystem
    3.3.2 User interface subsystem
4 System implementation
  4.1 Building the package framework
    4.1.1 Index building subsystem
    4.1.2 User interface subsystem
  4.2 Implementation of the main functions
    4.2.1 Index building subsystem
    4.2.2 User interface subsystem
Conclusion
References
Acknowledgements

1 Introduction
1.1 Project background
Around 1994, the World Wide Web appeared.

A Campus Network Search Engine Based on Lucene

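[Summary] As network informatization advances and the volume of web page information grows sharply, search engines have become one of the essential tools for acquiring knowledge. Based on the specific requirements of a campus network search engine, this paper introduces its overall framework, in which Lucene, as an open-source retrieval framework, is highly applicable.

Key words: search engine; Lucene; web crawler; site search

I. Introduction
The 21st century is the era of network information, and network information has become indispensable to people's work and study. The network offers information services and its information resources to users worldwide, but as it flourishes and the quantity of information grows rapidly, this mass of information takes many forms and is scattered across every corner of the network. How to retrieve what users need from it has therefore become an important question.

General-purpose search engines such as Google and Baidu exist, but they do not suit every situation and need, and no single largest or best search engine covers every search scope, because different groups of people need different information resources. People are used to looking for information on the Internet, and a single website often holds rich information resources, so how to find what users want quickly within a website has also become a focus of attention.

II. Search Engines
A search engine is a network tool that provides users with information retrieval. The technology developed gradually as the amount of information on the Internet grew rapidly. In the early stage of the Internet, websites were relatively few and information was easy to find. But with the explosive development of Internet technology, more and more information appeared online in all kinds of forms, users found it hard to locate what they needed, and specialized search sites emerged to meet the public's retrieval needs.

Today, Google's great success has drawn the whole world's attention to the search engine field, and Google has to some extent played a guiding role. In 2007 Google decided to offer dedicated search services to small websites. All this indicates that small, special-purpose search engines will play a considerable role in how people obtain information from the Internet.

III. Lucene
Lucene is a subproject of the Apache Software Foundation's Jakarta project; it is an open-source full-text retrieval tool.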

A Full-Text Search Engine Based on Lucene

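[Authors] Chen Yong (陈勇); Zhang Hanguo (张汉国); Cheng Yun (成筠)
[Journal] Modern Computer (Professional Edition) (《现代计算机（专业版）》)
[Year (Volume), Issue] 2009 (000) 011
[Abstract] A full-text search engine is implemented on a Java Web platform with a B/S (browser/server) architecture. The system uses MySQL as its back-end database and the open-source frameworks Heritrix and Lucene to retrieve mobile phone product information from a website. It also uses the popular Java frameworks Struts, Hibernate and Spring, together with interface-oriented programming, to decouple the system, and uses Extjs, with its strong UI capabilities, on the front end to implement the AJAX application.
[Pages] 4 (P134-137)
[Author affiliation] School of Computer Science and Engineering, Zhongkai University of Agriculture and Engineering, Guangzhou 510225 (all three authors)
[Language] Chinese
[Related literature]
1. Building a full-text search engine for a university library website based on Heritrix + Lucene [J], Hua Jingsheng (华京生); Li Ping (李萍)
2. Design and implementation of a Lucene-based full-text search engine with Chinese word segmentation [J], Li Binglian (李炳练)
3. Design and implementation of a full-text search engine based on Lucene and Heritrix [J], Zhang Xuan (张宣); Liu Xiaofei (刘晓飞)
4. A site-wide full-text search engine based on Lucene and Hibernate [J], Wu Weiguo (武卫国); Pan Qing (潘清)
5. Design and implementation of a full-text search engine based on Lucene [J], Hu Jiahai (胡嘉海)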

Design and Implementation of a Full-Text Search Engine (Literature Review)

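Preface
Faced with massive amounts of digital information, search engine technology helps us discover valuable information and resources. Search service providers such as Google and Baidu help us find what we need on the Internet. But on internal networks that are not, or cannot conveniently be, connected to the Internet, and on hosts that store massive amounts of data, discovering valuable information by searching is not so easy. Developing a small full-text search engine that makes retrieval efficient in these two situations is therefore well worthwhile.

This design focuses on the design and implementation of a full-text search engine. It uses Java EE together with frameworks such as Struts, Spring, Hibernate and Ajax to implement a full-text search engine on top of Lucene, the open-source search framework of the Apache Software Foundation.

Main text
Origins of search engine technology
In 1990, Alan Emtage, Peter Deutsch and Bill Heelan, then at McGill University in Montreal, created Archie out of personal interest, a program for indexing and querying files scattered across FTP hosts. Their aim was simply to make looking up files more convenient. They did not foresee that their creation would open up one of the broadest markets of the later Internet, or that their small program would evolve into an indispensable tool of the network age: the search engine.

In 1991, the US networks CERFnet, PSInet and Alternet formed the CIEA (Commercial Internet Exchange Association) and announced that users could use their Internet subnets for commercial purposes, which opened the commercialization of the Internet. Commercialization meant that Internet technology was no longer reserved for research and the military, that many more people could reach the Internet, and above all that there was a potential market with huge business opportunities.

In 1994, Michael Mauldin launched Lycos, the earliest search engine in the modern sense, and the Internet entered a period in which search technology was widely applied and search engines developed rapidly.

These are several important dates in the history of the Internet and of search engines. The Internet has existed for only about 15 years, and search engines have operated commercially for only about 10. In those 10 short years the Internet has changed dramatically and grown explosively.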

Design and Implementation of a Search Engine Based on Lucene and AJAX: Thesis Proposal

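School of Information Engineering
Undergraduate Graduation Project

Chinese title: 基于Lucene和AJAX的搜索引擎的设计和实现
English title: Design and Implementation of a Search Engine Based on Lucene and AJAX
Topic source: proposed by the supervisor
Topic type: C (design / thesis)
Supervisor: He Guangjun (何广军)
Student: Zhao Ling (赵玲)
Major and class: Computer Science and Technology, Class 4 of the 2011 cohort

1. Background and significance of the topic
As the Internet advances and network resources keep increasing, finding the information we need effectively has become a key problem. Building a search engine is a good way to solve it.
To meet users' deeper needs, search engines in China are also continually improving. How to bring human knowledge and intelligence into retrieval, and how to achieve a qualitative leap in search quality, are the directions in which Chinese search engines are working. Chinese Internet users' demand for intelligent search is also evident. This means that search is no longer just a technology or a means of network navigation, but will become one of the essential tools of ordinary life.

2. Basic content of the research
The system in this graduation project is built mainly on the iBATIS framework. It uses a spider to fetch pages, indexes page content through the Lucene API, and handles user requests with AJAX. maves is also added to complete the system.
System front end: a simple retrieval platform. The user enters keywords, the results are displayed, and the user interacts with the system through the platform to obtain the resources needed. System back end: it crawls large numbers of web pages for the system, then parses the pages and extracts the useful content to build a lexicon.
The system obtains the data to be searched through a web crawler, and stores and manages the useful information by means of text analysis and information extraction.

3. Main problems to be solved
Design the relevant pages, such as the basic page for entering keywords and the search results page. Use Ajax to refresh pages and to handle navigation between related pages.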

Design and Implementation of a Search Engine

二〇〇八年六月本科毕业设计说明书学校代码： 10128 学号： 040201015 题目：搜索引擎的设计与实现学生姓名：庞佳学院：信息工程学院系别：计算机专业：计算机科学与技术班级：计算机04-2 指导教师：苏依拉副教授钱庭荣工程师摘要为了适应网络信息的飞速增长，并且能够迅速、方便地从网络中获取有效信息, 搜索引擎逐渐走进了人们的生活，“竹竹”搜索引擎系统在这样的条件下，应运而生。

本文首先系统的介绍了搜索引擎的概念、发展历史、和搜索引擎的分类。

使读者能够初步了解搜索引擎技术。

然后，详细介绍了“竹竹”搜索引擎系统。

“竹竹”搜索引擎是基于Web的，面向笔记本电脑品牌的搜索引擎。

系统的前端以MVC模式来实现，Spring做中间层，JDBC作后端来开发实现的。

本系统分为三个子模块，抓取模块实现的功能为：将web上的海量网页抓取到系统中；采用的实现方法是使用Heritrix来完成对网页的抓取。

处理模块实现的功能为：解析网页，提取其中的有用内容，为网页建立词库，由于笔记本电脑的品牌名在现有词库中不存在，因此要建立其特有的词库文件，对解析网页生成的信息文件进行分词，并建立索引，将索引存入数据库中；采用的实现方法是：通过Lucene的API来实现对网页内容的建索，使用HTMLParser的API实现了对网页内容的解析。

用户模块实现的主要功能是：用户模块是系统的用户接口，用户通过此模块完成与系统的交互，当用户在查询界面上输入要检索的品牌信息后，系统将在可以接受的时间内，返回用户所需的结果集；采用的实现方法是：通过DWR封装了AJAX技术，处理用户请求；通过Lucene的API 来实现检索。
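The abstract does not say how the brand-name lexicon is applied during word segmentation. As a rough sketch only, the class below shows one common dictionary-based technique, forward maximum matching; choosing it is an assumption made here for illustration, not something the thesis states. The class name `LexiconSegmenter`, its methods, and the sample entries are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Forward-maximum-matching segmenter over a small in-memory lexicon (illustrative only). */
class LexiconSegmenter {
    private final Set<String> lexicon = new HashSet<>();
    private int maxWordLength = 1;

    /** Adds a word (for example a brand name) to the lexicon. */
    void addWord(String word) {
        if (word == null || word.isEmpty()) {
            return;
        }
        lexicon.add(word);
        maxWordLength = Math.max(maxWordLength, word.length());
    }

    /** Greedy split: at each position take the longest lexicon word, else a single character. */
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a lexicon entry or only one character remains.
            while (end > pos + 1 && !lexicon.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        LexiconSegmenter segmenter = new LexiconSegmenter();
        segmenter.addWord("think");     // hypothetical lexicon entries
        segmenter.addWord("thinkpad");
        System.out.println(segmenter.segment("thinkpadx"));
    }
}
```

Because the longest match wins, a brand name added to the lexicon is kept as one token instead of being split into shorter words, which is the effect a custom brand lexicon is meant to achieve. The same loop works on Chinese text, where each character is one position.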

Keywords: search engine; Lucene; Heritrix

Contents

Introduction
Chapter 1 Background
1.1 The concept of a search engine
1.2 History of search engines
1.2.1 Origins of search engines
1.2.2 First-generation search engines
1.2.3 Second-generation search engines
1.2.4 Overview of well-known current search engines
1.3 Classification of search engines
1.3.1 Full-text indexes
1.3.2 Directory indexes
1.3.3 Metasearch engines
1.3.4 Vertical search engines
1.3.5 Other non-mainstream forms
Chapter 2 System Requirements Analysis
2.1 How a search engine works
2.2 Functional requirements
2.3 Performance requirements
Chapter 3 Overall System Design
3.1 Overview of the "zhuzhu" search engine system
3.2 System modules
3.2.1 Module functions
Chapter 4 Detailed System Design
4.1 Module overview
4.2 Crawl submodule
4.2.1 Running Heritrix
4.2.2 Analyzing web pages
4.3 Processing submodule
4.3.1 Parsing web pages
4.3.2 Creating the lexicon
4.3.3 Generating persistent classes
4.3.4 Creating Documents
4.3.5 Storing data
4.4 User submodule
4.4.1 Search page
4.4.2 Detail page
Conclusion
References
Acknowledgements

Introduction

With the continued development and growing popularity of the Internet and of information technology, the volume of online information is growing explosively. It has reached into every aspect of people's lives, changed how people live and think, and made it easier to share information resources worldwide.

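A Full-Text Retrieval Solution Using Heritrix and Lucene

... is extensible: each of its components can be extended, so crawl logic can be implemented conveniently to suit actual needs. Heritrix ...

... Lucene [3] is used to analyze the pages, build the index, and run searches.

The task of this project is to build a full-text index of the campus websites. It involves three main jobs: extracting page content, Chinese word segmentation, and building the index. Given the characteristics of the campus network, the crawler can refresh the whole network once a month and refresh the registered sites once a day. This keeps up with how fast departmental sites change and collects their content effectively. When campus content is indexed, the indexing program first analyzes the pages fetched by the crawler and extracts their important information, chiefly the page URL, the page ...

In the third item, "Select Writers", delete the default "org.archive.crawler.writer.ARCWriterProcessor" and add "org.archive.crawler.writer.MirrorWriterProcessor" ...

... processes the raw data into an efficient cross-referenced lookup structure for fast searching. The services Lucene provides actually comprise two ...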

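Design and Implementation of a Campus Network Search Engine Based on Lucene

Search engines have become an indispensable tool on the Internet. A search engine mainly comprises the following modules: a web crawler, an indexer, a retriever, and a user interface [4]. The web crawler mainly ...

... the amount of information is also growing rapidly. Relying on manual lookup alone to find information on the campus network is inefficient, slow, and laborious. In the Internet field, text retrieval has long been a research focus in large-scale information processing [1] and an important direction in networked multimedia information processing. With continued exploration of full-text search techniques, search engine technology in information process...

... searches the system's index and returns the results to the user. The specific modules of different search engines may vary and be extended in different ways [5].

... Commercial search engines such as Google, Baidu, and Yahoo are powerful, but they also have shortcomings, such as fairness ...

... into the file store.

... optimization, and other steps. File filtering removes worthless strings from the files. Information extraction pulls the file title and other information of interest out of the filtered content. Index building writes the extracted information into index files, which consist of a dictionary and blocked posting lists [7]. Index optimization tunes the index files to speed up retrieval. Because Lucene builds its full-text index on words, Chinese text must be segmented before indexing; this system uses the Chinese Academy of Sciences je-analysis-1.5.3 toolkit for that purpose ...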

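Design and Implementation of a Search Engine Based on Lucene

Abstract: Search engines have become an important tool for finding information online. General-purpose search engines are powerful, but when they search an enterprise portal with many sub-sites they respond slowly and their index coverage is incomplete. Lucene is a powerful full-text indexing toolkit with which a search engine can be developed quickly. This paper describes how to build a customized Chinese search engine with Lucene, a Java-based full-text retrieval toolkit, and compares the customized engine experimentally with Google's site search. The customized engine performs better than Google when searching an enterprise portal with many sub-sites.
Keywords: Web; search engine; Lucene
CLC number: TP391.3  Document code: A  Article ID: 1005-3751(2004)10-0027-04
Received: 2004-02-19. About the author: 高琰 (born 1973), female, from Yixing, Jiangsu; doctoral candidate; research interest: information retrieval.

0 Preface
Over the past few years Internet resources have grown rapidly, turning the Web into a massive information service network that holds many kinds of information resources, with sites all over the world. At the same time, more and more institutions, groups, and individuals use search engines to look for information on the Internet. For a portal site, offering search to users is an important way to attract visitors. Many sites currently build search ... For an enterprise portal with many sub-sites, however, general-purpose search engines have many shortcomings and cannot meet the requirement. Although engines such as Google support queries restricted to a given site, they cannot query several sites at once; they do not update their indexes promptly, so results are incomplete and contain dead links; and calls to them respond slowly. A search engine customized by the enterprise itself is therefore of real value. This paper uses the Lucene development toolkit to implement a full-text search engine.

1 Structure of a search engine
A search engine usually consists of four parts: a searcher, an indexer, a retriever, and a user interface [1]. a. The searcher roams the Internet to discover and collect infor...

... the search and indexing strategies and their parameters are all stored in an .xml configuration file, which system maintainers can modify through this interface.
2) File content analyzer: parses files in HTML, PDF, and other formats and extracts the links and the content of each field. The fields are defined by the developer; those defined here are url, contentType (content type), lastModified (last-modified date), contents (content), title (title), ...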

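Design and Implementation of an Enterprise Search Engine Based on Lucene

Authors: 谌怡丛; 谢茜茜; 刘启华
Journal: 《现代商贸工业》
Year (volume), issue: 2011 (023) 014
Abstract: As enterprises become more digitized, the large body of information they accumulate over time takes many forms and is scattered across the corporate network and employees' computers, so staff find it very hard to locate the data they need and must rely on a search engine. After an in-depth analysis of Lucene's strengths and of its system structure, data flow, and index structure, the authors built an enterprise search engine with Lucene at its core, combined with the ICTCLAS word segmentation system. It provides full-text retrieval over unstructured data in PDF, Word, and HTML formats.
Pages: 3 (pp. 218-220)
Affiliation: School of Information Management, Wuhan University, Wuhan, Hubei 430072 (all three authors)
Language: Chinese
CLC classification: F49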

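Undergraduate Graduation Project (Thesis)

Building a Search Engine Based on Lucene and Heritrix

Abstract

Today the Internet is flourishing, and the information on it is as vast as the sea. The system described here can crawl web pages, build and manage indexes, keep logs, and search information, and it has some practical application potential.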

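The openness of the Web and the broad accessibility of the information on it have greatly encouraged people to create content. In little more than a decade, at least 4 billion pages have been published on the World Wide Web, and tens of thousands more are added every day. Online resources are inherently digital and networked, and that is a double-edged sword: it makes them easy to collect and organize, but the sheer volume arriving at once can leave us overwhelmed and unsure where to turn.

Search engines trace back to 1990 at McGill University in Canada, where staff and students developed Archie to find what they needed among scattered FTP resources. Archie periodically collected and analyzed the file names available on FTP systems and offered a service for locating files spread across FTP hosts. Once the Web swept the world, search systems that took web pages as their target emerged. They became a major means of finding information online. Through a search engine, people can find what they want in the vast network almost instantly, and thanks to the engine's intelligence and the nature of today's web pages, entering a few relevant words is enough to reach directly related information.

Today, Google's enormous success has drawn the world's attention to the search field. Seemingly overnight, search services of every kind appeared. From Google and Yahoo at the start to Baidu, MSN, 中搜 (Zhongsou), and Sogou today, search brands keep multiplying and their services keep growing richer. Meanwhile, with the spread of Web 2.0, online information is expanding exponentially, and all kinds of websites need search features to meet user needs. In the enterprise market, demand for full-text retrieval also keeps rising: document-processing and content-management software all need full-text search built in.

Against this background, search technology has developed rapidly. Articles, magazines, and papers on search are everywhere, and forums and blogs carry many related posts. Search engine technology has become one of the most popular technologies.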

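1.2 Current state at home and abroad

Web pages are the main component of the Internet and the main source from which people obtain online information. To help people find what they need among a huge and varied mass of pages, retrieval tools for web pages have developed the fastest. Web-based retrieval tools are generally divided into two kinds: web search engines and web directories.

A web search engine uses automatic page-discovery software such as "web spiders" to find pages, automatically indexes some or all of the text on each page, and produces summary files in a target format together with a network-accessible database that people query to retrieve online information.

A web directory is entirely different. It does not include every page of every site; instead, professionals carefully select site home pages and place them in the appropriate categories. A directory holds far less information than a search engine, and the classification schemes of different directories are somewhat inconsistent, which makes them inconvenient to use. So although directory indexing is of relatively high quality, far fewer people use directories than use search engines.

Because of the complexity of online information and the limits of retrieval technology, however, these tools also have clear shortcomings:

(1) As the number of pages soars, manual classification, indexing, and use can no longer keep up. Users face a huge amount of unorganized information, and a simple keyword search returns more results than they can handle.

(2) The usefulness of information is hard to evaluate. Some sites repeat certain keywords many times in their pages so that well-known search engines are more likely to select them, hoping to raise the site's standing, while possibly offering nothing of value to users.

(3) Online information changes constantly, and people always want the latest. Because it changes at every moment, real-time search is nearly impossible; even a page just viewed may be updated, expire, or be deleted at any time.

The development of web retrieval tools therefore centers on further improving the tools and techniques, raising the quality of retrieval services, and fixing what is unsatisfactory in online retrieval.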

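2 Development platform and related technologies

Developing the system requires J2EE and J2SE technologies. The platform should be sound, convenient, and fast to work with, so choosing the development environment is critical: a reasonable choice improves development efficiency and follows the principle of delivering the most valuable work at the lowest cost.

2.1 System development platform

The development platform is shown in Table 2.1.

Table 2.1 System development platform
Item: Platform
Operating system: Windows XP (Chinese edition)
Database system: SQL Server 2000 Personal Edition SP3
Front-end page design: Macromedia Dreamweaver 8.0
J2EE server engine: Tomcat 6.0
Integrated development tool: MyEclipse 5.5.1 GA
Java runtime environment: JDK 1.6.0_03

2.2 System development technologies

2.2.1 Introduction to the Heritrix web crawler

Heritrix is an open-source web crawler written in Java, which users can use to fetch the resources they want from the network.

Its greatest strength is extensibility: developers can extend each of its components to implement their own crawl logic.

Heritrix is designed to strictly honor the exclusion directives in robots.txt files and in META robots tags.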
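To make the robots.txt exclusion idea concrete, here is a minimal sketch of the kind of check a polite crawler performs before fetching a path. It is not Heritrix code and does not use any Heritrix API. The class `RobotsRules` and its methods are hypothetical, and the parser is deliberately simplified: it reads only `User-agent` and `Disallow` lines, matches by plain path prefix, and merges the rules of the `*` group with those of the named agent, whereas the full protocol applies only the most specific matching group and also defines `Allow` and wildcards. META robots tags are not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified robots.txt check: User-agent and Disallow lines only, prefix matching. */
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Collects Disallow prefixes from every group that names the given agent or "*". */
    static RobotsRules parse(String robotsTxt, String agent) {
        RobotsRules rules = new RobotsRules();
        boolean groupApplies = false;   // does the current group address our agent?
        boolean readingAgents = false;  // still inside the group's User-agent lines?
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            int hash = rawLine.indexOf('#');   // strip trailing comments
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                      // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!readingAgents) {          // first User-agent line starts a new group
                    groupApplies = false;
                    readingAgents = true;
                }
                if (value.equals("*") || value.equalsIgnoreCase(agent)) {
                    groupApplies = true;
                }
            } else {
                readingAgents = false;
                if (field.equals("disallow") && groupApplies && !value.isEmpty()) {
                    rules.disallowed.add(value);
                }
            }
        }
        return rules;
    }

    /** Returns false when the path starts with any collected Disallow prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /private/\n";
        RobotsRules rules = RobotsRules.parse(robotsTxt, "examplebot"); // hypothetical agent name
        System.out.println(rules.isAllowed("/private/report.html"));
        System.out.println(rules.isAllowed("/index.html"));
    }
}
```

A crawler would fetch `/robots.txt` once per host, build the rules, and consult `isAllowed` before queueing each URI on that host.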

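Heritrix is the Internet Archive's (IA) open-source, extensible, web-scale archival crawler project.

The project began in early 2003. IA's aim was to develop a specialized crawler to archive online resources and build a digital library of the Web. Over the past six years, IA has accumulated 400 TB of data.

(1) Heritrix 1.0.0 has the following key features:

① A single crawler can continuously and recursively crawl multiple independent sites.

② It crawls from a supplied set of seeds and collects content scoped to the exact URIs and exact hosts of the sites.

③ It processes URIs mainly in breadth-first order.

④ Its main components are efficient and extensible.

⑤ It is highly configurable.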
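Features ② and ③ above can be illustrated with a small sketch: a breadth-first traversal that starts from one seed and stays on the seed's host. This is not Heritrix code. The class `BreadthFirstFrontier`, the method `crawlOrder`, and the in-memory `links` map that stands in for fetching pages and extracting links are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Breadth-first visiting order from one seed, restricted to the seed's host (illustrative only). */
class BreadthFirstFrontier {

    /**
     * Returns URIs in the order a breadth-first crawl would visit them.
     * The links map plays the role of "fetch the page and extract its links".
     * URI.create throws IllegalArgumentException for malformed URIs.
     */
    static List<String> crawlOrder(String seed, Map<String, List<String>> links) {
        String seedHost = URI.create(seed).getHost();
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();        // prevents queueing a URI twice
        Queue<String> frontier = new ArrayDeque<>(); // FIFO queue gives breadth-first order
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String current = frontier.remove();
            order.add(current);
            for (String target : links.getOrDefault(current, Collections.emptyList())) {
                String host = URI.create(target).getHost();
                boolean inScope = seedHost != null && seedHost.equalsIgnoreCase(host);
                if (inScope && seen.add(target)) {
                    frontier.add(target);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        // Hypothetical link graph; example.edu and other.org are placeholder hosts.
        links.put("http://example.edu/",
                Arrays.asList("http://example.edu/news", "http://other.org/page"));
        links.put("http://example.edu/news",
                Arrays.asList("http://example.edu/"));
        System.out.println(crawlOrder("http://example.edu/", links));
    }
}
```

The FIFO queue is what makes the traversal breadth-first: every page one link away from the seed is visited before any page two links away. The host comparison is the simplest possible scope rule; a real crawler offers finer scopes such as exact URI or domain.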

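(2) Limitations of Heritrix:

① Crawler instances run singly and cannot cooperate with one another.

② It requires complex operation even when machine resources are limited.

③ It is officially supported and tested only on Linux.

④ Each crawl runs independently, with no support for revisiting pages to pick up changes.

⑤ It recovers poorly from hardware and system failures.

⑥ Little time has been spent on performance optimization.

2.2.2 Introduction to Lucene

Lucene is a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search toolkit: not a complete full-text search engine, but a framework for building one. It provides a complete query engine and indexing engine, plus a partial text-analysis engine (for two Western languages, English and German).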
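The split between an indexing engine and a query engine can be shown with a toy inverted index. The sketch below uses only the Java standard library and makes no claim about Lucene's actual API or index format. The class `TinyInvertedIndex`, its methods, and its tokenizer (lower-casing and splitting on non-alphanumeric characters, a stand-in for a real text analyzer) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index: term -> sorted set of document ids (illustrative only). */
class TinyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Indexing side: record every term of the text under the document id. */
    void addDocument(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Query side: ids of documents containing all query terms (AND semantics), ascending. */
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : tokenize(query)) {
            TreeSet<Integer> docs = postings.get(term);
            if (docs == null) {
                return Collections.emptyList();   // one missing term empties an AND query
            }
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);           // intersect posting lists
            }
        }
        if (result == null) {
            return Collections.emptyList();       // query had no terms
        }
        return new ArrayList<>(result);
    }

    /** Stand-in analyzer: lower-case, then split on anything that is not a letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(1, "Lucene is a full-text search toolkit");
        index.addDocument(2, "Heritrix is a web crawler");
        System.out.println(index.search("lucene toolkit"));
    }
}
```

The tokenizer is where language-specific analysis would plug in. Splitting on punctuation and spaces is adequate for English-like text, but Chinese needs a word segmenter at this step, which is why an analyzer limited to English and German is described as only a partial text-analysis engine.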