Research and Implementation of a Distributed Search Engine Based on Hadoop
Taiyuan University of Technology
Master's Thesis
Name: 封俊 (Feng Jun)
Degree applied for: Master
Major: Software Engineering
Advisor: 胡彧 (Hu Yu)
20100401
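Abstract
A distributed search engine is a new type of information retrieval system that combines distributed computing with full-text retrieval. It has changed the way people obtain information and lets them obtain it more effectively. Search engines now reach into every aspect of online life and are often called the first stop on the Internet.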
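Most current search engine systems share the same centralized architecture, in which all functional modules are deployed on a single server. This places high demands on server hardware and leaves the system with poor stability and limited scalability. Overcoming these drawbacks requires purchasing very expensive large servers, a cost that not everyone can afford. In addition, many traditional information retrieval systems rely on fairly primitive string matching to produce search results. This approach is simple to implement, but its search efficiency is very low when the data volume is large, so users cannot obtain useful information in time. These two shortcomings pose a major challenge to the wider adoption of search engines. To meet this challenge, distributed computing and inverted-file full-text retrieval are introduced into the search engine system.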
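The contrast between string matching and inverted-file retrieval can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `scan_search`, `build_inverted_index`, and `index_search`, the sample documents, and the whitespace tokenization are all assumptions made for illustration, and the string matching is simplified to a whole-word match so that both approaches return comparable results.

```python
from collections import defaultdict

def scan_search(docs, term):
    """Naive matching: every query inspects every document (hypothetical helper)."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if term in text.lower().split())

def build_inverted_index(docs):
    """Map each keyword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def index_search(index, term):
    """Answer a query with one dictionary lookup instead of a scan over all documents."""
    return sorted(index.get(term, set()))

# Illustrative sample data, not taken from the thesis.
docs = {
    1: "hadoop distributed file system",
    2: "distributed search engine",
    3: "full text retrieval",
}
index = build_inverted_index(docs)
```

The index is built once, at indexing time. After that, the cost of a query depends on the length of the posting list for the term rather than on the total size of the corpus, which is why the scan approach degrades as the data volume grows while the index approach does not.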
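Based on an analysis of several current distributed search engine systems, this thesis summarizes their strengths and weaknesses and, to address their shortcomings, proposes a distributed search engine based on Hadoop. The main work is to improve the functional modules of a traditional search engine: the steps of the crawling, indexing, and searching processes are analyzed in detail, and each step that need not execute sequentially is further decomposed into two parts, data computation and data merging. Following the Map/Reduce programming model, the data computation tasks are encapsulated in Map functions and the data merging tasks in Reduce functions. The improved search engine can be deployed in a Hadoop distributed environment built from inexpensive PCs and offers high response speed, reliability, and scalability. These properties match the technical requirements of a distributed search engine closely, so this thesis uses Hadoop as the system's distributed computing platform. In addition, the system uses full-text retrieval based on inverted files and builds a keyword-based inverted index module; it also combines the TF-IDF and PageRank algorithms to improve the page scoring strategy and optimize the search results.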
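The split into data computation (Map) and data merging (Reduce) can be sketched for the indexing step as follows. This is a single-process illustration in plain Python rather than Hadoop code, and it is not the thesis's implementation: `map_fn`, `reduce_fn`, `run_job`, the whitespace tokenization, and the sample documents are assumptions introduced here. The `run_job` function only stands in for the grouping by key that a Map/Reduce framework performs between the two phases.

```python
from collections import Counter, defaultdict

def map_fn(doc_id, text):
    """Data computation: emit (term, (doc_id, term_frequency)) for one document."""
    counts = Counter(text.lower().split())
    for term, tf in counts.items():
        yield term, (doc_id, tf)

def reduce_fn(term, values):
    """Data merging: combine all postings emitted for one term into a sorted list."""
    return term, sorted(values)

def run_job(docs):
    """Local stand-in for the framework: map, group values by key, then reduce."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

# Illustrative sample data, not taken from the thesis.
docs = {1: "hadoop search hadoop", 2: "search engine"}
index = run_job(docs)
```

Because each call to `map_fn` depends only on its own document, the calls can run on different nodes in any order. Only the merge in `reduce_fn` needs to see all values for a given term, which is what makes the non-sequential steps suitable for this decomposition. The term frequencies stored in the postings are the kind of input a TF-IDF score would use; how the thesis combines TF-IDF with PageRank is not specified in the abstract and is not modeled here.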
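Finally, the problems encountered while implementing the system modules with the Map/Reduce programming model, and their solutions, are analyzed in detail. A small four-node distributed search engine system was built, and experimental data were obtained by crawling, indexing, and searching web resources and by testing the system's reliability and scalability. Analysis of the experimental data verifies the soundness of the proposed Hadoop-based distributed search engine.
Keywords: Map/Reduce, Hadoop, distributed computing, search engine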
THE RESEARCH AND IMPLEMENTATION OF
DISTRIBUTED SEARCH ENGINE
BASED ON HADOOP
ABSTRACT
A distributed search engine is a new kind of information retrieval system that combines distributed computing technology with full-text retrieval technology. It has changed the way people obtain information and has made doing so more effective. It now reaches into every aspect of the Internet and is known as the first stop of navigation.
At present, most search engine systems share a similar, centralized structure: all of the system's modules are deployed on one server. As a result, the server must be of high performance, and the system still suffers from poor stability and poor scalability. To overcome these disadvantages, very large and expensive servers must be purchased to satisfy the system requirements, but not everyone can afford such a high cost. In addition, many traditional information retrieval systems adopt a primitive string matching method to obtain results. Although this method is simple, its search efficiency becomes very low when the data volume is large, and users cannot retrieve useful information in time. These two disadvantages pose a big challenge to the promotion of search engines. To meet this challenge, distributed computing and inverted-file full-text retrieval are introduced into the search engine system.
This paper summarizes the advantages and disadvantages of several distributed search engine systems on the basis of an analysis of them. In order to deal with the