Attending an Academic Forum: First Paper


Improved webpage ranking algorithm based on Hadoop

SHI Lei-lei, SHI Hua-ji

(School of Computer Science and Telecommunication Engineering, Jiangsu University, Zhenjiang 212013, China)

Abstract: To address the deficiencies of the Nutch webpage ranking algorithm, this paper adds an improved PageRank algorithm that takes into account user clicks, a time feedback factor, and subject content. A three-node Hadoop distributed cluster was successfully built to implement a distributed search engine system. After the MapReduce programming model was introduced, the search engine's crawling, indexing, and retrieval efficiency, as well as user query satisfaction, improved to a certain extent.

Keywords: Hadoop cluster; MapReduce; Nutch; PageRank

Conclusion:

This paper improves the PageRank algorithm by incorporating webpage update time and user click frequency into the scoring algorithm, which improves the webpage ranking of Nutch. It also designs and implements a search engine system based on MapReduce on a Hadoop cluster. Experiments show that, when processing large amounts of data, the search engine's crawling, indexing, and retrieval efficiency increases with the number of nodes in the cluster.
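The scoring formula itself does not appear in this excerpt, so the following is only a minimal sketch of the general idea, not the authors' method. It expresses one PageRank iteration as a map step and a reduce step, in the style of the MapReduce model, and then blends the rank with click and update-time factors. The function names, the logarithmic click boost, the exponential age decay, and the parameters `click_weight` and `decay` are all assumptions made for illustration. The sketch also assumes every link target appears as a key in the graph.

```python
import math
from collections import defaultdict


def map_phase(graph, ranks):
    """Emit (target, contribution) pairs, as a MapReduce mapper would."""
    for page, outlinks in graph.items():
        if not outlinks:
            continue  # simplification: dangling pages emit nothing
        share = ranks[page] / len(outlinks)
        for target in outlinks:
            yield target, share


def reduce_phase(pairs, pages, damping=0.85):
    """Sum the contributions per page and apply the damping factor."""
    totals = defaultdict(float)
    for target, share in pairs:
        totals[target] += share
    n = len(pages)
    return {p: (1 - damping) / n + damping * totals[p] for p in pages}


def pagerank(graph, iterations=20, damping=0.85):
    """Run a fixed number of map/reduce rounds from a uniform start."""
    pages = list(graph)
    ranks = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        ranks = reduce_phase(map_phase(graph, ranks), pages, damping)
    return ranks


def adjusted_score(rank, clicks, age_days, click_weight=0.1, decay=0.01):
    """Hypothetical blend: boost by click count, discount by page age."""
    click_boost = 1 + click_weight * math.log1p(clicks)
    freshness = math.exp(-decay * age_days)
    return rank * click_boost * freshness


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks = pagerank(graph)
    for page, rank in sorted(ranks.items()):
        print(page, adjusted_score(rank, clicks=10, age_days=30))
```

In a real Hadoop job the two phases would run as distributed mapper and reducer tasks over link data stored in HDFS. Here they are plain Python functions so that the data flow is easy to follow.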

