Web Crawlers: Translated Foreign Literature


A Brief Analysis of Python Web Crawlers

Author: Chen Chao. Source: Education Weekly · Education Forum, 2019, Issue 46. Abstract: A web crawler (Web Spider), also called a web spider or web robot, is a program that automatically collects data from websites.

If the Internet is compared to a spider web, then the Spider is the spider that crawls back and forth across it.

Python is suited to website and desktop application development, automation scripts, complex computing systems, scientific computing, life-support management systems, the Internet of Things, games, robotics, natural language processing, and many other areas.

This article briefly introduces the basic knowledge and techniques needed for targeted information collection, the Python libraries involved, and wrapper implementations of the libraries used for data scraping.

1. Application scenarios. Crawler technology is useful in many fields, including scientific research, Web security, product development, and public-opinion monitoring.

For example: in research areas such as data mining, machine learning, and image processing, data can be gathered from the Web with a crawler when none is available; in Web security, a crawler can check in bulk whether websites contain a particular vulnerability and exploit it; in product development, a crawler can collect the prices of goods across online stores and offer users the lowest market price; in public-opinion monitoring, Weibo data can be scraped and analyzed to identify whether a given user is part of a paid posting operation.

2. Workflow. For targeted information crawling, a crawler mainly performs three steps: data fetching, data parsing, and data storage.

Specifically: (1) data fetching: send a constructed HTTP request and receive an HTTP response containing the required data; (2) data parsing: analyze and clean the raw data in the HTTP response to extract the needed fields; (3) data storage: save the data to a database (or a text file) to build a knowledge base.
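A minimal sketch of these three steps is shown below. It assumes the third-party requests and BeautifulSoup libraries; the target URL, the link selector, and the SQLite table are placeholders chosen for illustration, not anything prescribed by the article.

# Minimal fetch -> parse -> store pipeline (illustrative sketch).
import sqlite3
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/news"          # hypothetical target page

# 1. Data fetching: send a constructed HTTP request, get the response.
resp = requests.get(URL, timeout=10)
resp.raise_for_status()

# 2. Data parsing: clean the raw HTML and pull out the wanted fields.
soup = BeautifulSoup(resp.text, "html.parser")
records = [(a.get_text(strip=True), a["href"])
           for a in soup.select("a[href]")]

# 3. Data storage: save the records into a small knowledge base.
con = sqlite3.connect("crawl.db")
con.execute("CREATE TABLE IF NOT EXISTS links (title TEXT, url TEXT)")
con.executemany("INSERT INTO links VALUES (?, ?)", records)
con.commit()
con.close()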

3. Related techniques. The techniques a crawler relies on include: (1) data fetching: understand the meaning of the fields in HTTP requests and responses, and know the network-analysis tools used to inspect traffic, such as Burp Suite.

In most cases the browser's developer tools are sufficient. (2) Data parsing: understand HTML structure, the JSON and XML data formats, CSS selectors, XPath path expressions, regular expressions, and so on, in order to extract the required data from the response. (3) Data storage: databases such as MySQL, SQLite, and Redis make it easy to store the data. These are the basic requirements for learning to write crawlers; in real applications one must also consider how to use multithreading to improve efficiency, how to schedule tasks, how to cope with anti-crawling measures, how to build a distributed crawler, and so on.
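To illustrate the parsing options just listed, the sketch below extracts the same field with a CSS selector, an XPath expression, and a regular expression. It assumes the lxml library (CSS support additionally needs the cssselect package), and the sample HTML is invented.

# The same field extracted three ways: CSS selector, XPath, and a regex.
import re
from lxml import html

page = '<html><body><div class="price">128.00</div></body></html>'  # made-up sample
tree = html.fromstring(page)

by_css   = tree.cssselect("div.price")[0].text                   # CSS selector
by_xpath = tree.xpath('//div[@class="price"]/text()')[0]         # XPath expression
by_regex = re.search(r'class="price">([\d.]+)<', page).group(1)  # regular expression

print(by_css, by_xpath, by_regex)   # all three print 128.00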

crawller

A Simplified Web Crawler. 1. Basic part. You will write a program that communicates with a Web server over the HTTP protocol.

A browser, for example, is an HTTP client that can talk to Web servers.

The goal of this project is to implement a simple HTTP client that automatically crawls web pages based on a keyword entered by the user.

The automated web crawler you write is called KeywordHunter.

Project description. Runtime parameters: StartURL: the URL at which the crawler starts.

For example: 80/index.html, with 80 as the default port number.

SearchKeyword: the keyword the crawler robot looks for.

It is at most 100 characters long.

If the keyword has still not been found at a crawl depth of 5, the robot is considered unable to find it.

Depth is defined as follows: StartURL has depth 1; the set U of all links contained in StartURL has depth 2; all links contained in the pages of U have depth 3; and so on.

OutputDir: if this parameter is not empty, every crawled page must be saved under the OutputDir path.

The executable of KeywordHunter must be named KeywordHunter.

Your program should be invoked like this: KeywordHunter /index.html FindMeTxt OutputDir, or KeywordHunter /index.html FindMeTxt. Exit code: return 0 when the program runs without errors.

Return 1 when an error such as an invalid command-line argument occurs.

Running correctly but failing to find the keyword does not count as an error.

KeywordHunter must use an HTTP GET request to fetch the content of StartURL.

It then searches for the keyword; if the keyword is found, KeywordHunter reports that page's URL and the line on which the keyword appears, and then stops.

If the keyword is not found, it extracts all the links on the current page and searches level by level.

The program stops when either: 1. the keyword has been found; or 2. the search has reached depth 5 without finding the keyword.

KeywordHunter must crawl pages from an existing HTTP server, for example /index.html on the Berkeley course server.
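A rough sketch of how such a KeywordHunter might be structured follows. The specification does not prescribe a language; this version assumes Python with the requests library, uses a depth-limited breadth-first search, and omits the OutputDir saving step for brevity, so it is an illustration rather than a reference solution.

# KeywordHunter sketch: depth-limited breadth-first keyword search (illustrative).
import sys
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import requests

MAX_DEPTH = 5

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def hunt(start_url, keyword):
    queue, seen = deque([(start_url, 1)]), {start_url}
    while queue:
        url, depth = queue.popleft()
        try:
            text = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for lineno, line in enumerate(text.splitlines(), 1):
            if keyword in line:
                print(f"Found '{keyword}' at {url}, line {lineno}")
                return 0
        if depth < MAX_DEPTH:
            parser = LinkParser()
            parser.feed(text)
            for href in parser.links:
                absolute = urljoin(url, href)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
    print("Keyword not found within depth", MAX_DEPTH)
    return 0          # not finding the keyword is not an error

if __name__ == "__main__":
    if len(sys.argv) < 3:
        sys.exit(1)   # invalid command-line arguments
    sys.exit(hunt(sys.argv[1], sys.argv[2]))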

How to Use Website Extractor

1. Introduction. 1.1 What a website extractor is. Website Extractor is a tool for extracting website data: it automatically scrapes the required information from web pages and turns it into structured data.

With Website Extractor, users can quickly and accurately collect large amounts of data from websites without copying and pasting by hand or browsing page after page.

The tool is commonly used in data mining, market research, competitive analysis, and similar fields, and saves users a great deal of time and effort.

Website Extractor uses web-crawling technology to access and parse the various kinds of information on a web page, such as text, images, and links.

Users can define rules and filter conditions to extract the data they are interested in and save or export it to local files or a database.

Tools of this kind usually have a friendly interface and are easy to operate, so users can get started with data extraction quickly.

Website Extractor is a powerful data-collection tool that helps users easily obtain information from websites and work more efficiently.

With sensible configuration and use, it can satisfy a wide range of website data-extraction needs and yield more useful information and insight.

1.2 What a website extractor does. 1. Content acquisition: a website extractor helps users fetch the required information from a site quickly and accurately, without manual copy-and-paste, greatly improving efficiency.

2. Data analysis: with a website extractor, users can easily analyze and process the extracted data to obtain more useful information and insight.

4. Market research: market researchers can use a website extractor to gather market information quickly, helping them formulate better marketing strategies and decisions.

In short, a website extractor helps users extract data from websites quickly and accurately, analyze and process it, and better understand the market and the competition, so that they can make wiser decisions.

2. Main text. 2.1 Installation steps. 1. Download the installer: download the Website Extractor installer from the official website or another trusted source.

How a Search-Engine Web Spider Crawls

1. Basic principles of web spiders. "Web spider" (WebSpider) is a very vivid name.

If the Internet is compared to a spider web, the Spider is the spider that crawls back and forth across it.

A web spider finds pages through their link addresses: it starts from one page of a site (usually the home page), reads its content, finds the other link addresses in the page, and then follows those links to the next pages, looping in this way until every page of the site has been fetched.

If the whole Internet is treated as a single website, a web spider can in principle use this method to fetch every page on the Internet.

For a search engine, however, crawling every page on the Internet is practically impossible; judging from published figures, even the largest search engines have crawled only about forty percent of all web pages.

One reason is the bottleneck of crawling technology: a crawler cannot traverse every page, and many pages cannot be reached through links from other pages. The other reason lies in storage and processing: if the average page size is taken as 20 KB (including images), 10 billion pages amount to 100 x 2000 GB, and even if that can be stored, downloading remains a problem (at a download rate of 20 KB per second per machine, 340 machines would have to download non-stop for a year to fetch all the pages).

At the same time, because the volume of data is so large, efficiency when serving searches is also affected.

Therefore, many search-engine spiders crawl only the important pages, and the main criterion used to judge importance during crawling is a page's link depth.

When crawling pages, web spiders generally follow one of two strategies: breadth-first or depth-first (as illustrated in the accompanying figure).

Breadth-first means the spider first fetches all the pages linked from the starting page, then picks one of those linked pages and fetches all the pages linked from it in turn.

This is the most common approach, because it allows the spider to work in parallel and thus crawl faster.

Depth-first means the spider starts from the initial page and follows one link after another along a single chain; only after finishing that chain does it move on to the next starting page and continue following links.

An advantage of this method is that the spider is simpler to design.

The difference between the two strategies is made clearer by the accompanying figure.
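The two orders are easy to see on a toy in-memory link graph. The sketch below is purely illustrative: the site structure is made up and no real HTTP requests are made.

# Crawl-order comparison on a toy in-memory link graph (no real HTTP).
from collections import deque

links = {                      # made-up site structure
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}

def bfs(start):
    order, queue, seen = [], deque([start]), {start}
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs(start, seen=None):
    seen = seen if seen is not None else set()
    seen.add(start)
    order = [start]
    for nxt in links[start]:
        if nxt not in seen:
            order.extend(dfs(nxt, seen))
    return order

print("breadth-first:", bfs("A"))   # A, B, C, D, E
print("depth-first:  ", dfs("A"))   # A, B, D, C, E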

Python Web Crawlers

1. Definition of a web crawler. "Web crawler", or Web Spider, is a very vivid name.

If the Internet is compared to a spider web, the Spider is the spider that crawls back and forth across it.

A web spider finds pages through their link addresses.

Starting from one page of a site (usually the home page), it reads the page's content, finds the other link addresses in it, and then follows those links to the next pages, looping in this way until every page of the site has been fetched.

If the whole Internet is viewed as a single website, a web spider can use this principle to fetch every page on the Internet.

Seen this way, a web crawler is simply a crawling program, a program that fetches web pages.

The basic operation of a web crawler is fetching pages.

So how can we obtain exactly the pages we want? Let us start with the URL.

2. What happens when you browse a page. Fetching a page works on the same principle as browsing a page in a browser such as IE.

For example, you type an address into the browser's address bar.

Opening the page means that the browser, acting as a browsing "client", sends a request to the server, "pulls" the server's file down to the local machine, and then interprets and renders it.

HTML is a markup language: content is marked with tags, which are then parsed and told apart.

The browser's job is to parse the HTML code it receives and turn that raw code into the web page we actually see.

3. The concept of a URI, with examples. Put simply, the URL is the string you type into the browser.

Before understanding URLs, we first need to understand URIs.

What is a URI? Every resource available on the Web, such as an HTML document, an image, a video clip, or a program, is located by a Universal Resource Identifier (URI).

A URI typically consists of three parts: (1) the naming mechanism used to access the resource; (2) the host name of the machine storing the resource; (3) the name of the resource itself, given by a path.

We can read an example as follows: (1) this is a resource that can be accessed via the HTTP protocol, and (2)(3) it is reached through the path "/html/html40".

4. Understanding URLs, with examples. A URL is a subset of URI.

URL is short for Uniform Resource Locator.

In plain terms, a URL is a string that describes an information resource on the Internet, and it is used mainly by WWW client and server programs.
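The three parts can be pulled apart with Python's standard urllib.parse module; the URL below is a hypothetical example, not one taken from the text.

# Splitting a URL into the three URI parts described above.
from urllib.parse import urlparse

url = "http://www.example.com/html/html40"   # hypothetical example
parts = urlparse(url)

print(parts.scheme)   # 'http'            -> the naming/access mechanism
print(parts.netloc)   # 'www.example.com' -> the host storing the resource
print(parts.path)     # '/html/html40'    -> the resource's own name (path)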

Web Crawler Lecture Slides
Workflow

The basic architecture of a web crawler is shown in the original figure; the main functions of its parts are as follows:

1. Page collection module: this module is the crawler's interface to the Internet. Its main job is to fetch page data over various Web protocols (mainly HTTP and FTP) and, after saving it, hand the collected pages to the later modules for further processing.

The process is similar to a user opening a page in a browser; the saved pages are then used by the other modules, for example for page analysis and link extraction.
Basic principles of crawlers

Moreover, for some topic-focused crawlers, the results of this analysis may also feed back into and guide later crawling. It is precisely because of this behaviour that these programs are called spiders, crawlers, or robots.
Basic principles of crawlers

How does a Spider fetch all the Web pages? Before the Web appeared, traditional text collections, such as directory...
The essence of vertical search

From a topic-specific domain, obtain and process structured data and metadata that match the search behaviour.

For example, for a digital product such as an MP3 player: memory, dimensions, size, battery model, price, manufacturer, and so on; a price-comparison service can also be offered.
Basic principles of crawlers

A web crawler finds pages through their link addresses. Starting from the URLs of one or more initial pages (usually a site's home page), it traverses the Web space, reads page content, and keeps moving from one site to another while automatically building an index. While fetching pages it finds the other link addresses in them, parses the HTML files, extracts the sub-links of each page and adds them to the page database, and keeps pulling new URLs from the current page into the queue. This loop continues until every page of the site has been fetched and the system's stopping condition is met.

As the crawl proceeds, this set of future work keeps growing, and a writer flushes the data to disk to free main memory and to avoid data loss if the crawler crashes. There is no guarantee that every Web page is visited in this way; the crawler never stops, and pages keep being added while the Spider runs.

Web Crawlers

Workflow

3. Link filtering module: this module filters out duplicate links and circular links. For example, relative paths must be completed into full URLs before being added to the queue of URLs to be fetched. At this point, URLs already in the queue and URLs that would create loops are generally filtered out.

Workflow

4. Page repository: stores the pages that have already been fetched, for later processing. 5. Queue of URLs to be fetched: the URLs extracted from collected pages and processed accordingly; when the queue is empty the crawler terminates. 6. Initial URLs: the seed URLs that start the crawler.

URL search strategies

Depth-first search follows the hyperlinks in an HTML file until it can go no deeper, then returns to some HTML file and continues with that file's other hyperlinks. When no other hyperlink remains to be chosen, the search is over. An advantage of this method is that the spider is easier to design.

With the depth-first strategy the fetch order is A-F-G, E-H-I, B, C, D. The common strategies today are breadth-first and best-first.

URL search strategies

Another approach combines breadth-first search with page filtering: pages are first crawled breadth-first, and the irrelevant ones are then filtered out. The drawback of these methods is that as the number of crawled pages grows, large numbers of irrelevant pages are downloaded and filtered, and the algorithm becomes less efficient.

With the breadth-first strategy the fetch order is A-B, C, D, E, F-G, H-I.

URL search strategies

Workflow

2. Page analysis module: the main function of this module is to analyze the pages collected by the page collection module, extract the hyperlinks that satisfy the user's requirements, and add them to the hyperlink queue. The URLs found in page links come in many forms: some are complete, with protocol, site, and path; some omit part of that; some are relative paths. For convenience they are usually normalized first, that is, converted into a single uniform format.
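A small sketch of that normalization step follows, using Python's standard urllib.parse; the base page and the extracted links are invented for illustration.

# Normalizing extracted links before they enter the URL queue (sketch).
from urllib.parse import urljoin, urldefrag

def normalize(base_url, href, seen):
    absolute = urljoin(base_url, href)        # complete a relative path
    absolute, _ = urldefrag(absolute)         # drop the #fragment part
    if absolute in seen:                      # filter duplicate / looping URLs
        return None
    seen.add(absolute)
    return absolute

seen = set()
base = "http://www.example.com/a/index.html"          # hypothetical page
for href in ["b.html", "../c.html", "b.html#top"]:    # made-up extracted links
    print(normalize(base, href, seen))
# -> http://www.example.com/a/b.html
# -> http://www.example.com/c.html
# -> None  (duplicate after normalization)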

Web Crawlers: Translated Foreign References

(The original document contains the English source text and a Chinese translation.) Translation: Exploring search-engine crawlers. With the Web expanding at an almost unimaginable rate, extracting knowledge from it has gradually become a popular approach.

This is due to the Web's convenience and its wealth of information.

We usually have to rely on crawl-based search engines to find the pages we need.

This paper describes the basic tasks a search engine performs.

It outlines the relationship between search engines and web crawlers.

Keywords: crawling, focused crawling, web crawler. 1. Introduction. The WWW is a service that resides on computers connected to the Internet and lets end users access the data stored on those computers through standard interface software.

The World Wide Web is the universe of network-accessible information and an embodiment of human knowledge.

A search engine is a computer program that searches the Web and scans for particular keywords, especially as a commercial service, and returns lists of the material it finds; a search engine's database is populated mainly by receiving submissions from authors who want their work listed, or by having "web crawlers", "spiders", or "robots" roam the Internet and capture the links and information of the pages they visit.

A web crawler is a program that automatically retrieves information from the World Wide Web.

Web page retrieval is an important research topic.

A crawler is a software component that visits the tree structure of the Web and, following certain strategies, searches for and collects objects into a local repository.

The rest of this paper is organized as follows: in Section 2 we explain the background of Web crawlers.

In Section 3 we discuss the types of crawlers, and in Section 4 we describe how web crawlers work.

In Section 5 we present the advanced techniques of two web crawlers.

In Section 6 we discuss a selection of the more interesting open problems.

2. A survey of web crawlers. Web crawlers are almost as old as the Web itself.

The first crawler, Matthew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA Mosaic.

Several papers about web crawling were presented at the first two World Wide Web conferences.

At the time, however, the Web was three to four orders of magnitude smaller than it is now, so those systems did not address the scaling problems inherent in crawling today's Web.

Obviously, the crawlers used by all the popular search engines must scale to a substantial portion of the Web.

However, because running a search engine is a competitive business, the designs of these crawlers have not been publicly described.

There are two notable exceptions: the Google crawler and the Internet Archive crawler.

Foreign Literature on Web Crawlers

Four full sample texts of foreign literature on crawlers are provided for the reader's reference.

Sample 1: Web crawler, also known as web spider, web robot, or simply crawler, is a program or automated script that browses the World Wide Web in a methodical and automated manner. It is used to discover and retrieve information from websites across the Internet. In this article, we will discuss the key concepts and functionalities of web crawlers. 1. Introduction to Web Crawlers: A typical web crawler consists of the following components: Web crawling is a complex and challenging task due to the dynamic and ever-changing nature of the web. Some of the key challenges faced by web crawlers include:

Sample 2: Abstract. This article provides an overview of web crawling, including its applications, challenges, and best practices. It also discusses the ethical considerations that come with crawling websites and the legal implications of scraping data from websites without permission. Introduction. Web crawling is commonly used for a variety of purposes, such as: Challenges of Web Crawling:
- Politeness: Web crawlers need to be polite and respectful when accessing websites to avoid overloading servers and getting blocked.
- Dynamic content: Websites with dynamic content, such as JavaScript or AJAX, can be difficult to crawl as the content may not be readily accessible.
- Authentication: Crawling websites that require user authentication can be challenging as bots may not have the necessary credentials to access the content.
- Data extraction: Extracting data from websites in a structured format can be challenging, especially when dealing with unstructured or poorly formatted data.
- URL management: Managing a large number of URLs and ensuring that all links are crawled efficiently can be a daunting task.
Ethical Considerations. Legal Implications.

Sample 3: Spider silk is a unique material that is strong, elastic and lightweight. It is made up of proteins and has a complex molecular structure that gives it its unique properties. Spider silk has been used in various applications, such as medical sutures, fishing lines, and even bulletproof vests. Scientists have been trying to replicate the properties of spider silk in the lab, but it is a challenging task due to the complexity of the material.

Sample 4: Title: A Study of Web Crawling Technology. Web crawling is the process of automatically browsing the Internet to retrieve and index information from web pages. This technology is widely used in various fields, including search engine optimization, data mining, market research, and competitive intelligence. By collecting and analyzing data from different websites, web crawlers can provide valuable insights and help users make informed decisions.

Translated Foreign Text: Web Crawlers

Crawling the web is deceptively simple: the basic algorithm is (a)Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.1. INTRODUCTIONA recent Pew Foundation study [31] states that “Search engines have become an indispensable utility for Internet users” and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collectingweb pages that eventually constitute the search engine corpus.Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from nonweb sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is(a) Fetch a page(b) Parse it to extract all linked URLs(c) For all the URLs not seen before, repeat (a)–(c)Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as , but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. 
Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS) – they are easy to implement and taught in many introductory algorithms classes. (See for instance [34]).However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.2. Web pages are changing rapidly. If “change” means “any change”, then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].These two factors imply that to obtain a reasonably fresh and 679 completesnapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.A crucial way to speed up the membership test is to cache a (dynamic) subset of the “seen” URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries and show how these caches can be implemented very efficiently.The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.2. CRAWLINGWeb crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.The crawler used by the Internet Archive [10] employs multiple crawlingprocesses, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.The original Google crawler, described in [7], implements the different crawler components as different processes. 
A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL Server. The various processes communicate via the file system.For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL’s host component. A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.Cho and Garcia-Molina’s crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called “C-procs”). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL’s host part), and hierarchic al (assigning an URL to a C-proc based on some property of the URL, such as its top-level domain).The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the “ants”). An ant that discovers an URL for wh ich it is not responsible, sends this URL to a dedicated process (the “controller”), which forwards the URL to the appropriate ant.UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.Shkapenyuk and Suel’s crawler [35] is similar to Google’s; the different crawler componen ts are implemented as different processes. A “crawling application” maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a “crawl manager”, which forwards them to a pool of “downloader” processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk and caching popular URLs in memory is a win: Caching allows the crawler to discard a large fraction of the URLs without having to consult the disk-based set.Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler[4], and Cho and Molina’s crawler [13], are comprised of cooperating crawling processes, each of which downloads web pages, extracts their links, and sends these links to the peer crawling process responsible for it. 
However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL to a peer crawler goes a long way toward reducing transmissions to peer crawlers, as we show in the remainder of this paper.3. CACHINGIn most computer systems, memory is hierarchical, that is, there exist two or more levels of memory, representing different tradeoffs between size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In anetwork environment, the hierarchy continues with network accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a faster memory. In the right circumstances, caching greatly improves the performance of the overall system and hence it is a fundamental technique in the design of operating systems, discussed at length in any standard textbook [21, 37]. In the web context, caching is often mentionedin the context of a web proxy caching web pages [26, Chapter 11]. In our web crawler context, since the number of visited URLs becomes too large to store in main memory, we store the collection of visited URLs on disk, and cache a small portion in main memory.Caching terminology is as follows: the cache is memory used to store equal sized atomic items. A cache has size k if it can store at most k items.1 At each unit of time, the cache receives a request for an item. If the requested item is in the cache, the situation is called a hit and no further action is needed. Otherwise, the situation is called a miss or a fault. If the cache has fewer than k items, the missed item is added to the cache. Otherwise, the algorithm must choose either to evict an item from the cache to make room for the missed item, or not to add the missed item. The caching policy or caching algorithm decides which item to evict. The goal of the caching algorithm is to minimize the number of misses.Clearly, the larger the cache, the easier it is to avoid misses. Therefore, the performance of a caching algorithm is characterized by the miss ratio for a given size cache. In general, caching is successful for two reasons:_ Non-uniformity of requests. Some requests are much more popular than others. In our context, for instance, a link to is a much more common occurrence than a link to the authors’ home pages._ Temporal correlation or locality of reference. Current requests are more likely to duplicate requests made in the recent past than requests made long ago. The latter terminology comes from the computer memory model – data needed now is likely to be close in the address space to data recently needed. In our context, temporal correlation occurs first because links tend to be repeated on the same page – we foundthat on average about 30% are duplicates, cf. Section 4.2, and second, because pages on a given host tend to be explored sequentially and they tend to share many links. For example, many pages on a Computer Science department server are likely to share links to other Computer Science departments in the world, notorious papers, etc.Because of these two factors, a cache that contains popular requests and recent requests is likely to perform better than an arbitrary cache. 
Caching algorithms try to capture this intuition in various ways.We now describe some standard caching algorithms, whose performance we evaluate in Section 5.3.1 Infinite cache (INFINITE)This is a theoretical algorithm that assumes that the size of the cache is larger than the number of distinct requests.3.2 Clairvoyant caching (MIN)More than 35 years ago, L´aszl´o Belady [2] showed that if the entire sequence of requests is known in advance (in other words, the algorithm is clairvoyant), then the best strategy is to evict the item whose next request is farthest away in time. This theoretical algorithm is denoted MIN because it achieves the minimum number of misses on any sequence and thus it provides a tight bound on performance.3.3 Least recently used (LRU)The LRU algorithm evicts the item in the cache that has not been requested for the longest time. The intuition for LRU is that an item that has not been needed for a long time in the past will likely not be needed for a long time in the future, and therefore the number of misses will be minimized in the spirit of Belady’s algorithm.Despite the admonition that “past performance is no guarantee of future results”, sadly verified by the current state of the stock markets, in practice, LRU is generally very effective. However, it requires maintaining a priority queue of requests. This queue has a processing time cost and a memory cost. The latter is usually ignored in caching situations where the items are large.3.4 CLOCKCLOCK is a popular approximation of LRU, invented in the late sixties [15]. An array of mark bits M0;M1; : : : ;Mk corresponds to the items currently in the cache of size k. The array is viewed as a circle, that is, the first location follows the last. A clock handle points to one item in the cache. When a request X arrives, if the item X is in the cache, then its mark bit is turned on. Otherwise, the handle moves sequentially through the array, turning the mark bits off, until an unmarked location is found. The cache item corresponding to the unmarked location is evicted and replaced by X.3.5 Random replacement (RANDOM)Random replacement (RANDOM) completely ignores the past. If the item requested is not in the cache, then a random item from the cache is evicted and replaced.In most practical situations, random replacement performs worse than CLOCK but not much worse. Our results exhibit a similar pattern, as we show in Section 5. RANDOM can be implemented without any extra space cost; see Section 6.3.6 Static caching (STATIC)If we assume that each item has a certain fixed probability of being requested, independently of the previous history of requests, then at any point in time the probability of a hit in a cache of size k is maximized if the cache contains the k items that have the highest probability of being requested.There are two issues with this approach: the first is that in general these probabilities are not known in advance; the second is that the independence of requests, although mathematically appealing, is antithetical to the locality of reference present in most practical situations.In our case, the first issue can be finessed: we might assume that the most popular k URLs discovered in a previous crawl are pretty much the k most popular URLs in the current crawl. (There are also efficient techniques for discovering the most popular items in a stream of data [18, 1, 11]. Therefore, an on-line approach might work as well.) 
Of course, for simulation purposes we can do a first pass over our input to determine the k most popular URLs, and then preload the cache withthese URLs, which is what we did in our experiments.The second issue above is the very reason we decided to test STATIC: if STATIC performs well, then the conclusion is that there is little locality of reference. If STATIC performs relatively poorly, then we can conclude that our data manifests substantial locality of reference, that is, successive requests are highly correlated.4. EXPERIMENTAL SETUPWe now describe the experiment we conducted to generate the crawl trace fed into our tests of the various algorithms. We conducted a large web crawl using an instrumented version of the Mercator web crawler [29]. We first describe the Mercator crawler architecture, and then report on our crawl.4.1 Mercator crawler architectureA Mercator crawling system consists of a number of crawling processes, usually running on separate machines. Each crawling process is responsible for a subset of all web servers, and consists of a number of worker threads (typically 500) responsible for downloading and processing pages from these servers.Each worker thread repeatedly performs the following operations: it obtains a URL from the URL Frontier, which is a diskbased data structure maintaining the set of URLs to be downloaded; downloads the corresponding page using HTTP into a buffer (called a RewindInputStream or RIS for short); and, if the page is an HTML page, extracts all links from the page. The stream of extracted links is converted into absolute URLs and run through the URL Filter, which discards some URLs based on syntactic properties. For example, it discards all URLs belonging to web servers that contacted us and asked not be crawled.The URL stream then flows into the Host Splitter, which assigns URLs to crawling processes using a hash of the URL’s host name. Since most links are relative, most of the URLs (81.5% in our experiment) will be assigned to the local crawling process; the others are sent in batches via TCP to the appropriate peer crawling processes. Both the stream of local URLs and the stream of URLs received from peer crawlers flow into the Duplicate URL Eliminator (DUE). The DUE discards URLs that have been discovered previously. The new URLs are forwarded to the URLFrontier for future download. In order to eliminate duplicate URLs, the DUE must maintain the set of all URLs discovered so far. Given that today’s web contains several billion valid URLs, the memory requirements to maintain such a set are significant. Mercator can be configured to maintain this set as a distributed in-memory hash table (where each crawling process maintains the subset of URLs assigned to it); however, this DUE implementation (which reduces URLs to 8-byte checksums, and uses the first 3 bytes of the checksum to index into the hash table) requires about 5.2 bytes per URL, meaning that it takes over 5 GB of RAM per crawling machine to maintain a set of 1 billion URLs per machine. These memory requirements are too steep in many settings, and in fact, they exceeded the hardware available to us for this experiment. Therefore, we used an alternative DUE implementation that buffers incoming URLs in memory, but keeps the bulk of URLs (or rather, their 8-byte checksums) in sorted order on disk. 
Whenever the in-memory buffer fills up, it is merged into the disk file (which is a very expensive operation due to disk latency) and newly discovered URLs are passed on to the Frontier.Both the disk-based DUE and the Host Splitter benefit from URL caching. Adding a cache to the disk-based DUE makes it possible to discard incoming URLs that hit in the cache (and thus are duplicates) instead of adding them to the in-memory buffer. As a result, the in-memory buffer fills more slowly and is merged less frequently into the disk file, thereby reducing the penalty imposed by disk latency. Adding a cache to the Host Splitter makes it possible to discard incoming duplicate URLs instead of sending them to the peer node, thereby reducing the amount of network traffic. This reduction is particularly important in a scenario where the individual crawling machines are not connected via a high-speed LAN (as they were in our experiment), but are instead globally distributed. In such a setting, each crawler would be responsible for web servers “close to it”.Mercator performs an approximation of a breadth-first search traversal of the web graph. Each of the (typically 500) threads in each process operates in parallel, which introduces a certain amount of non-determinism to the traversal. More importantly, the scheduling of downloads is moderated by Mercator’s politenesspolicy, which limits the load placed by the crawler on any particular web server. Mercato r’s politeness policy guarantees that no server ever receives multiple requests from Mercator in parallel; in addition, it guarantees that the next request to a server will only be issued after a multiple (typically 10_) of the time it took to answer the previous request has passed. Such a politeness policy is essential to any large-scale web crawler; otherwise the crawler’s operator becomes inundated with complaints. 4.2 Our web crawlOur crawling hardware consisted of four Compaq XP1000 workstations, each one equipped with a 667 MHz Alpha processor, 1.5 GB of RAM, 144 GB of disk2, and a 100 Mbit/sec Ethernet connection. The machines were located at the Palo Alto Internet Exchange, quite close to the Internet’s backbone.The crawl ran from July 12 until September 3, 2002, although it was actively crawling only for 33 days: the downtimes were due to various hardware and network failures. During the crawl, the four machines performed 1.04 billion download attempts, 784 million of which resulted in successful downloads. 429 million of the successfully downloaded documents were HTML pages. These pages contained about 26.83 billion links, equivalent to an average of 62.55 links per page; however, the median number of links per page was only 23, suggesting that the average is inflated by some pages with a very high number of links. Earlier studies reported only an average of 8 links [9] or 17 links per page [33]. We offer three explanations as to why we found more links per page. First, we configured Mercator to not limit itself to URLs found in anchor tags, but rather to extract URLs from all tags that may contain them (e.g. image tags). This configuration increases both the mean and the median number of links per page. Second, we configured it to download pages up to 16 MB in size (a setting that is significantly higher than usual), making it possible to encounter pages with tens of thousands of links. Third, most studies report the number of unique links per page. The numbers above include duplicate copies of a link on a page. 
If we only consider unique links3 per page, then the average number of links is 42.74 and the median is 17.The links extracted from these HTML pages, plus about 38 million HTTPredirections that were encountered during the crawl, flowed into the Host Splitter. In order to test the effectiveness of various caching algorithms, we instrumented Mercator’s Host Splitter component to log all incoming URLs to disk. The Host Splitters on the four crawlers received and logged a total of 26.86 billion URLs.After completion of the crawl, we condensed the Host Splitter logs. We hashed each URL to a 64-bit fingerprint [32, 8]. Fingerprinting is a probabilistic technique; there is a small chance that two URLs have the same fingerprint. We made sure there were no such unintentional collisions by sorting the original URL logs and counting the number of unique URLs. We then compared this number to the number of unique fingerprints, which we determined using an in-memory hash table on a very-large-memory machine. This data reduction step left us with four condensed host splitter logs (one per crawling machine), ranging from 51 GB to 57 GB in size and containing between 6.4 and 7.1 billion URLs.In order to explore the effectiveness of caching with respect to inter-process communication in a distributed crawler, we also extracted a sub-trace of the Host Splitter logs that contained only those URLs that were sent to peer crawlers. These logs contained 4.92 billion URLs, or about 19.5% of all URLs. We condensed the sub-trace logs in the same fashion. We then used the condensed logs for our simulations.5. SIMULATION RESULTSWe studied the effects of caching with respect to two streams of URLs:1. A trace of all URLs extracted from the pages assigned to a particular machine. We refer to this as the full trace.2. A trace of all URLs extracted from the pages assigned to a particular machine that were sent to one of the other machines for processing. We refer to this trace as the cross subtrace, since it is a subset of the full trace.The reason for exploring both these choices is that, depending on other architectural decisions, it might make sense to cache only the URLs to be sent to other machines or to use a separate cache just for this purpose.We fed each trace into implementations of each of the caching algorithmsdescribed above, configured with a wide range of cache sizes. We performed about 1,800 such experiments. We first describe the algorithm implementations, and then present our simulation results.5.1 Algorithm implementationsThe implementation of each algorithm is straightforward. We use a hash table to find each item in the cache. We also keep a separate data structure of the cache items, so that we can choose one for eviction. For RANDOM, this data structure is simply a list. For CLOCK, it is a list and a clock handle, and the items also contain “mark” bits. For LRU, it is a heap, organized by last access time. STATIC needs no extra data structure, since it never evicts items. MIN is more complicated since for each item in the cache, MIN needs to know when the next request for that item will be. We therefore describe MIN in more detail. Let A be the trace or sequence of requests, that is, At is the item requested at time t. We create a second sequence Nt containing the time when At next appears in A. If there is no further request for At after time t, we set Nt = 1. 
Formally, Nt = min { t' > t : At' = At }. To generate the sequence Nt, we read the trace A backwards, that is, from tmax down to 0, and use a hash table with key At and value t. For each item At, we probe the hash table. If it is not found, we set Nt = 1 and store (At, t) in the table. If it is found, we retrieve (At, t'), set Nt = t', and replace (At, t') by (At, t) in the hash table. Given Nt, implementing MIN is easy: we read At and Nt in parallel, and hence for each item requested, we know when it will be requested next. We tag each item in the cache with the time when it will be requested next, and if necessary, evict the item with the highest value for its next request, using a heap to identify it quickly.

5.2 Results

We present the results for only one crawling host. The results for the other three hosts are quasi-identical. Figure 2 shows the miss rate over the entire trace (that is, the percentage of misses out of all requests to the cache) as a function of the size of the cache. We look at cache sizes from k = 2^0 to k = 2^25. In Figure 3 we present the same data relative to the miss rate of MIN, the optimum off-line algorithm. The same simulations for the cross-trace are depicted in Figures 4 and 5.
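As a purely illustrative aid to the CLOCK policy of Section 3.4 above, here is a compact sketch in Python. It is a simplified toy, not the instrumented implementation used in the paper's experiments, and the sample request stream is invented.

# Simplified CLOCK cache in the spirit of Section 3.4 (illustrative only).
class ClockCache:
    def __init__(self, k):
        self.k = k
        self.slots = [None] * k      # cached items (e.g. URLs)
        self.marks = [False] * k     # one mark bit per slot
        self.index = {}              # item -> slot position
        self.hand = 0

    def request(self, item):
        """Return True on a hit, False on a miss (the item is then inserted)."""
        pos = self.index.get(item)
        if pos is not None:
            self.marks[pos] = True   # hit: turn the mark bit on
            return True
        # miss: advance the hand, clearing mark bits, until an unmarked slot
        while self.marks[self.hand]:
            self.marks[self.hand] = False
            self.hand = (self.hand + 1) % self.k
        evicted = self.slots[self.hand]
        if evicted is not None:
            del self.index[evicted]
        self.slots[self.hand] = item
        self.index[item] = self.hand
        self.marks[self.hand] = True
        self.hand = (self.hand + 1) % self.k
        return False

cache = ClockCache(k=4)
hits = sum(cache.request(u) for u in ["a", "b", "a", "c", "d", "a", "e", "a"])
print("hits:", hits)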

The Anatomy of a Large-Scale Hypertextual Web Search Engine: A Complete Translation

This is the paper that Google's founders, Sergey and Larry, wrote while they were PhD students in the Computer Science Department at Stanford University.

It was published in 1997.

No complete Chinese translation existed on the Web, so the original text, the few sentences I translated myself, and fragments collected from the Web (selflessly contributed by the netizens xfygx and 雷声大雨点大) have been combined here. A translation tool was used to help; because this is a technical paper with many compound terms and long sentences, some passages are translated freely rather than literally.

As the starting point of Google's success, this paper has great commemorative value, but because of its age, much of what it describes differs considerably from today's technology.

Even so, many of the ideas in it are still worth learning from.

My ability is limited and I may have misunderstood parts of the content, so please consult the English original.

The Anatomy of a Large-Scale Hypertextual Web Search Engine. Sergey Brin and Lawrence Page, {sergey, page}@, Computer Science Department, Stanford University, Stanford, CA 94305. Abstract: In this paper we discuss Google, a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext.

Google is designed to crawl and index the Web efficiently and to produce much more satisfying search results than existing systems.

The prototype's database includes the full text of 24 million pages and the links between them; it can be accessed at /.

Engineering a search engine is a challenging task.

Search engines index hundreds of millions of web pages of different kinds and answer more than ten million queries every day.

Despite the importance of large search engines on the Web, very little academic research has been published about them.

Furthermore, due to rapid advances in technology and the explosive growth of the Web, creating a search engine today is nothing like it was three years ago.

This paper provides an in-depth description of our large-scale web search engine, the first such detailed public description to date.

Apart from the problem of scaling traditional search techniques to data of this magnitude, there are new technical challenges in using the additional information present in hypertext to produce better search results.

Basic Knowledge of Web Crawlers

Compared with a general-purpose crawler, a focused crawler must also solve three main problems:

(1) describing or defining the crawl target; (2) analyzing and filtering pages and data; (3) the URL search strategy. The description and definition of the crawl target form the basis on which the page-analysis algorithm and the URL search strategy are designed, while the page-analysis algorithm and the candidate-URL ranking algorithm determine the form of service the search engine provides and the crawler's page-fetching behaviour. The two parts are closely related.

Another approach combines breadth-first search with page filtering: pages are first crawled breadth-first and the irrelevant ones are then filtered out. The drawback of such methods is that as the number of crawled pages grows, large numbers of irrelevant pages are downloaded and filtered, and the algorithm becomes less efficient.

3.1.2 Best-first search strategy

Best-first search uses a page-analysis algorithm to predict how similar a candidate URL is to the target page, or how relevant it is to the topic, and selects the one or several best-rated URLs to fetch. It visits only the pages that the page-analysis algorithm predicts to be "useful".
Abiteboul
He designed a crawling strategy based on OPIC (On-line Page Importance Computation). In OPIC every page starts with an equal amount of "cash", which it distributes evenly among the pages it points to. The algorithm is similar to PageRank, but it is faster and can be completed in a single pass. OPIC first fetches the page holding the largest cash value; the experiment was run on 100,000 simulated pages with a power-law link distribution. However, it was not compared against other strategies, nor tested on real Web pages.
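A toy sketch of the OPIC idea is shown below. The link graph is made up, the loop runs a fixed number of rounds, and the refinement OPIC uses for pages without out-links (a virtual page) is omitted, so this only illustrates the cash-distribution step rather than the full algorithm.

# Toy OPIC-style "cash" scheduling on a made-up link graph (simplified).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A"]}

cash = {page: 1.0 for page in links}          # equal initial value per page
history = {page: 0.0 for page in links}       # total cash a page has received
crawled_order = []

for _ in range(6):                            # a few scheduling rounds
    page = max(cash, key=cash.get)            # fetch the page holding most cash
    crawled_order.append(page)
    amount, out = cash[page], links[page]
    history[page] += amount
    cash[page] = 0.0
    for target in out:                        # split its cash among its out-links
        cash[target] += amount / len(out)

print(crawled_order)
print(history)   # pages with a larger history are treated as more important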
Google's later improvements mainly include: (1) using its own file system (GFS) and database system (Bigtable) to store and access data; (2) using MapReduce to process the various computations in a distributed way.
4.2 Mercator
Allan Heydon and Marc Najork of the Compaq Systems Research Center designed a crawler named Mercator. The system uses Java's multithreading and synchronization to process pages in parallel, and adds many optimizations, such as DNS caching and delayed storage, to improve the crawler's efficiency. Its data structures occupy only a limited amount of memory regardless of the scale of the crawl: most of the data lives on disk, with only a bounded portion held in memory, so the design scales very well.

Web Crawlers

An introduction to search-engine crawler technology. The term "web crawler" is a loose translation of Spider (also Crawler, robots, wanderer, and similar names).

The definition of a web crawler can be broad or narrow. The narrow definition: a software program that uses the standard HTTP protocol to traverse the information space of the World Wide Web by following hyperlinks and retrieving Web documents.

The broad definition: any software that can retrieve Web documents over the HTTP protocol is called a web crawler.

A web crawler is a powerful program for automatically fetching web pages; it downloads pages from the World Wide Web for a search engine and is an important component of one.

It can "crawl" and search the Web automatically, entirely without user intervention.

1. Crawler search strategies. Compared with traditional search techniques, crawler programs have incomparable advantages, and the crawler's search strategy plays a large part in this.

1.1 Depth-first search strategy. In an HTML file, once a hyperlink is selected, the linked HTML file is searched depth-first: a single chain must be searched to the end before the remaining hyperlinks are searched.

Its advantage is that it can traverse an entire Web site or a deeply nested collection of documents.

Its drawback is that, because the Web's structure is very deep, the crawler may go in and never come back out.

1.2 Breadth-first search strategy. In breadth-first search, all the hyperlinks on one Web page are searched first, and then the next level, continuing until the lowest level.

Breadth-first is usually the best strategy for implementing a crawler: it is easy to implement and provides most of the functionality one wants.

However, to traverse a particular site or a deeply nested set of HTML files, breadth-first search takes a rather long time to reach the deep HTML files.

1.3 Focused search strategy. A traditional crawler starts from the URLs of one or more seed pages, obtains the URLs on those pages, and, while fetching pages, keeps extracting new URLs from the current page into the queue until the system's stopping condition is met.

The workflow of a focused crawler is more complex: using a page-analysis algorithm, it filters out links unrelated to the topic, keeps the useful links, and puts them into the queue of URLs waiting to be fetched.

It then selects the next page URL to fetch from the queue according to a search strategy and repeats the process until some system condition is reached.

This strategy is usually used in specialized search engines, because such engines only care about pages on a particular topic.
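The filtering step of a focused crawler can be sketched very simply. In the toy example below the "page-analysis algorithm" is just keyword matching on anchor text, which stands in for the real topic-relevance model the passage refers to; the topic vocabulary, URLs, and anchor texts are invented.

# Focused-crawler link filter: only topic-relevant links enter the queue (sketch).
from collections import deque

TOPIC_WORDS = {"python", "crawler", "spider"}   # made-up topic vocabulary

def relevant(anchor_text):
    words = set(anchor_text.lower().split())
    return bool(words & TOPIC_WORDS)            # crude relevance test

queue = deque()
extracted = [                                   # (url, anchor text), invented data
    ("http://example.com/python-tips", "Python crawler tips"),
    ("http://example.com/recipes", "Holiday recipes"),
    ("http://example.com/scrapy", "A spider framework"),
]
for url, anchor in extracted:
    if relevant(anchor):                        # irrelevant links are dropped
        queue.append(url)

print(list(queue))   # only the two topic-related URLs are kept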

Web Crawler Technology Explained
Downloader middlewares: specific hooks between the engine and the downloader that process the responses the downloader passes to the engine; they provide a simple mechanism for extending Scrapy by plugging in custom code.

Spider middlewares: specific hooks between the engine and the spiders that process the spiders' input (responses) and output (items and requests); they too provide a simple mechanism for extending Scrapy with custom code.

Downloader: responsible for fetching page data and handing it to the engine, which passes it on to the spiders.

Spiders: classes written by the Scrapy user to parse responses and extract items (the scraped data) or additional URLs to follow; each spider handles one particular site (or a few).

Item Pipeline: processes the items extracted by the spiders; typical tasks are cleaning, validation, and persistence (for example, saving to a database).
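For orientation, here is a minimal stand-alone Scrapy spider of the kind these components serve. It targets the public practice site quotes.toscrape.com, whose markup the selectors match, and can be run with "scrapy runspider"; it is a generic example rather than part of the material above.

# Minimal Scrapy spider (run with: scrapy runspider quotes_spider.py -o quotes.json)
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]   # public practice site

    def parse(self, response):
        # Items yielded here are handed on to the Item Pipeline.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow-up requests go back through the engine and the downloader.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)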
Project architecture. Scrapy-Redis is a Redis-based distributed component for Scrapy. It uses Redis to store and schedule the requests to be crawled and to store the items the crawl produces for later processing. scrapy-redis rewrites some key pieces of Scrapy code, turning Scrapy into a distributed crawler that can run on several hosts at the same time.
Fiddler

Fiddler is a handy packet-capture tool that can intercept, resend, and edit the packets sent and received over the network; it can also be used to monitor traffic.

Fiddler is an HTTP debugging proxy: it records and inspects all HTTP traffic between your computer and the Internet, lets you set breakpoints, and shows all the data "entering and leaving" Fiddler (cookies, HTML, JS, CSS, and other files). Fiddler is simpler than other network debuggers because it not only exposes HTTP traffic but also presents it in a user-friendly format.

How a Python Crawler Works

A Python web crawler is a powerful tool used to extract and store information from websites. It operates by sending HTTP requests to web pages, retrieving the HTML content, and parsing and extracting the desired data.

One of the key principles behind Python web crawlers is web scraping. Web scraping involves extracting information from websites, typically using programs to simulate human web browsing and retrieve information from web pages. Python web crawlers are commonly used for web-scraping tasks, such as extracting product information from e-commerce sites or gathering data for research and analysis.

Terms Related to Crawlers

The following terms are related to (Web) crawlers:
1. Web crawler: an automated program that browses the Internet and collects information, typically used for search-engine indexing, data collection, and similar tasks.

2. Crawling: the process by which a crawler program visits web pages and extracts information from them.

3. Web scraping: the process of using a crawler program to extract data from web pages.

4. User-Agent: when sending an HTTP request, a crawler identifies itself through the User-Agent header field so that the server can recognize it.

5. Robots.txt: a text file in a website's root directory that tells crawler programs which pages may be crawled and which may not be accessed (see the sketch after this list).

6. Anti-crawling: the measures a website takes to prevent crawler programs from accessing and scraping its data, such as CAPTCHAs and IP bans.

7. Data cleansing: processing and organizing the data scraped from web pages, for example removing redundant information and fixing errors.

8. Data storage: saving the data scraped from web pages into a database or files for later analysis and use.

9. Crawling strategy: the rules that govern a crawler's behaviour when visiting pages, such as visit frequency and concurrency.

10. Anti-crawling strategy: the measures a website takes to stop crawlers from scraping its data, such as limiting request frequency and adding CAPTCHAs.
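The User-Agent and Robots.txt entries above translate directly into code. The sketch below uses Python's standard urllib.robotparser together with the requests library; the bot name and URLs are made-up placeholders.

# Checking robots.txt and sending an explicit User-Agent (sketch).
import urllib.robotparser
import requests

AGENT = "MyCrawler/0.1 (+http://example.com/bot)"   # made-up bot identity

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                           # fetch and parse robots.txt

url = "https://example.com/some/page"
if rp.can_fetch(AGENT, url):                        # honour the crawl rules
    resp = requests.get(url, headers={"User-Agent": AGENT}, timeout=10)
    print(resp.status_code)
else:
    print("robots.txt disallows fetching", url)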

How to Use "scrape": A Reply

Usage of "scrape". Scrape is an English verb meaning to remove something from a surface quickly and lightly with a sharp, hard object such as a knife, a scraper, or a fingernail.

In everyday life we often use a scraper or a brush to clean the bottom of a pan, window glass, or stains on a wall.

In computer programming, however, "scrape" has a completely different meaning.

In this article we will walk through the common uses of scrape in programming, step by step.

One common use of scrape is web scraping.

A web scraper is an automated program that can simulate the way a human user browses pages on the Internet.

By extracting information from web pages, building data sets, and analyzing them, it helps people obtain large amounts of data quickly.

Web scraping is widely used in finance, healthcare, market research, and other fields.

Let us go through, step by step, how to do web scraping.

First, we need to choose a target website.

It might be an e-commerce site, a news site, a blog, or any other site whose content interests us.

Next, we need to decide exactly what information we want to scrape from the site.

That could be product information, news headlines, blog posts, and so on.

Once we have a target site and the information to scrape, we can start writing code.

Python is a common choice of programming language, because it offers many powerful libraries and tools.

One frequently used library is Beautiful Soup, which helps us parse web pages and extract the information we need.

When writing scraping code, we first use a suitable library to download the HTML content of the target page.

This is done by sending an HTTP request to the given URL and receiving the response.

Once we have the page's HTML, we can parse it with Beautiful Soup.

Beautiful Soup provides a variety of methods and properties for finding, locating, and extracting data in HTML.

It can locate elements by HTML tag name, by attribute, or with CSS selectors.

We can use these features to locate the specific information we need and pull it out.
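A small sketch of those steps with requests and Beautiful Soup is shown below; the target URL and the CSS class name are hypothetical placeholders, so the selector would need to be adapted to the real page.

# Downloading a page and extracting titles with Beautiful Soup (sketch).
import requests
from bs4 import BeautifulSoup

url = "https://example.com/blog"                  # hypothetical target site
html = requests.get(url, timeout=10).text         # step 1: fetch the HTML

soup = BeautifulSoup(html, "html.parser")         # step 2: parse it
titles = [h2.get_text(strip=True)                 # step 3: locate by tag / CSS selector
          for h2 in soup.select("h2.post-title")] # class name is a placeholder

for t in titles:
    print(t)                                      # step 4: save or process further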

After the data has been extracted, we can save it to a file or a database, or analyze and process it further.

That depends on our needs and goals.


外文资料ABSTRACTCrawling the web is deceptively simple: the basic algorithm is (a)Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.1. INTRODUCTIONA recent Pew Foundation study [31] states that “Search engines have become an indispensable utility for Internet users” and estimates that as of mid-2002, slightlyover 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from nonweb sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is(a) Fetch a page(b) Parse it to extract all linked URLs(c) For all the URLs not seen before, repeat (a)–(c)Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as , but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. 
Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS) – they are easy to implement and taught in many introductory algorithms classes. (See for instance [34]).However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.2. Web pages are changing rapidly. If “change” means “any change”, then about40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].These two factors imply that to obtain a reasonably fresh and 679 complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.A crucial way to speed up the membership test is to cache a (dynamic) subset of the “seen” URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries and show how these caches can be implemented very efficiently.The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.2. CRAWLINGWeb crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of thesecrawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.The original Google crawler, described in [7], implements the different crawler components as different processes. 
A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL Server. The various processes communicate via the file system.For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL’s host component. A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.Cho and Garcia-Molina’s crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called “C-procs”). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL’s host part), and hierarchical (assigning an URL to a C-proc based on some property of the URL, such as its top-level domain).The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the “ants”). An ant that discovers an URL for which it is not responsible, sends this URL to a dedicated process (the “controller”), which forwards the URL to the appropriate ant.UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multipleindependent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.Shkapenyuk and Suel’s crawler [35] is similar to Google’s; the different crawler components are implemented as different processes. A “crawling application” maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a “crawl manager”, which forwards them to a pool of “downloader” processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk and caching popular URLs in memory is a win: Caching allows the crawler to discard a large fraction of the URLs without having to consult thedisk-based set.Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler[4], and Cho and Molina’s crawler [13], are comprised of cooperating crawling processes, each of which downloads web pages, extracts their links, and sends these links to the peer crawling process responsible for it. 
However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL to a peer crawler goes a long way toward reducing transmissions to peer crawlers, as we show in the remainder of this paper.3. CACHINGIn most computer systems, memory is hierarchical, that is, there exist two or morelevels of memory, representing different tradeoffs between size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In a network environment, the hierarchy continues with network accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a faster memory. In the right circumstances, caching greatly improves the performance of the overall system and hence it is a fundamental technique in the design of operating systems, discussed at length in any standard textbook [21, 37]. In the web context, caching is often mentionedin the context of a web proxy caching web pages [26, Chapter 11]. In our web crawler context, since the number of visited URLs becomes too large to store in main memory, we store the collection of visited URLs on disk, and cache a small portion in main memory.Caching terminology is as follows: the cache is memory used to store equal sized atomic items. A cache has size k if it can store at most k items.1 At each unit of time, the cache receives a request for an item. If the requested item is in the cache, the situation is called a hit and no further action is needed. Otherwise, the situation is called a miss or a fault. If the cache has fewer than k items, the missed item is added to the cache. Otherwise, the algorithm must choose either to evict an item from the cache to make room for the missed item, or not to add the missed item. The caching policy or caching algorithm decides which item to evict. The goal of the caching algorithm is to minimize the number of misses.Clearly, the larger the cache, the easier it is to avoid misses. Therefore, the performance of a caching algorithm is characterized by the miss ratio for a given size cache. In general, caching is successful for two reasons:_ Non-uniformity of requests. Some requests are much more popular than others. In our context, for instance, a link to is a much more common occurrence than a link to the authors’ home pages._ Temporal correlation or locality of reference. Current requests are more likely to duplicate requests made in the recent past than requests made long ago. The latterterminology comes from the computer memory model – data needed now is likely to be close in the address space to data recently needed. In our context, temporal correlation occurs first because links tend to be repeated on the same page – we found that on average about 30% are duplicates, cf. Section 4.2, and second, because pages on a given host tend to be explored sequentially and they tend to share many links. For example, many pages on a Computer Science department server are likely to share links to other Computer Science departments in the world, notorious papers, etc.Because of these two factors, a cache that contains popular requests and recent requests is likely to perform better than an arbitrary cache. 
Caching algorithms try to capture this intuition in various ways. We now describe some standard caching algorithms, whose performance we evaluate in Section 5.

3.1 Infinite cache (INFINITE)

This is a theoretical algorithm that assumes that the size of the cache is larger than the number of distinct requests.

3.2 Clairvoyant caching (MIN)

More than 35 years ago, László Belady [2] showed that if the entire sequence of requests is known in advance (in other words, the algorithm is clairvoyant), then the best strategy is to evict the item whose next request is farthest away in time. This theoretical algorithm is denoted MIN because it achieves the minimum number of misses on any sequence and thus it provides a tight bound on performance.

3.3 Least recently used (LRU)

The LRU algorithm evicts the item in the cache that has not been requested for the longest time. The intuition for LRU is that an item that has not been needed for a long time in the past will likely not be needed for a long time in the future, and therefore the number of misses will be minimized in the spirit of Belady’s algorithm.

Despite the admonition that “past performance is no guarantee of future results”, sadly verified by the current state of the stock markets, in practice LRU is generally very effective. However, it requires maintaining a priority queue of requests. This queue has a processing time cost and a memory cost. The latter is usually ignored in caching situations where the items are large.

3.4 CLOCK

CLOCK is a popular approximation of LRU, invented in the late sixties [15]. An array of mark bits M0, M1, ..., Mk-1 corresponds to the items currently in the cache of size k. The array is viewed as a circle, that is, the first location follows the last. A clock handle points to one item in the cache. When a request X arrives, if the item X is in the cache, then its mark bit is turned on. Otherwise, the handle moves sequentially through the array, turning the mark bits off, until an unmarked location is found. The cache item corresponding to the unmarked location is evicted and replaced by X.

3.5 Random replacement (RANDOM)

Random replacement (RANDOM) completely ignores the past. If the item requested is not in the cache, then a random item from the cache is evicted and replaced. In most practical situations, random replacement performs worse than CLOCK, but not much worse. Our results exhibit a similar pattern, as we show in Section 5. RANDOM can be implemented without any extra space cost; see Section 6.

3.6 Static caching (STATIC)

If we assume that each item has a certain fixed probability of being requested, independently of the previous history of requests, then at any point in time the probability of a hit in a cache of size k is maximized if the cache contains the k items that have the highest probability of being requested.

There are two issues with this approach: the first is that in general these probabilities are not known in advance; the second is that the independence of requests, although mathematically appealing, is antithetical to the locality of reference present in most practical situations.

In our case, the first issue can be finessed: we might assume that the most popular k URLs discovered in a previous crawl are pretty much the k most popular URLs in the current crawl. (There are also efficient techniques for discovering the most popular items in a stream of data [18, 1, 11]. Therefore, an on-line approach might work as well.) Of course, for simulation purposes we can do a first pass over our input to determine the k most popular URLs, and then preload the cache with these URLs, which is what we did in our experiments.

The second issue above is the very reason we decided to test STATIC: if STATIC performs well, then the conclusion is that there is little locality of reference. If STATIC performs relatively poorly, then we can conclude that our data manifests substantial locality of reference, that is, successive requests are highly correlated.
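As an illustration of one of the policies above, here is a minimal Python sketch of CLOCK (our own simplified rendering; the paper's implementation, described later in Section 5.1, keeps a list plus a clock handle, and details such as whether a newly inserted item starts marked vary between variants):

class ClockCache:
    """CLOCK: a circular array of slots with one mark bit each.
    On a hit the item's mark bit is set; on a miss the handle sweeps
    forward, clearing mark bits, and evicts the first unmarked item."""

    def __init__(self, k):
        self.k = k
        self.slots = [None] * k      # cached items
        self.marks = [False] * k     # mark bits M0 .. Mk-1
        self.pos = {}                # item -> slot index
        self.hand = 0

    def request(self, item):
        """Return True on a hit, False on a miss."""
        i = self.pos.get(item)
        if i is not None:
            self.marks[i] = True
            return True
        # Miss: advance the handle until an unmarked (or empty) slot is found.
        while self.slots[self.hand] is not None and self.marks[self.hand]:
            self.marks[self.hand] = False
            self.hand = (self.hand + 1) % self.k
        victim = self.slots[self.hand]
        if victim is not None:
            del self.pos[victim]
        self.slots[self.hand] = item
        self.marks[self.hand] = False
        self.pos[item] = self.hand
        self.hand = (self.hand + 1) % self.k
        return False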
4. EXPERIMENTAL SETUP

We now describe the experiment we conducted to generate the crawl trace fed into our tests of the various algorithms. We conducted a large web crawl using an instrumented version of the Mercator web crawler [29]. We first describe the Mercator crawler architecture, and then report on our crawl.

4.1 Mercator crawler architecture

A Mercator crawling system consists of a number of crawling processes, usually running on separate machines. Each crawling process is responsible for a subset of all web servers, and consists of a number of worker threads (typically 500) responsible for downloading and processing pages from these servers.

Each worker thread repeatedly performs the following operations: it obtains a URL from the URL Frontier, which is a disk-based data structure maintaining the set of URLs to be downloaded; downloads the corresponding page using HTTP into a buffer (called a RewindInputStream or RIS for short); and, if the page is an HTML page, extracts all links from the page. The stream of extracted links is converted into absolute URLs and run through the URL Filter, which discards some URLs based on syntactic properties. For example, it discards all URLs belonging to web servers that contacted us and asked not to be crawled.

The URL stream then flows into the Host Splitter, which assigns URLs to crawling processes using a hash of the URL’s host name. Since most links are relative, most of the URLs (81.5% in our experiment) will be assigned to the local crawling process; the others are sent in batches via TCP to the appropriate peer crawling processes. Both the stream of local URLs and the stream of URLs received from peer crawlers flow into the Duplicate URL Eliminator (DUE). The DUE discards URLs that have been discovered previously. The new URLs are forwarded to the URL Frontier for future download.

In order to eliminate duplicate URLs, the DUE must maintain the set of all URLs discovered so far. Given that today’s web contains several billion valid URLs, the memory requirements to maintain such a set are significant. Mercator can be configured to maintain this set as a distributed in-memory hash table (where each crawling process maintains the subset of URLs assigned to it); however, this DUE implementation (which reduces URLs to 8-byte checksums, and uses the first 3 bytes of the checksum to index into the hash table) requires about 5.2 bytes per URL, meaning that it takes over 5 GB of RAM per crawling machine to maintain a set of 1 billion URLs per machine. These memory requirements are too steep in many settings, and in fact, they exceeded the hardware available to us for this experiment. Therefore, we used an alternative DUE implementation that buffers incoming URLs in memory, but keeps the bulk of URLs (or rather, their 8-byte checksums) in sorted order on disk. Whenever the in-memory buffer fills up, it is merged into the disk file (which is a very expensive operation due to disk latency) and newly discovered URLs are passed on to the Frontier.
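The following toy Python sketch illustrates the idea of such a buffered, disk-backed duplicate eliminator with a small cache in front of it; the class name, the thresholds, and the plain-text disk file are our own simplifications (Mercator actually stores sorted 8-byte checksums and merges far larger buffers):

import os

class DiskBackedDUE:
    """Toy duplicate-URL eliminator: incoming fingerprints accumulate in
    an in-memory buffer; when the buffer fills, it is merged into a sorted
    file on disk and only the fingerprints not already on disk are reported
    as new. A small cache in front absorbs popular duplicates before they
    ever reach the buffer."""

    def __init__(self, path, buffer_limit=1000, cache_limit=100):
        self.path = path
        self.buffer = set()
        self.buffer_limit = buffer_limit
        self.cache = set()                    # stand-in for LRU/CLOCK/etc.
        self.cache_limit = cache_limit

    def add(self, fp, emit_new):
        """`emit_new(fp)` is called (possibly later, at merge time)
        for every fingerprint seen for the first time."""
        if fp in self.cache:
            return                            # duplicate: dropped immediately
        if len(self.cache) < self.cache_limit:
            self.cache.add(fp)
        self.buffer.add(fp)
        if len(self.buffer) >= self.buffer_limit:
            self._merge(emit_new)

    def _merge(self, emit_new):
        # Expensive step: read the sorted disk file, merge in the buffer,
        # rewrite the file, and emit the fingerprints that were not on disk.
        on_disk = set()
        if os.path.exists(self.path):
            with open(self.path) as f:
                on_disk = {line.strip() for line in f}
        new = self.buffer - on_disk
        with open(self.path, "w") as f:
            for fp in sorted(on_disk | self.buffer):
                f.write(fp + "\n")
        self.buffer.clear()
        for fp in new:
            emit_new(fp)                      # these go on to the Frontier

The point of the front cache is visible in add(): a cached duplicate never reaches the buffer, so the costly merge runs less often.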
Both the disk-based DUE and the Host Splitter benefit from URL caching. Adding a cache to the disk-based DUE makes it possible to discard incoming URLs that hit in the cache (and thus are duplicates) instead of adding them to the in-memory buffer. As a result, the in-memory buffer fills more slowly and is merged less frequently into the disk file, thereby reducing the penalty imposed by disk latency. Adding a cache to the Host Splitter makes it possible to discard incoming duplicate URLs instead of sending them to the peer node, thereby reducing the amount of network traffic. This reduction is particularly important in a scenario where the individual crawling machines are not connected via a high-speed LAN (as they were in our experiment), but are instead globally distributed. In such a setting, each crawler would be responsible for web servers “close to it”.

Mercator performs an approximation of a breadth-first search traversal of the web graph. Each of the (typically 500) threads in each process operates in parallel, which introduces a certain amount of non-determinism to the traversal. More importantly, the scheduling of downloads is moderated by Mercator’s politeness policy, which limits the load placed by the crawler on any particular web server. Mercator’s politeness policy guarantees that no server ever receives multiple requests from Mercator in parallel; in addition, it guarantees that the next request to a server will only be issued after a multiple (typically 10×) of the time it took to answer the previous request has passed. Such a politeness policy is essential to any large-scale web crawler; otherwise the crawler’s operator becomes inundated with complaints.
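A bare-bones Python rendering of such a per-server politeness rule might look as follows; the factor of 10 mirrors the description above, while the class and method names are our own:

import time
from collections import defaultdict

class PolitenessGate:
    """Tracks, per web server, the earliest time the next request may be
    issued: a multiple of how long the previous request took. Assumes a
    separate mechanism already guarantees that at most one request per
    server is in flight at a time."""

    def __init__(self, factor=10.0):
        self.factor = factor
        self.not_before = defaultdict(float)   # host -> monotonic timestamp

    def may_fetch(self, host):
        return time.monotonic() >= self.not_before[host]

    def record_fetch(self, host, elapsed_seconds):
        # The next request to this host may only go out after
        # factor x the time the previous request took.
        self.not_before[host] = time.monotonic() + self.factor * elapsed_seconds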
4.2 Our web crawl

Our crawling hardware consisted of four Compaq XP1000 workstations, each one equipped with a 667 MHz Alpha processor, 1.5 GB of RAM, 144 GB of disk, and a 100 Mbit/sec Ethernet connection. The machines were located at the Palo Alto Internet Exchange, quite close to the Internet’s backbone.

The crawl ran from July 12 until September 3, 2002, although it was actively crawling only for 33 days: the downtimes were due to various hardware and network failures. During the crawl, the four machines performed 1.04 billion download attempts, 784 million of which resulted in successful downloads. 429 million of the successfully downloaded documents were HTML pages. These pages contained about 26.83 billion links, equivalent to an average of 62.55 links per page; however, the median number of links per page was only 23, suggesting that the average is inflated by some pages with a very high number of links.

Earlier studies reported only an average of 8 links [9] or 17 links per page [33]. We offer three explanations as to why we found more links per page. First, we configured Mercator to not limit itself to URLs found in anchor tags, but rather to extract URLs from all tags that may contain them (e.g. image tags). This configuration increases both the mean and the median number of links per page. Second, we configured it to download pages up to 16 MB in size (a setting that is significantly higher than usual), making it possible to encounter pages with tens of thousands of links. Third, most studies report the number of unique links per page, whereas the numbers above include duplicate copies of a link on a page. If we only consider unique links per page, then the average number of links is 42.74 and the median is 17.

The links extracted from these HTML pages, plus about 38 million HTTP redirections that were encountered during the crawl, flowed into the Host Splitter. In order to test the effectiveness of various caching algorithms, we instrumented Mercator’s Host Splitter component to log all incoming URLs to disk. The Host Splitters on the four crawlers received and logged a total of 26.86 billion URLs.

After completion of the crawl, we condensed the Host Splitter logs. We hashed each URL to a 64-bit fingerprint [32, 8]. Fingerprinting is a probabilistic technique; there is a small chance that two URLs have the same fingerprint. We made sure there were no such unintentional collisions by sorting the original URL logs and counting the number of unique URLs. We then compared this number to the number of unique fingerprints, which we determined using an in-memory hash table on a very-large-memory machine. This data reduction step left us with four condensed host splitter logs (one per crawling machine), ranging from 51 GB to 57 GB in size and containing between 6.4 and 7.1 billion URLs.

In order to explore the effectiveness of caching with respect to inter-process communication in a distributed crawler, we also extracted a sub-trace of the Host Splitter logs that contained only those URLs that were sent to peer crawlers. These logs contained 4.92 billion URLs, or about 19.5% of all URLs. We condensed the sub-trace logs in the same fashion. We then used the condensed logs for our simulations.

5. SIMULATION RESULTS

We studied the effects of caching with respect to two streams of URLs:

1. A trace of all URLs extracted from the pages assigned to a particular machine. We refer to this as the full trace.

2. A trace of all URLs extracted from the pages assigned to a particular machine that were sent to one of the other machines for processing. We refer to this trace as the cross subtrace, since it is a subset of the full trace.

The reason for exploring both these choices is that, depending on other architectural decisions, it might make sense to cache only the URLs to be sent to other machines or to use a separate cache just for this purpose.

We fed each trace into implementations of each of the caching algorithms described above, configured with a wide range of cache sizes. We performed about 1,800 such experiments. We first describe the algorithm implementations, and then present our simulation results.

5.1 Algorithm implementations

The implementation of each algorithm is straightforward. We use a hash table to find each item in the cache. We also keep a separate data structure of the cache items, so that we can choose one for eviction. For RANDOM, this data structure is simply a list. For CLOCK, it is a list and a clock handle, and the items also contain “mark” bits. For LRU, it is a heap, organized by last access time. STATIC needs no extra data structure, since it never evicts items. MIN is more complicated, since for each item in the cache, MIN needs to know when the next request for that item will be. We therefore describe MIN in more detail.

Let A be the trace or sequence of requests, that is, At is the item requested at time t. We create a second sequence Nt containing the time when At next appears in A. If there is no further request for At after time t, we set Nt = ∞. Formally,

Nt = min { t′ : t′ > t and At′ = At },  and Nt = ∞ if no such t′ exists.

To generate the sequence Nt, we read the trace A backwards, that is, from tmax down to 0, and use a hash table with key At and value t. For each item At, we probe the hash table. If it is not found, we set Nt = ∞ and store (At, t) in the table. If it is found, we retrieve (At, t′), set Nt = t′, and replace (At, t′) by (At, t) in the hash table.

Given Nt, implementing MIN is easy: we read At and Nt in parallel, and hence for each item requested, we know when it will be requested next. We tag each item in the cache with the time when it will be requested next, and if necessary, evict the item with the highest value for its next request, using a heap to identify it quickly.
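A possible Python sketch of this precomputation and of the MIN simulation itself is shown below (our own rendering of the procedure described above; it uses float('inf') for ∞ and lazy deletion in the heap to handle stale entries):

import heapq

def next_use_times(trace):
    """Read the trace backwards and record, for each position t, when the
    item trace[t] is requested next (infinity if never again)."""
    n = len(trace)
    nxt = [float("inf")] * n
    last_seen = {}                       # item -> next time it appears
    for t in range(n - 1, -1, -1):
        item = trace[t]
        nxt[t] = last_seen.get(item, float("inf"))
        last_seen[item] = t
    return nxt

def min_misses(trace, k):
    """Simulate Belady's clairvoyant MIN policy on `trace` with cache size k."""
    nxt = next_use_times(trace)
    in_cache = set()
    heap = []                            # (-next_use, item): farthest next use pops first
    next_use = {}                        # item -> its currently valid next-use time
    misses = 0
    for t, item in enumerate(trace):
        next_use[item] = nxt[t]
        if item in in_cache:
            heapq.heappush(heap, (-nxt[t], item))   # refresh priority lazily
            continue
        misses += 1
        if len(in_cache) >= k:
            # Evict the cached item whose next request is farthest away,
            # skipping stale heap entries left over from earlier refreshes.
            while True:
                neg_next, victim = heapq.heappop(heap)
                if victim in in_cache and -neg_next == next_use[victim]:
                    in_cache.discard(victim)
                    break
        in_cache.add(item)
        heapq.heappush(heap, (-nxt[t], item))
    return misses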
5.2 Results

We present the results for only one crawling host. The results for the other three hosts are quasi-identical. Figure 2 shows the miss rate over the entire trace (that is, the percentage of misses out of all requests to the cache) as a function of the size of the cache.
