nutch_1.2

Nutch Lab Report
1.1 Installing Nutch
1.1.1 Environment
Linux: Ubuntu 10.10 running in a VMware virtual machine.

1.1.2 Required software
1. JDK 1.6. The package used in this experiment is jdk-6u24-linux-i586.bin.

/technetwork/java/javase/downloads/index.html
2. Tomcat (apache-tomcat-7.0). The package used in this experiment is apache-tomcat-7.0.11.tar.gz.

/
3. Nutch (apache-nutch-1.2). The package used in this experiment is apache-nutch-1.2-bin.tar.gz.

/
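1.1.3 Installation procedure
1. Install the JDK
1) Copy the downloaded JDK .bin file into the virtual machine, for example into /home/root.
2) Change to the directory that contains the JDK .bin file and run the installer.

Install command:
[root@ubuntu:/home/root]# sh jdk-6u24-linux-i586.bin
3) Set the environment variables
Edit the ~/.bashrc file:
[root@ubuntu:~]# vi ~/.bashrc
Append the following settings at the end:
export JAVA_HOME=/home/root/jdk1.6.0_24
export JAVA_BIN=/home/root/jdk1.6.0_24/bin
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
Note: other guides found online put these settings in /etc/environment, but after doing so and rebooting, it was no longer possible to log in to the system.

4) Check that the JDK is installed
Run:
[root@ubuntu:~]# java -version
If the JDK version information is printed, the installation succeeded.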
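As a complement to java -version, a small shell function can confirm that JAVA_HOME points at a directory that really contains the java launcher. This is only a minimal sketch: check_java_home is a hypothetical helper name introduced here, not something shipped with the JDK.

```shell
# Hypothetical helper: report whether a directory looks like a usable JDK home.
# It only checks that <dir>/bin/java exists and is executable.
check_java_home() {
    dir="$1"
    if [ -x "$dir/bin/java" ]; then
        echo "ok: $dir/bin/java is executable"
        return 0
    fi
    echo "missing: $dir/bin/java" >&2
    return 1
}
```

After reloading ~/.bashrc, it could be called as check_java_home "$JAVA_HOME" before running java -version.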

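2. Install Nutch
1) Copy the downloaded apache-nutch-1.2-bin.tar.gz into the virtual machine and extract it in the directory where it is stored; in this example the file is in /home/root.

Command:
[root@ubuntu:/home/root]# tar zxvf apache-nutch-1.2-bin.tar.gz
2) Rename the extracted directory to nutch.

Command:
[root@ubuntu:/home/root]# mv <extracted directory name> nutch
3) Test the nutch command from inside the nutch directory.

Command:
[root@ubuntu:/home/root/nutch]# bin/nutch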
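The rename step above depends on knowing which directory the archive unpacks into. One way to find out without guessing is to list the archive and take the first path component. The sketch below assumes the archive has a single top-level directory and that its entries do not start with "./"; top_dir_of is a hypothetical helper name.

```shell
# Hypothetical helper: print the top-level directory name inside a .tar.gz.
# Assumes every entry lives under one top-level directory.
top_dir_of() {
    tar tzf "$1" | head -n 1 | cut -d/ -f1
}
```

With it, the rename could be written as mv "$(top_dir_of apache-nutch-1.2-bin.tar.gz)" nutch.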
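3. Install Tomcat
1) Copy the downloaded apache-tomcat-7.0.11.tar.gz into the virtual machine and extract it in the directory where it is stored; in this example the file is in /home/root.

Command:
[root@ubuntu:/home/root]# tar zvxf apache-tomcat-7.0.11.tar.gz
2) Rename the extracted directory to tomcat.

Command:
[root@ubuntu:/home/root]# mv <extracted directory name> tomcat
3) Copy the .war file shipped with Nutch into Tomcat's webapps directory, replacing the default ROOT application.

Command:
[root@ubuntu:/home/root]# cd tomcat/webapps
[root@ubuntu:/home/root/tomcat/webapps]# rm -rf ROOT*
[root@ubuntu:/home/root/tomcat/webapps]# cp ../../nutch/nutch*.war ROOT.war
4) Start and stop Tomcat with the scripts in its bin directory (both call catalina.sh).

Command:
[root@ubuntu:/home/root/tomcat/bin]# ./startup.sh
[root@ubuntu:/home/root/tomcat/bin]# ./shutdown.sh
5) With Tomcat running, open http://localhost:8080 in a browser (8080 is Tomcat's default port).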
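The cp command in step 3) relies on the nutch*.war glob matching exactly one file. If it matches none or several, the copy to ROOT.war fails or does something unintended. A hedged sketch of a pre-check follows; single_war is a hypothetical helper name, and the behaviour assumes a POSIX shell with default globbing.

```shell
# Hypothetical helper: print the one nutch*.war found in a directory,
# or fail if there is not exactly one match.
single_war() {
    dir="$1"
    # Re-use the positional parameters to hold the glob matches.
    set -- "$dir"/nutch*.war
    if [ "$#" -eq 1 ] && [ -f "$1" ]; then
        echo "$1"
        return 0
    fi
    echo "expected exactly one nutch*.war in $dir" >&2
    return 1
}
```

From the webapps directory this could be used as war=$(single_war ../../nutch) && cp "$war" ROOT.war.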

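1.2 Nutch crawl experiment
1.2.1 A simple crawl
1. Configure Nutch
1) In the nutch directory, add the seed URL to crawl. This experiment uses the 163 site as the seed.

Command:
[root@ubuntu:/home/root/nutch]# mkdir urls
[root@ubuntu:/home/root/nutch]# echo />>urls/163
This command creates the file 163 in the urls directory (if it does not exist yet) and appends the / entry to it.

2) Edit conf/crawl-urlfilter.txt in the nutch directory to restrict which URLs are crawled.

Command:
[root@ubuntu:/home/root/nutch]# gedit conf/crawl-urlfilter.txt
Change the rule to:
# accept hosts in
+^http://([a-z0-9]*\.)*/
3) Edit conf/nutch-default.xml in the nutch directory to set the search directory. Here the directory is named crawl. Any name can be used, but it must match the Tomcat setting below.

Command:
[root@ubuntu:/home/root/nutch]# gedit conf/nutch-default.xml
Change the search directory in the file:

<property>
<name>searcher.dir</name>
<value>crawl</value>
</property>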
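The URL filter rule from step 2) can be tried out before a crawl by testing sample URLs against the same pattern. The sketch below uses grep -E as a stand-in for Nutch's own regex engine, which is close enough for a pattern this simple but is not how Nutch itself evaluates the rule. The domain example.com and the name url_accepted are placeholders, not values from the experiment.

```shell
# Body of an accept rule from crawl-urlfilter.txt, without the leading '+'.
# example.com is a placeholder domain chosen for illustration.
FILTER='^http://([a-z0-9]*\.)*example\.com/'

# Hypothetical helper: succeed if the URL would match the accept rule.
url_accepted() {
    printf '%s\n' "$1" | grep -Eq "$FILTER"
}
```

For instance, url_accepted "http://www.example.com/" succeeds, while a URL on another domain or one using https:// does not match.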
2. Configure Tomcat
1) Set the search directory. It must point to the same directory as the one set in nutch-default.xml. This setting is made under the tomcat directory.

Command:
[root@ubuntu:/home/root/tomcat]# gedit webapps/ROOT/WEB-INF/classes/nutch-site.xml
Add or modify the following:
<configuration>
<property>
<name>searcher.dir</name>
<value>/home/root/nutch/crawl</value>
</property>
</configuration>
2) Configure the agent
Edit conf/nutch-site.xml, add the agent properties, and set their values.

<configuration>
<property>
<name></name>
<value>my nutch agent</value>
</property>
<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>
</configuration>
3) Garbled Chinese characters
Nutch's support for Chinese is not yet complete, so conf/server.xml under the tomcat directory has to be modified:
[root@ubuntu:/home/root/tomcat]# vi conf/server.xml
Add the two attributes on the last line so that the Connector reads:
<Connector port="8080"
maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
enableLookups="false" redirectPort="8443" acceptCount="100"
connectionTimeout="20000" disableUploadTimeout="true"
URIEncoding="UTF-8" useBodyEncodingForURI="true" />
(The file is long; just search for Connector port="8080".)
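Because the search only works when searcher.dir is set correctly in the web application, it can help to print the value actually stored in a config file. The following is a rough sketch using sed, assuming the name and value elements sit on consecutive lines as in the snippets above. A real XML parser would be more robust, and searcher_dir is a hypothetical helper name.

```shell
# Hypothetical helper: print the <value> on the line that follows
# <name>searcher.dir</name>. Prints nothing if the value line is malformed.
searcher_dir() {
    sed -n '/<name>searcher\.dir<\/name>/{n;s:.*<value>\(.*\)</value>.*:\1:p;}' "$1"
}
```

An empty result from searcher_dir webapps/ROOT/WEB-INF/classes/nutch-site.xml would suggest that the property is missing or that a tag is mistyped, for example an unclosed value element.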
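3. Start crawling the site, using the 163 site as the example.

1) Run the crawl command from the nutch directory.

Command:
[root@ubuntu:/home/root/nutch]# bin/nutch crawl urls -dir crawl -depth 2 -threads 4 -topN 50 >& crawl.log
urls is the directory that holds the file with the 163 seed URL.
-dir crawl is the directory where the crawled pages are stored; it corresponds to the search directory set in section 1.2.1.
-depth is the crawl depth. For testing purposes a depth of 2 is used here; a full crawl would typically use about 10.
-threads is the number of concurrent fetcher threads, set to 4 here.
-topN is the maximum number of pages fetched at each level of depth. A full crawl might use 10,000 to 1,000,000, depending on how much content the site has.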
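Since each level of depth fetches at most topN pages, depth multiplied by topN gives a simple upper bound on the number of pages a crawl can fetch. A small arithmetic sketch follows; max_fetched is a hypothetical helper name.

```shell
# Hypothetical helper: upper bound on pages fetched by a crawl,
# given the -depth and -topN values.
max_fetched() {
    depth="$1"
    top_n="$2"
    echo $((depth * top_n))
}

max_fetched 2 50   # prints 100
```

The actual number is usually lower, because the first round can only fetch the seed URLs that were injected.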
1.3 Nutch crawl results
1.3.1 Results
1. crawl.log contains the following records.

crawl started in: crawl.demo3
rootUrlDir = urls
threads = 4
depth = 2
indexer=lucene
topN = 50
Injector: starting
Injector: crawlDb: crawl.demo3/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl.demo3/segments/20110311092949
Generator: done.
Fetcher: Your '' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl.demo3/segments/20110311092949
Fetcher: threads: 4
QueueFeeder finished: total 1 records + hit by time limit :0
fetching /
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.demo3/crawldb
CrawlDb update: segments: [crawl.demo3/segments/20110311092949]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl.demo3/segments/20110311093014
Generator: done.
Fetcher: Your '' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl.demo3/segments/20110311093014
Fetcher: threads: 4
QueueFeeder finished: total 50 records + hit by time limit :0
fetching /
fetching /bill/s2010/jiedong/mkt/neitui/4501050708.swf
fetching /
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=46
fetching /bill/s2011/lilixing/dianxin/4501050211.swf
Error parsing: /bill/s2010/jiedong/mkt/neitui/4501050708.swf: failed(2,0): Can't retrieve Tika parser for mime-type application/x-shockwave-flash
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=44
Error parsing: /bill/s2011/lilixing/dianxin/4501050211.swf: failed(2,0): Can't retrieve Tika parser for mime-type application/x-shockwave-flash
fetching /
fetching /
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=41
fetching /
fetching /videoshow/33533.html
fetching /
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=37
fetching /
fetching /
fetching /
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=33
fetching /
fetching /
fetching /
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=29
fetching /
fetching /
fetching /
fetching /
fetching /20110303/n279634880.shtml
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=24
fetching /
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=22
fetching /
fetching /
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=19
fetching /
fetching /passport/pp18030_31.js
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=17
fetching /
fetching /mobile.shtml
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=15
fetching /
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=13
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=12
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=12
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=12
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=11
fetching /
fetching /
fetching /
fetching /
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=6
fetching /
fetching /20110302/n279605838.shtml
fetching /));163Flash1.write(
fetching /passport/pi18030.201011300952.js
fetching /
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=1
* queue:
maxThreads = 1
inProgress = 0
crawlDelay = 1000
minCrawlDelay = 0
nextFetchTime = 1299807018652
now = 1299807036805
0. /
fetching /
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.demo3/crawldb
CrawlDb update: segments: [crawl.demo3/segments/20110311093014]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl.demo3/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/root/nutch/crawl.demo3/segments/20110311093014
LinkDb: adding segment: file:/home/root/nutch/crawl.demo3/segments/20110311092949
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl.demo3/indexes
Dedup: done
merging indexes to: crawl.demo3/index
Adding file:/home/root/nutch/crawl.demo3/indexes/part-00000
done merging
crawl finished: crawl.demo3
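A log of this size is easier to judge from a short summary than by reading it line by line. The sketch below counts fetch attempts and parse errors in a crawl log, assuming one log record per line and the "fetching" and "Error parsing:" prefixes seen in the log above. summarize_crawl_log is a hypothetical helper name.

```shell
# Hypothetical helper: count fetch attempts and parse errors in a crawl log.
summarize_crawl_log() {
    log="$1"
    # grep -c exits non-zero when the count is 0, so guard with "|| true".
    fetched=$(grep -c '^fetching ' "$log" || true)
    errors=$(grep -c '^Error parsing:' "$log" || true)
    echo "fetched=$fetched parse_errors=$errors"
}
```

It could be run as summarize_crawl_log crawl.log after the crawl finishes.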
2. Browser test.

Enter 163 in the search box and run the search; the results are as follows:
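1.4 Lessons learned and solutions to problems
1.4.1 Solutions to problems

1. JDK environment configuration

Edit the ~/.bashrc file:
[root@ubuntu:~]# vi ~/.bashrc
Append the following settings at the end:
export JAVA_HOME=/home/root/jdk1.6.0_24
export JAVA_BIN=/home/root/jdk1.6.0_24/bin
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
Note: other guides found online put these settings in /etc/environment, but after doing so and rebooting, it was no longer possible to log in to the system, and the operating system eventually had to be reinstalled.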

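2. Nutch
1) The following two edits are easy to forget; without them no search results appear.
[root@ubuntu:/home/root/tomcat]# gedit webapps/ROOT/WEB-INF/classes/nutch-site.xml
[root@ubuntu:/home/root/nutch]# gedit conf/nutch-default.xml
Change the search directory in the file:

<property>
<name>searcher.dir</name>
<value>crawl</value>
</property>
2) Set the agent
If the agent is not configured, the crawl fails with an error saying that no agent can be found.

3) The index directory name must not contain digits
This experiment uses crawl as the index directory name. A directory name that contains digits also caused errors.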

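1.4.2 Lessons learned
Installing and configuring the software involves many tedious steps, and overlooking any one of them can prevent the whole experiment from proceeding.

The best way found to check the installation and configuration of each component is to read its log files, locate the problem there, and fix it.

This summary only briefly covers installing and configuring Nutch and running a crawl. Nutch offers many more commands and crawling methods that remain to be studied.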
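The habit of checking log files described above can be supported with a small filter that lists suspicious lines together with their line numbers. This is a sketch under assumptions: the keywords ERROR, Exception and failed are a guess at what is worth flagging, and scan_log is a hypothetical helper name.

```shell
# Hypothetical helper: print numbered lines that look like problems in a log.
scan_log() {
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 1
    fi
    grep -nE 'ERROR|Exception|failed' "$1" || echo "no problems found in $1"
}
```

Typical targets would be crawl.log from the crawl command, logs/hadoop.log under the nutch directory, and logs/catalina.out under the tomcat directory. The last two are the usual default locations and may differ on other setups.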