Building your own search engine with Nutch on Windows
Nutch Windows Installation Guide
--By Liming Liu
1 Install Cygwin
2 Install JDK
3 Install Tomcat
4 Pre-Install Nutch
5 Configure and run Nutch
6 Begin search
7 References
1 Install Cygwin
Download and install the latest version of Cygwin. Be sure to select GCC when choosing packages.
2 Install JDK
Download jdk-1_5_0_06-windows-i586-p.exe and install it (default location: C:/Program Files/Java/jdk1.5.0_06).
Set these environment variables:
NUTCH_JAVA_HOME: C:/Program Files/Java/jdk1.5.0_06
JAVA_HOME: C:/Program Files/Java/jdk1.5.0_06
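Because Nutch is started from the Cygwin shell later in this guide, the same two variables can also be exported there. The following is a minimal sketch, assuming the default JDK location given above; check_java_dir is a hypothetical helper written for this example and is not part of Nutch or Cygwin.

```shell
# Assumption: the JDK sits in the default location mentioned above.
# The value is quoted because the path contains a space.
export JAVA_HOME="C:/Program Files/Java/jdk1.5.0_06"
export NUTCH_JAVA_HOME="$JAVA_HOME"

# Hypothetical helper: succeeds only if the directory given as $1
# contains bin/java or bin/java.exe.
check_java_dir() {
  [ -n "$1" ] && { [ -e "$1/bin/java" ] || [ -e "$1/bin/java.exe" ]; }
}

if check_java_dir "$NUTCH_JAVA_HOME"; then
  echo "JDK found at $NUTCH_JAVA_HOME"
else
  echo "No JDK at $NUTCH_JAVA_HOME; check the install path" >&2
fi
```

If the check fails, the most likely cause is a JDK installed somewhere other than the default directory.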
3 Install Tomcat
Download apache-tomcat-6.0.13.exe and install it (default location: C:/Program Files/Apache Software Foundation/Tomcat 6.0). Note the port, user name, and password you choose during installation.
4 Pre-Install Nutch
Download nutch-0.9.tar.gz and extract it to a directory named nutch-0.9 (for example, C:/dev/search/netch/nutch-0.9).
Start the Tomcat service and open http://localhost:8080/manager/html.
Go to “WAR file to deploy” and upload the file C:/dev/search/netch/nutch-0.9/nutch-0.9.war.
Optionally, stop the Tomcat service, rename the directory “ROOT” in “C:/Program Files/Apache Software Foundation/Tomcat 6.0/webapps” to “ROOT-backup”, and rename the directory “nutch-0.9” in the same folder to “ROOT”. This step can be skipped.
5 Configure and run Nutch
Create a directory “urls” in “C:/dev/search/netch/nutch-0.9”.
Create a file “testurlfile” in the directory “urls”.
Add the line “” (the seed URL for the crawl) to the file “testurlfile”.
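The directory and seed file can also be created from the Cygwin shell. This is a sketch only: the seed URL is missing from this guide, so http://example.com/ below is a hypothetical placeholder to be replaced with the site you want to crawl.

```shell
# Hypothetical seed URL (not from this guide); replace it with your own.
SEED_URL="http://example.com/"

# Run this from the Nutch directory (C:/dev/search/netch/nutch-0.9 in this guide).
mkdir -p urls

# One URL per line; '>' overwrites any earlier seed list.
printf '%s\n' "$SEED_URL" > urls/testurlfile

# Show the result for a quick visual check.
cat urls/testurlfile
```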
Open the file “C:/dev/search/netch/nutch-0.9/conf/crawl-urlfilter.txt” and replace “” with “”.
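Both strings in this step are missing from this guide, so the exact edit cannot be reproduced here. As a labeled assumption: the Nutch 0.8/0.9 tutorial ships crawl-urlfilter.txt with an accept rule containing the placeholder MY.DOMAIN.NAME, and the usual edit is to put your own domain there. The sketch below shows that kind of substitution on a throwaway copy; set_crawl_domain and example.com are hypothetical names used only for illustration.

```shell
# Hypothetical helper: replace the MY.DOMAIN.NAME placeholder in filter
# file $1 with the domain given as $2.
# Assumption: the file uses that placeholder, as in the Nutch tutorial.
set_crawl_domain() {
  sed "s/MY\.DOMAIN\.NAME/$2/" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Demonstration on a temporary file, so no real configuration is touched.
demo=$(mktemp)
printf '%s\n' '+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/' > "$demo"
set_crawl_domain "$demo" "example.com"
cat "$demo"
```

To apply it to the real file, pass conf/crawl-urlfilter.txt as the first argument after making a backup copy.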
Open the file “C:/dev/search/netch/nutch-0.9/conf/nutch-site.xml” and edit it to read:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name></name>
    <value>nutch</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
    NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
    and set their values appropriately.
    </description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>liming agent.description</value>
    <description>Further description of our bot- this text is used in
    the User-Agent header. It appears in parenthesis after the agent name.
    </description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value></value>
    <description>A URL to advertise in the User-Agent header. This will
    appear in parenthesis after the agent name. Custom dictates that this
    should be a URL of a page explaining the purpose and behavior of this
    crawler.
    </description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>agent.email</value>
    <description>An email address to advertise in the HTTP 'From' request
    header and User-Agent header. A good practice is to mangle this
    address (e.g. 'info at example dot com') to avoid spamming.
    </description>
  </property>
</configuration>
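The name of the first property is blank in the listing above. Its description matches the http.agent.name property in Nutch's default configuration, so the sketch below assumes that name; treat it as an assumption, not something this guide confirms. get_property is a hypothetical helper that reads a value from a file laid out like the listing, with each name and value element on its own line.

```shell
# Hypothetical helper: print the <value> of property $2 in config file $1.
# Assumes <name> and <value> each sit on their own line, as in the listing.
get_property() {
  awk -v want="$2" '
    /<name>/ {
      n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n)
      found = (n == want)
    }
    found && /<value>/ {
      v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
      print v; exit
    }
  ' "$1"
}

# Run from the Nutch directory; the check is skipped if the file is absent.
conf="conf/nutch-site.xml"
if [ -f "$conf" ]; then
  agent=$(get_property "$conf" "http.agent.name")   # assumed property name
  if [ -n "$agent" ]; then
    echo "http.agent.name = $agent"
  else
    echo "http.agent.name is empty or missing in $conf" >&2
  fi
fi
```

This is a line-based check rather than a real XML parser, so it is only suitable for quick sanity checks on files formatted like the one above.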
Find the configuration file under “C:/Program Files/Apache Software Foundation/Tomcat 6.0/webapps/ROOT/WEB-INF/classes/” and edit it to read:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>C:/dev/search/netch/nutch-0.9/crawl.demo</value>
  </property>
</configuration>
Open the file “C:/Program Files/Apache Software Foundation/Tomcat 6.0/conf/server.xml” and edit the “<Connector port="8080"…/>” element as follows (the URIEncoding="UTF-8" attribute makes Tomcat decode request URIs, including query strings, as UTF-8):
<Connector port="8080" maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
           enableLookups="false" redirectPort="8443" acceptCount="100" debug="0"
           connectionTimeout="20000" disableUploadTimeout="true" URIEncoding="UTF-8"/>
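Start the Tomcat service.
Start Cygwin, cd to “C:/dev/search/netch/nutch-0.9”, and run:
bin/nutch crawl urls -dir crawl.demo -depth 2 -topN 50
Here urls is the seed directory, -dir names the directory that receives the crawl output, -depth is the link depth to follow from the seed pages, and -topN is the maximum number of pages fetched at each level.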
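Before searching, it can help to confirm that the crawl actually produced output. The sketch below assumes the crawl finished and created the subdirectories a Nutch 0.9 crawl typically writes (crawldb, linkdb, segments, indexes, index); that list is an assumption and is not stated in this guide. check_crawl_dir is a hypothetical helper.

```shell
# Hypothetical helper: succeed only if directory $1 contains every
# subdirectory a finished crawl is assumed to create.
check_crawl_dir() {
  for sub in crawldb linkdb segments indexes index; do
    if [ ! -d "$1/$sub" ]; then
      echo "missing: $1/$sub" >&2
      return 1
    fi
  done
  return 0
}

# Run from the Nutch directory after the crawl command has finished.
if check_crawl_dir "crawl.demo"; then
  echo "crawl.demo looks complete"
else
  echo "crawl.demo is incomplete; review the crawl output" >&2
fi
```

The directory checked here is the same one that searcher.dir points to, so a failure also explains an empty result page.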
6 Begin search
Open http://localhost:8080 (or http://localhost:8080/nutch) in Internet Explorer. You should see a working search page where you can enter queries.
7 References
/topic/81627 Nutch 0.8 in Practice (1), X.D.Hua
/club/simple/index.php?t312.html Nutch on WinXP, Kevin
/pwlazy/archive/2006/08/23/1109868.aspx A First Look at Nutch 0.8 on Windows, pwlazy