Installing tomcat+solr+mysql+nutch2.2 on Debian

Author: gaoqiangz  Date: 2013-10-03 10:19:31
1、Key references
http://yangshangchuan.iteye.com/blog/1837935
http://blog.csdn.net/okman1214/article/details/8831274
http://wiki.apache.org/nutch/NutchTutorial#Set_up_from_the_source_distribution
http://nlp.solutions.asia/?p=362
http://blog.csdn.net/weijonathan/article/details/9197697
http://blog.csdn.net/wokagoka/article/details/8581874
2、Installing and configuring the JDK on Ubuntu Linux
(1) Install JDK 1.6.
Download the Linux JDK from the Sun website (www.sun.com) at
http://java.sun.com/javase/downloads/index.jsp
The file is jdk-6u45-linux-x64.bin. Servers are usually 64-bit, so a virtual machine should be 64-bit as well.
(2) cd to the directory that contains the JDK and copy it to /usr:
cp jdk-6u45-linux-x64.bin /usr
Then change to /usr and make the installer executable:
cd /usr
chmod +x jdk-6u45-linux-x64.bin
(3) Run the installer:
./jdk-6u45-linux-x64.bin
(4) Set the environment variables, preferably in a startup file (vi /root/.bashrc or vi /etc/profile):
export JAVA_HOME=/usr/jdk1.6.0_45
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
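3、Install and configure MySQL, preferably version 5.5 or later (follow an installation guide for MySQL 5.5+), then apply the configuration below.
3.1、Run vi /etc/my.cnf and add the following under [mysqld]:
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
Also change max_allowed_packet to:
max_allowed_packet=500M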
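MySQL collation names are prefixed with the name of their character set, so the collation-server value has to belong to the character-set-server value. The following is a minimal sketch of a check for that, assuming a simple my.cnf without !include directives; the helper name check_mysqld_charset and the sample text are illustrative only.

```python
import configparser

def check_mysqld_charset(cnf_text):
    """Return (charset, collation, ok) for the [mysqld] section of a my.cnf text."""
    # allow_no_value lets bare flags such as "skip-networking" parse cleanly.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(cnf_text)
    section = parser["mysqld"]
    charset = section.get("character-set-server")
    collation = section.get("collation-server")
    # A collation such as utf8mb4_unicode_ci starts with its character set name.
    ok = (charset is not None and collation is not None
          and collation.startswith(charset + "_"))
    return charset, collation, ok

SAMPLE = """
[mysqld]
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M
"""

if __name__ == "__main__":
    print(check_mysqld_charset(SAMPLE))
```

A utf8_unicode_ci collation combined with a utf8mb4 character set would be reported as not ok by this check.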
3.2、The nutch database has to be set up manually, because the db schema currently
generated by Nutch/Gora/MySQL defaults to latin1. Log in to mysql at the command line
with the MySQL user and password you set up earlier:
/usr/local/mysql/bin/mysql -u root -p
3.3、At the mysql prompt, type the following:
CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;
Press Enter.
3.4、Then select the database:
use nutch;
3.5、Copy and paste the following as a single statement:
CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` longtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`prevModifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;
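Press Enter. Make sure the table name matches the configuration used later in this guide. Note: the columns of the table follow the Nutch conf file gora-sql-mapping.xml.
The JDBC URL of this database, used in step 6, is:
jdbc:mysql://127.0.0.1:3306/nutch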
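The 767 in varchar(767) is tied to InnoDB's index key limits. As a rough worked check, assuming the limits documented for MySQL 5.5/5.6 (767 bytes per index key by default, 3072 bytes with innodb_large_prefix and a DYNAMIC or COMPRESSED row format) and up to 4 bytes per utf8mb4 character; the function names are illustrative:

```python
# Byte limits for a single-column InnoDB index key (MySQL 5.5/5.6 documentation values).
DEFAULT_KEY_LIMIT = 767      # without innodb_large_prefix
LARGE_PREFIX_LIMIT = 3072    # innodb_large_prefix=true plus ROW_FORMAT=DYNAMIC or COMPRESSED
BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}

def max_key_bytes(length, charset):
    """Worst-case size in bytes of an indexed varchar(length) column."""
    return length * BYTES_PER_CHAR[charset]

def fits(length, charset, large_prefix):
    """True if the column can be indexed in full under the chosen limit."""
    limit = LARGE_PREFIX_LIMIT if large_prefix else DEFAULT_KEY_LIMIT
    return max_key_bytes(length, charset) <= limit

if __name__ == "__main__":
    print(max_key_bytes(767, "utf8mb4"))               # 3068
    print(fits(767, "utf8mb4", large_prefix=True))     # True
    print(fits(767, "utf8mb4", large_prefix=False))    # False
```

Under these assumptions the utf8mb4 primary key on id only fits because step 3.1 enables innodb_large_prefix and the table uses ROW_FORMAT=COMPRESSED.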
4、Download Nutch 2.2.1
    Download a source package apache-nutch-2.2.1-src.tar.gz
    tar zxvf apache-nutch-2.2.1-src.tar.gz
    cd apache-nutch-2.2.1
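5、From inside the nutch folder, make sure the MySQL dependency for Nutch is available
by editing the following in ${APACHE_NUTCH_HOME}/ivy/ivy.xml
change
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
to
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
and uncomment the gora-sql
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default"/>
and uncomment the mysql connector
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>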
6、Edit the ${APACHE_NUTCH_HOME}/conf/gora.properties file either deleting or
commenting out the Default SqlStore Properties using #. Then add the MySQL
properties below replacing xxxxx with the user and password you set up when
installing MySQL earlier.
###############################
# MySQL properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://127.0.0.1:3306/nutch
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=XXXXXX
Note: nutch in the jdbc.url value is the name of the MySQL database. You can use a different database name, provided you have created that database in MySQL.
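To confirm which database gora.properties points at, the file can be read as plain key=value lines. The sketch below assumes that simple format (no line continuations); load_properties and jdbc_database are illustrative helper names, not part of Gora or Nutch.

```python
from urllib.parse import urlparse

def load_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def jdbc_database(url):
    """Return (host, port, database) from a jdbc:mysql:// URL."""
    # Drop the "jdbc:" prefix so urlparse sees an ordinary mysql:// URL.
    parsed = urlparse(url[len("jdbc:"):])
    return parsed.hostname, parsed.port, parsed.path.lstrip("/")

SAMPLE = """
###############################
# MySQL properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://127.0.0.1:3306/nutch
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=XXXXXX
"""

if __name__ == "__main__":
    props = load_properties(SAMPLE)
    print(jdbc_database(props["gora.sqlstore.jdbc.url"]))
```

The database name returned here should be the one created in step 3.3.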
Then edit the ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml file, changing the length of the primary key from 512 to 767 in both places.

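7、Configure ${APACHE_NUTCH_HOME}/conf/nutch-site.xml: put a name in the value
field under http.agent.name. It can be anything but cannot be left blank. Add
additional languages if you want (for example Japanese, ja-jp) and set utf-8 as
the default encoding as well. You must specify SqlStore as the data store class.
<configuration>
<property>
 <name>generate.batch.id</name>
 <value>servst</value>
</property>
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
</property>
</configuration>
The nutch-site.xml file must be saved in UTF-8 encoding.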
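The properties above can also be generated rather than typed by hand, which makes it easy to enforce the two requirements stated in this step: a non-blank http.agent.name and UTF-8 output. A minimal sketch, assuming Python 3.8+ (for the xml_declaration argument); build_nutch_site is an illustrative helper, not a Nutch tool.

```python
import xml.etree.ElementTree as ET

PROPERTIES = {
    "generate.batch.id": "servst",
    "http.agent.name": "My Nutch Spider",
    "parser.character.encoding.default": "utf-8",
    "storage.data.store.class": "org.apache.gora.sql.store.SqlStore",
}

def build_nutch_site(properties):
    """Return nutch-site.xml content as UTF-8 encoded bytes."""
    if not properties.get("http.agent.name", "").strip():
        raise ValueError("http.agent.name must not be blank")
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # encoding="utf-8" makes tostring return bytes, ready for a binary write.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    print(build_nutch_site(PROPERTIES).decode("utf-8"))
```

Writing the returned bytes with open("conf/nutch-site.xml", "wb") keeps the file in UTF-8 regardless of the system locale.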
8、Build
    apt-get install ant # run this only if ant is not installed yet
    ant # run ant in the apache-nutch-2.2.1 folder
    The runtime/local directory now contains a built, ready-to-use Nutch; config files should be modified in apache-nutch/runtime/local/conf/
    ant clean will remove this directory (keep copies of modified config files)
    Download mysql-connector-java-5.1.26.tar.gz, extract it, and copy mysql-connector-java-5.1.26-bin.jar to /root/apache-nutch-2.2.1/runtime/local/lib/
9、Test the built Nutch
  cd ${APACHE_NUTCH_HOME}/runtime/local
  mkdir -p urls
  cd urls
  touch seed.txt # create a text file seed.txt under urls/ with the following content (one URL per line for each site you want Nutch to crawl)
  http://www.sina.com/
  cd .. # back to runtime/local
  vi conf/regex-urlfilter.txt and replace
# accept anything else
+.
with a regular expression matching the domain you wish to crawl.
For example, to limit the crawl to the sina.com domain, the line should read:
 +^http://([a-z0-9]*\.)*sina.com/
This will include any URL in the domain sina.com.
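The backslash before the dot in the subdomain group matters. The sketch below uses Python's re module as a stand-in for the Java regex engine that Nutch uses (the two agree on this pattern); the sample URLs and the accepted helper are made up for illustration.

```python
import re

# The filter line without its leading "+", which only marks the rule as "accept".
ACCEPT = re.compile(r"^http://([a-z0-9]*\.)*sina.com/")

# The same pattern with the backslash lost: "." now matches any character.
LOOSE = re.compile(r"^http://([a-z0-9]*.)*sina.com/")

def accepted(url, pattern=ACCEPT):
    """True if the URL would match the accept rule."""
    return pattern.search(url) is not None

if __name__ == "__main__":
    for url in ("http://www.sina.com/",
                "http://news.sina.com/index.html",
                "http://nutch.apache.org/",
                "http://xsina.com/"):
        print(url, accepted(url), accepted(url, LOOSE))
```

With the escaped dot, http://xsina.com/ is rejected; with the unescaped version it slips through, because the stray "." can consume the leading "x".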
10、Run
Before running, make sure MySQL has been started (/usr/local/mysql/bin/mysqld --user=mysql).
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Sometimes a crawl subfolder has to be created under runtime/local first: mkdir crawl
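11、Troubleshooting
Check nutch_log.log and hadoop.log in the logs directory to find the cause of an error.
Errors you are likely to hit may already be listed at
http://blog.csdn.net/weijonathan/article/details/9197697
The most common error is a null value at line 100 of GeneratorReducer, which reads:
batchId = new Utf8(conf.get(GeneratorJob.BATCH_ID));
This line fetches GeneratorJob.BATCH_ID, that is, the generate.batch.id value, and the error occurs because that value is null.
Solutions:
Method 1: add a generate.batch.id property to nutch-site.xml with any non-empty value. This is not ideal, because in the source code batchId is generated from a random number, and other parts of the code may depend on that.
Method 2: modify the public Map<String,Object> run(Map<String,Object> args) method in GeneratorJob
and add the following three lines:
    // generate batchId
    int randomSeed = Math.abs(new Random().nextInt());
    String batchId = (curTime / 1000) + "-" + randomSeed;
    getConf().set(BATCH_ID, batchId);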
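The resulting batch id has the form "<seconds since the epoch>-<random non-negative integer>". As an illustration of that format only (this is a Python analogue, not Nutch code; make_batch_id and its parameters are hypothetical):

```python
import random
import time

def make_batch_id(cur_time_ms=None, rng=random):
    """Build an id shaped like the Java snippet: (curTime / 1000) + "-" + randomSeed."""
    if cur_time_ms is None:
        cur_time_ms = int(time.time() * 1000)  # milliseconds, like curTime in the Java code
    # randrange(2**31) stays within the non-negative range of a Java int.
    random_seed = rng.randrange(2**31)
    return f"{cur_time_ms // 1000}-{random_seed}"

if __name__ == "__main__":
    print(make_batch_id())
```

Because the id changes on every run, each generate step marks its own batch, which is what the fixed value from Method 1 gives up.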
12、Test again
Start crawling. You will want to write your own script later, but to see what is happening, type the following commands one at a time:

bin/nutch inject urls
bin/nutch generate -topN 20
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
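Once the manual run works, the five commands above can be wrapped in a small script. This is a minimal sketch, assuming it is pointed at the runtime/local directory; crawl_steps and run_round are illustrative names, and the runner parameter exists only so the sequence can be exercised without Nutch installed.

```python
import subprocess

def crawl_steps(seed_dir="urls", top_n=20):
    """Return the bin/nutch commands for one crawl round, in order."""
    return [
        ["bin/nutch", "inject", seed_dir],
        ["bin/nutch", "generate", "-topN", str(top_n)],
        ["bin/nutch", "fetch", "-all"],
        ["bin/nutch", "parse", "-all"],
        ["bin/nutch", "updatedb"],
    ]

def run_round(nutch_home, seed_dir="urls", top_n=20, runner=subprocess.run):
    """Run one round from nutch_home, stopping at the first failing step."""
    for cmd in crawl_steps(seed_dir, top_n):
        print("running:", " ".join(cmd))
        # check=True raises CalledProcessError if a step exits non-zero.
        runner(cmd, cwd=nutch_home, check=True)
```

For example, run_round("/root/apache-nutch-2.2.1/runtime/local") performs one round against the installation path used in step 8; calling it in a loop gives a deeper crawl.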
13、Check your crawl results by looking at the webpage table in the nutch database.
Log in to mysql, then run:
use nutch;
SELECT * FROM nutch.webpage;

select * from webpage where status = 2;
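14、Install Solr 4.4
14.1、Download solr-4.4.0.zip; do not download solr-4.4.0-src.tgz.
14.2、apt-get install unzip
   unzip solr-4.4.0.zip
14.3、cd ${APACHE_SOLR_HOME}/example
   java -jar start.jar
14.4、Open http://localhost:8983/solr in a browser to check that Solr has started. From another machine, use the server's address, here http://118.228.40.99:8983/solr
14.5、bin/nutch solrindex http://118.228.40.99:8983/solr/ -reindex
14.6、Open the following address in a browser to query the index:
   http://118.228.40.99:8983/solr/#/collection1/query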
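The same query can be made without the admin UI through Solr's select handler. A minimal sketch, assuming the default collection1 core from the Solr 4.4 example and JSON output (wt=json); the helper names and the localhost base URL are placeholders to adapt.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def solr_select_url(base="http://localhost:8983/solr", core="collection1",
                    query="*:*", rows=10):
    """Build a select URL such as .../solr/collection1/select?q=...&rows=...&wt=json."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/{core}/select?{params}"

def count_indexed(base="http://localhost:8983/solr", core="collection1"):
    """Ask Solr how many documents match *:* (needs a running Solr)."""
    with urlopen(solr_select_url(base, core, rows=0)) as resp:
        return json.load(resp)["response"]["numFound"]

if __name__ == "__main__":
    print(solr_select_url())
```

A count of zero right after step 14.5 suggests that the solrindex run did not send any documents.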
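15、If you want to use Tomcat as the web server instead, you do not need to start Solr as in step 14 (the solr-4.4.0.zip download is still needed, because solr.war comes from its example/webapps directory). Install it as follows.
Reference:
http://blog.csdn.net/zhyh1986/article/details/9856115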
15.1、Download apache-tomcat-7.0.42.tar.gz
   tar zxvf apache-tomcat-7.0.42.tar.gz
   mv apache-tomcat-7.0.42 /usr/local/tomcat
   export TOMCAT_HOME=/usr/local/tomcat
   /usr/local/tomcat/bin/startup.sh
   Open http://118.228.40.99:8080 in a browser to check that the installation succeeded.
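15.2、Unpack solr.war from example/webapps into Tomcat's webapps directory. Alternatively, copy solr.war itself into Tomcat's webapps, start Tomcat so that it unpacks the archive, and then delete the war file.
15.3、For the second option:
mv solr.war /usr/local/tomcat/webapps/
Once Tomcat has unpacked it, the application is served at
http://118.228.40.99:8080/solr
15.4、vi solr/WEB-INF/web.xml and set the Solr home directory:
   <env-entry>
       <env-entry-name>solr/home</env-entry-name>
       <env-entry-value>/usr/local/tomcat/webapps/solr/solr_home</env-entry-value>
       <env-entry-type>java.lang.String</env-entry-type>
   </env-entry>
15.5、Create the solr_home directory:
copy the contents of example/solr to /usr/local/tomcat/webapps/solr/solr_home
15.6、Copy the jars from example/lib/ext in the Solr distribution into Tomcat's lib directory.
      Restart Tomcat and open http://118.228.40.99:8080/solr in a browser.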
16、Good luck
Source: ITPUB blog, http://blog.itpub.net/22959803/viewspace-1119583/. If you repost this article, credit the source; otherwise legal action may be taken.
