Best-Practice Guide to Installing and Using Nutch and Related Frameworks
A practical, step-by-step guide to installing and using Nutch, originally written in Chinese.
Watch online (Tudou):
Original HD video download:
Compressed HD video download:

Contents:
Part 1: nutch1.2
Part 2: nutch1.5.1
Part 3: nutch2.0
Part 4: Configuring SSH
Part 5: Installing a Hadoop cluster (pseudo-distributed mode) and running Nutch
Part 6: Installing a Hadoop cluster (fully distributed mode) and running Nutch
Part 7: Configuring Ganglia to monitor the Hadoop and HBase clusters
Part 8: Configuring Snappy compression for Hadoop
Part 9: Configuring LZO compression for Hadoop
Part 10: Configuring a ZooKeeper cluster to run HBase
Part 11: Configuring an HBase cluster to run nutch-2.1 (region servers can crash because of memory problems)
Part 12: Configuring an Accumulo cluster to run nutch-2.1 (there is a bug in Gora here)
Part 13: Configuring a Cassandra cluster to run nutch-2.1 (Cassandra uses a decentralized architecture)
Part 14: Configuring a standalone MySQL server to run nutch-2.1
Part 15: Using DataFileAvroStore as the data store for nutch2.1
Part 16: Using AvroStore as the data store for nutch2.1
Part 17: Configuring SOLR
Part 18: Nagios monitoring
Part 19: Configuring Splunk
Part 20: Configuring Pig
Part 21: Configuring Hive
Part 22: Configuring a Hadoop 2.x cluster

Part 1: nutch1.2

The steps are largely the same as in Part 2. In step 5 (configure the build path) two extra operations are needed:
In the Package Explorer on the left, right-click the nutch1.2 folder > Build Path > Configure Build Path... > select the Source tab > Default output folder: change nutch1.2/bin to nutch1.2/_bin.
In the Package Explorer on the left, right-click the bin folder under nutch1.2 > Team > Revert.

In Part 2, the yellow-highlighted parts are version-number differences, the red parts do not exist in version 1.2, and the green parts differ. The differences for 1.2 are:
1. Add JARs... > nutch1.2 > lib, select all .jar files > OK
2. The URL filter file is crawl-urlfilter.txt
3. Rename crawl-urlfilter.txt.template to crawl-urlfilter.txt
4. Edit crawl-urlfilter.txt, adapting its default rules
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
-.
to the domain you want to crawl.
5. cd /home/ysc/workspace/nutch1.2

nutch1.2 is a complete search engine, whereas nutch1.5.1 is only a crawler. nutch1.2 can submit its index to SOLR or generate a LUCENE index directly; nutch1.5.1 can only submit its index to SOLR:
1. cd /home/ysc
2. wget
3. tar -xvf apache-tomcat-7.0.29.tar.gz
4. In the Package Explorer on the left, right-click the build.xml file under the nutch1.2 folder > Run As > Ant Build... > select the war target > Run
5. cd /home/ysc/workspace/nutch1.2/build
6. unzip nutch-1.2.war -d nutch-1.2
7. cp -r nutch-1.2 /home/ysc/apache-tomcat-7.0.29/webapps
8. vi /home/ysc/apache-tomcat-7.0.29/webapps/nutch-1.2/WEB-INF/classes/nutch-site.xml and add the following configuration:
<property> <name>searcher.dir</name> <value>/home/ysc/workspace/nutch1.2/data</value> <description> Path to root of crawl. This directory is searched (in order) for either the file search-servers.txt, containing a list of distributed search servers, or the directory "index" containing merged indexes, or the directory "segments" containing segment indexes. </description> </property>
9. vi /home/ysc/apache-tomcat-7.0.29/conf/server.xml and change
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443"/>
to
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" URIEncoding="utf-8"/>
10. cd /home/ysc/apache-tomcat-7.0.29/bin
11. ./startup.sh
12. Visit:
For more nutch1.2 bug fixes and related material, see the resources I published on CSDN:
Part 2: nutch1.5.1

1. Download and unpack Eclipse (the IDE)
Download address: , choose Eclipse IDE for Java EE Developers
2. Install the Subclipse plugin (SVN client)
Plugin update sites:
3. Install the IvyDE plugin (to download the dependency JARs)
Plugin update site:
4. Check out the code
File > New > Project > SVN > Checkout Projects from SVN
Create a new repository location > URL: > select the URL > Finish
In the New Project wizard, choose Java Project > Next, enter Project name: nutch1.5.1 > Finish
5. Configure the build path
In the Package Explorer on the left, right-click the nutch1.5.1 folder > Build Path > Configure Build Path... > select the Source tab > select src > Remove > Add Folder... > select src/bin, src/java, src/test and src/testresources (for the plugins, also select the src/java and src/test folders under every plugin directory below src/plugin) > OK
Switch to the Libraries tab > Add Class Folder... > select nutch1.5.1/conf > OK
Add JARs... > select the jar files in the lib directory of every plugin directory below src/plugin > OK
Add Library... > IvyDE Managed Dependencies > Next > Main > Ivy File > Browse > ivy/ivy.xml > Finish
Switch to the Order and Export tab > select conf > Top
6. Run ANT
In the Package Explorer on the left, right-click the build.xml file under nutch1.5.1 > Run As > Ant Build
Right-click the nutch1.5.1 folder > Refresh
Right-click the nutch1.5.1 folder > Build Path > Configure Build Path... > select the Libraries tab > Add Class Folder... > select build > OK
7. Edit the configuration files nutch-site.xml and regex-urlfilter.txt
Rename nutch-site.xml.template to nutch-site.xml
Rename regex-urlfilter.txt.template to regex-urlfilter.txt
Right-click the nutch1.5.1 folder > Refresh
Add the following properties to nutch-site.xml:
<property> <name>http.agent.name</name> <value>nutch</value> </property>
<property> <name>http.content.limit</name> <value>-1</value> </property>
Edit regex-urlfilter.txt and replace
# accept anything else
+.
with:
+^http://([a-z0-9]*\.)*news.163.com/
-.
8. Development and debugging
Right-click the nutch1.5.1 folder > New > Folder > Folder name: urls
Inside the new urls directory create a text file named url whose content is:
Open the org.apache.nutch.crawl.Crawl.java class under src/java, right-click Run As > Run Configurations > Arguments > in the Program arguments box enter: urls -dir data -depth 3 > Run
Set breakpoints where needed and use Debug As > Java Application
9. Inspect the results
Inspect the segments directory:
Open the org.apache.nutch.segment.SegmentReader.java class under src/java
Right-click Run As > Java Application; the console prints the usage of this command
Right-click Run As > Run Configurations > Arguments > in the Program arguments box enter: -dump data/segments/* data/segments/dump
Open data/segments/dump/dump in a text editor to view the information stored in the segments
Inspect the crawldb directory:
Open the org.apache.nutch.crawl.CrawlDbReader.java class under src/java
Right-click Run As > Java Application; the console prints the usage of this command
Right-click Run As > Run Configurations > Arguments > in the Program arguments box enter: data/crawldb -stats
The console prints the crawldb statistics
Inspect the linkdb directory:
Open the org.apache.nutch.crawl.LinkDbReader.java class under src/java
Right-click Run As > Java Application; the console prints the usage of this command
Right-click Run As > Run Configurations > Arguments > in the Program arguments box enter: data/linkdb -dump data/linkdb_dump
Open data/linkdb_dump/part-00000 in a text editor to view the information stored in the linkdb
10. Whole-web crawling, step by step
In the Package Explorer on the left, right-click the build.xml file under nutch1.5.1 > Run As > Ant Build
cd /home/ysc/workspace/nutch1.5.1/runtime/local
# prepare the URL list
wget
gunzip content.rdf.u8.gz
mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/url
# inject the URLs
bin/nutch inject crawl/crawldb dmoz
# generate a fetch list
bin/nutch generate crawl/crawldb crawl/segments
# first crawl round
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
# fetch the pages
bin/nutch fetch $s1
# parse the pages
bin/nutch parse $s1
# update the URL status
bin/nutch updatedb crawl/crawldb $s1
# second crawl round
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2
# third crawl round
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch parse $s3
bin/nutch updatedb crawl/crawldb $s3
# build the inverted link database
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
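The three rounds above repeat the same generate/fetch/parse/updatedb cycle and differ only in the segment variable, so they can be wrapped in a small shell loop. The following sketch is my own addition, assuming the runtime/local working directory and the crawl/ layout used above (note that it applies -topN 1000 to every round, whereas the first round above runs generate without -topN):

#!/bin/sh
# run several generate/fetch/parse/updatedb rounds against the crawl/ directory
ROUNDS=3
for i in $(seq 1 $ROUNDS); do
    # generate a fetch list limited to the top 1000 URLs
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    # pick up the segment that was just created
    s=$(ls -d crawl/segments/2* | tail -1)
    echo "round $i: $s"
    bin/nutch fetch $s
    bin/nutch parse $s
    bin/nutch updatedb crawl/crawldb $s
done
# build the inverted link database once all rounds are finished
bin/nutch invertlinks crawl/linkdb -dir crawl/segments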
11. Indexing and searching
cd /home/ysc/
wget
tar -xvf apache-solr-3.6.1.tgz
cd apache-solr-3.6.1/example
NUTCH_RUNTIME_HOME=/home/ysc/workspace/nutch1.5.1/runtime/local
APACHE_SOLR_HOME=/home/ysc/apache-solr-3.6.1
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
If the page content should be stored in the index, change in schema.xml
<field name="content" type="text" stored="false" indexed="true"/>
to
<field name="content" type="text" stored="true" indexed="true"/>
Edit ${APACHE_SOLR_HOME}/example/solr/conf/solrconfig.xml and replace every <str name="df">text</str> with <str name="df">content</str>
In ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml, change <schema name="nutch" version="1.5.1"> to <schema name="nutch" version="1.5">
# start the SOLR server
java -jar start.jar
cd /home/ysc/workspace/nutch1.5.1/runtime/local
# submit the index to SOLR
bin/nutch solrindex crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
Run a complete crawl:
bin/nutch crawl urls -dir data -depth 2 -topN 100 -solr
Use the following query to page through all indexed documents: *%3A*&version=2.2&start=0&rows=10&indent=on
Documents whose title contains "网易" (NetEase):
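As an illustration of the kind of query URL meant here, and assuming the embedded Jetty server started with java -jar start.jar on its default port 8983, equivalent queries could be issued with curl as follows (the title query string is only an example):

# page through all indexed documents, 10 at a time
curl "http://localhost:8983/solr/select?q=*%3A*&version=2.2&start=0&rows=10&indent=on"
# documents whose title contains 网易 (URL-encoded as %E7%BD%91%E6%98%93)
curl "http://localhost:8983/solr/select?q=title:%E7%BD%91%E6%98%93&start=0&rows=10&indent=on"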
12. Inspect the index
cd /home/ysc/
wget
java -jar lukeall-3.5.0.jar
Path: /home/ysc/apache-solr-3.6.1/example/solr/data
13. Configure Chinese word segmentation for SOLR
cd /home/ysc/
wget
unzip mmseg4j-1.8.5.zip -d mmseg4j-1.8.5
APACHE_SOLR_HOME=/home/ysc/apache-solr-3.6.1
mkdir $APACHE_SOLR_HOME/example/solr/lib
mkdir $APACHE_SOLR_HOME/example/solr/dic
cp mmseg4j-1.8.5/mmseg4j-all-1.8.5.jar $APACHE_SOLR_HOME/example/solr/lib
cp mmseg4j-1.8.5/data/*.dic $APACHE_SOLR_HOME/example/solr/dic
In ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml, replace
<tokenizer class="solr.WhitespaceTokenizerFactory"/> and <tokenizer class="solr.StandardTokenizerFactory"/>
with
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/home/ysc/apache-solr-3.6.1/example/solr/dic"/>
# restart the SOLR server
java -jar start.jar
# rebuild the index; this shows how to do it from the development environment
Open the org.apache.nutch.indexer.solr.SolrIndexer.java class under src/java
Right-click Run As > Java Application; the console prints the usage of this command
Right-click Run As > Run Configurations > Arguments > in the Program arguments box enter: data/crawldb -linkdb data/linkdb data/segments/*
Reopen the index with luke and you will see that the Chinese tokenizer is now in effect
Part 3: nutch2.0

nutch2.0 follows the same steps as nutch1.5.1 in Part 2, but before step 8 (development and debugging) the following configuration is needed:
In the Package Explorer on the left, right-click the nutch2.0 folder > New > Folder > Folder name: data, and choose one of the following data stores:
1. Use MySQL as the data store
1) Add the following to nutch2.0/conf/nutch-site.xml:
<property> <name>storage.data.store.class</name> <value>org.apache.gora.sql.store.SqlStore</value> </property>
2) In nutch2.0/conf/gora.properties, change
gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
gora.sqlstore.jdbc.user=sa
gora.sqlstore.jdbc.password=
to
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://127.0.0.1:3306/nutch2
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=ROOT
3) Enable the mysql-connector-java dependency in nutch2.0/ivy/ivy.xml
4) sudo apt-get install mysql-server
2. Use HBase as the data store
1) Add the following to nutch2.0/conf/nutch-site.xml:
<property> <name>storage.data.store.class</name> <value>org.apache.gora.hbase.store.HBaseStore</value> </property>
2) Enable the gora-hbase dependency in nutch2.0/ivy/ivy.xml
3) cd /home/ysc
4) wget
5) tar -xvf hbase-0.90.5.tar.gz
6) vi hbase-0.90.5/conf/hbase-site.xml and add:
<property> <name>hbase.rootdir</name> <value>file:///home/ysc/hbase-0.90.5-database</value> </property>
7) hbase-0.90.5/bin/start-hbase.sh
8) Add /home/ysc/hbase-0.90.5/hbase-0.90.5.jar to the eclipse build path
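If the MySQL store is chosen, the nutch2 database referenced in the JDBC URL above must exist before the first run (this URL has no createDatabaseIfNotExist flag). A minimal sketch of my own for creating it with a UTF-8 default character set, assuming the root/ROOT credentials from gora.properties:

# create the database that gora.sqlstore.jdbc.url points at
mysql -uroot -pROOT -e "CREATE DATABASE IF NOT EXISTS nutch2 DEFAULT CHARACTER SET utf8;"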
Part 4: Configuring SSH

Three machines: devcluster01, devcluster02, devcluster03. Perform the following on each of them:
1. sudo vi /etc/hosts and add:
192.168.1.1 devcluster01
192.168.1.2 devcluster02
192.168.1.3 devcluster03
2. Install the SSH service: sudo apt-get install openssh-server
3. ssh-keygen -t rsa (press Enter at every prompt)
This command creates a .ssh directory in the user's home directory containing two files: id_rsa, the private key generated with the RSA algorithm, which must be kept safe and never disclosed; and id_rsa.pub, the matching public key, which may be shared.
4. cp .ssh/id_rsa.pub .ssh/authorized_keys
Merge the contents of /home/ysc/.ssh/authorized_keys from all three machines into one file and use it to replace /home/ysc/.ssh/authorized_keys on every machine.
When running on devcluster01, the two commands below target hosts 02 and 03;
when running on devcluster02, they target 01 and 03;
when running on devcluster03, they target 01 and 02.
5. ssh-copy-id -i .ssh/id_rsa.pub ysc@devcluster02
6. ssh-copy-id -i .ssh/id_rsa.pub ysc@devcluster03
These two commands append the .ssh/id_rsa.pub public key to the .ssh/authorized_keys file in the user's home directory on the remote host.
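Once the keys have been exchanged, passwordless login can be verified from each machine with a non-interactive command; this small check is my own addition and assumes the ysc user and the host names above:

# should print each hostname without prompting for a password
for h in devcluster01 devcluster02 devcluster03; do
    ssh -o BatchMode=yes ysc@$h hostname
done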
Part 5: Installing a Hadoop cluster (pseudo-distributed mode) and running Nutch

The steps are largely the same as in Part 6 (the fully distributed install). Only one machine, devcluster01, is needed, so set all the highlighted host names to devcluster01 and skip step 11.
Part 6: Installing a Hadoop cluster (fully distributed mode) and running Nutch

Three machines: devcluster01, devcluster02, devcluster03 (set the names via vi /etc/hostname).
Log in to devcluster01 as user ysc:
1. cd /home/ysc
2. wget
3. tar -xvf hadoop-1.1.1-bin.tar.gz
4. cd hadoop-1.1.1
5. vi conf/masters and replace the contents with:
devcluster01
6. vi conf/slaves and replace the contents with:
devcluster02
devcluster03
7. vi conf/core-site.xml and add:
<property> <name>fs.default.name</name> <value>hdfs://devcluster01:9000</value> <description> Where to find the Hadoop Filesystem through the network. Note 9000 is not the default port. (This is slightly changed from previous versions which didn't have "hdfs") </description> </property>
<property> <name>hadoop.security.authorization</name> <value>true</value> </property>
Edit conf/hadoop-policy.xml accordingly.
8. vi conf/hdfs-site.xml and add:
<property> <name>dfs.name.dir</name> <value>/home/ysc/dfs/filesystem/name</value> </property>
<property> <name>dfs.data.dir</name> <value>/home/ysc/dfs/filesystem/data</value> </property>
<property> <name>dfs.replication</name> <value>1</value> </property>
<property> <name>dfs.block.size</name> <value>671088640</value> <description>The default block size for new files.</description> </property>
9. vi conf/mapred-site.xml and add:
<property> <name>mapred.job.tracker</name> <value>devcluster01:9001</value> <description> The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. Note 9001 is not the default port. </description> </property>
<property> <name>mapred.reduce.tasks.speculative.execution</name> <value>false</value> <description>If true, then multiple instances of some reduce tasks may be executed in parallel.</description> </property>
<property> <name>mapred.map.tasks.speculative.execution</name> <value>false</value> <description>If true, then multiple instances of some map tasks may be executed in parallel.</description> </property>
<property> <name>mapred.child.java.opts</name> <value>-Xmx2000m</value> </property>
<property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>4</value> <description> set to the number of CPU cores on each host </description> </property>
<property> <name>mapred.map.tasks</name> <value>4</value> </property>
<property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>4</value> <description> set based on the number of slave hosts; a good value is the number of slave hosts plus the number of cores per host </description> </property>
<property> <name>mapred.reduce.tasks</name> <value>4</value> <description> set based on the number of slave hosts; a good value is the number of slave hosts plus the number of cores per host </description> </property>
<property> <name>mapred.output.compression.type</name> <value>BLOCK</value> <description>If the job outputs are to compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK. </description> </property>
<property> <name>mapred.output.compress</name> <value>true</value> <description>Should the job outputs be compressed? </description> </property>
<property> <name>mapred.compress.map.output</name> <value>true</value> <description>Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression. </description> </property>
<property> <name>mapred.system.dir</name> <value>/home/ysc/mapreduce/system</value> </property>
<property> <name>mapred.local.dir</name> <value>/home/ysc/mapreduce/local</value> </property>
10. vi conf/hadoop-env.sh and append:
export JAVA_HOME=/home/ysc/jdk1.7.0_05
export HADOOP_HEAPSIZE=2000
# replace the default garbage collector, which causes more waiting in multi-threaded environments
export HADOOP_OPTS="-server -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
11. Copy the Hadoop files to the slaves:
scp -r /home/ysc/hadoop-1.1.1 ysc@devcluster02:/home/ysc/hadoop-1.1.1
scp -r /home/ysc/hadoop-1.1.1 ysc@devcluster03:/home/ysc/hadoop-1.1.1
12. sudo vi /etc/profile, append the following and reboot:
export PATH=/home/ysc/hadoop-1.1.1/bin:$PATH
13. Format the namenode and start the cluster:
hadoop namenode -format
start-all.sh
14. cd /home/ysc/workspace/nutch1.5.1/runtime/deploy
mkdir urls
echo > urls/url
hadoop dfs -put urls urls
bin/nutch crawl urls -dir data -depth 2 -topN 100
15. Visit the JobTracker, TaskTracker and NameNode web pages to check their status; the NameNode page also lets you browse the distributed file system and its logs.
16. Stop the cluster with stop-all.sh
17. If the NameNode and the SecondaryNameNode are not on the same machine, add the following to conf/hdfs-site.xml on the SecondaryNameNode:
<property> <name>dfs.http.address</name> <value>namenode:50070</value> </property>
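Before moving on it is worth checking that all datanodes registered and that HDFS is writable. This is a small verification sketch of my own, not part of the original steps; it assumes the PATH set in step 12 so that the hadoop script is directly available:

# list the datanodes that have joined the cluster and their capacity
hadoop dfsadmin -report
# round-trip a small file through HDFS
echo "hello hdfs" > /tmp/hello.txt
hadoop fs -put /tmp/hello.txt /tmp/hello.txt
hadoop fs -cat /tmp/hello.txt
hadoop fs -rm /tmp/hello.txt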
Part 7: Configuring Ganglia to monitor the Hadoop and HBase clusters

1. Server side (installed on the master, devcluster01)
1) ssh devcluster01
2) addgroup ganglia
adduser --ingroup ganglia ganglia
3) sudo apt-get install ganglia-monitor ganglia-webfront gmetad
// Note: on Ubuntu 10.04 the ganglia-webfront package is called ganglia-webfrontend
// If the install fails, run sudo apt-get update; if the update fails, remove the failing source path
4) vi /etc/ganglia/gmond.conf
Find setuid = yes and change it to setuid = no; then find name in the cluster block and change it to name = "hadoop-cluster"
5) sudo apt-get install rrdtool
6) vi /etc/ganglia/gmetad.conf
Add data sources for the other two monitored nodes:
data_source "hadoop-cluster" devcluster01:8649 devcluster02:8649 devcluster03:8649
gridname "Hadoop"
2. Data-source side (installed on all slaves)
1) ssh devcluster02
addgroup ganglia
adduser --ingroup ganglia ganglia
sudo apt-get install ganglia-monitor
2) ssh devcluster03
addgroup ganglia
adduser --ingroup ganglia ganglia
sudo apt-get install ganglia-monitor
3) ssh devcluster01
scp /etc/ganglia/gmond.conf devcluster02:/etc/ganglia/gmond.conf
scp /etc/ganglia/gmond.conf devcluster03:/etc/ganglia/gmond.conf
3. Configure the web front end
1) ssh devcluster01
2) sudo ln -s /usr/share/ganglia-webfrontend /var/www/ganglia
3) vi /etc/apache2/apache2.conf and add:
ServerName devcluster01
4. Restart the services
1) ssh devcluster02
sudo /etc/init.d/ganglia-monitor restart
ssh devcluster03
sudo /etc/init.d/ganglia-monitor restart
2) ssh devcluster01
sudo /etc/init.d/ganglia-monitor restart
sudo /etc/init.d/gmetad restart
sudo /etc/init.d/apache2 restart
5. Open the page http://devcluster01/ganglia
6. Integrate Hadoop
1) ssh devcluster01
2) cd /home/ysc/hadoop-1.1.1
3) vi conf/hadoop-metrics2.properties
# for versions later than 0.20, use ganglia31
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
# default for supportsparse is false
*.sink.ganglia.supportsparse=true
*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
# multicast address; this is the default — use the same value everywhere (only the multicast address 239.2.11.71 works)
namenode.sink.ganglia.servers=239.2.11.71:8649
datanode.sink.ganglia.servers=239.2.11.71:8649
jobtracker.sink.ganglia.servers=239.2.11.71:8649
tasktracker.sink.ganglia.servers=239.2.11.71:8649
maptask.sink.ganglia.servers=239.2.11.71:8649
reducetask.sink.ganglia.servers=239.2.11.71:8649
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=239.2.11.71:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=239.2.11.71:8649
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=239.2.11.71:8649
4) scp conf/hadoop-metrics2.properties root@devcluster02:/home/ysc/hadoop-1.1.1/conf/hadoop-metrics2.properties
5) scp conf/hadoop-metrics2.properties root@devcluster03:/home/ysc/hadoop-1.1.1/conf/hadoop-metrics2.properties
6) stop-all.sh
7) start-all.sh
7. Integrate HBase
1) ssh devcluster01
2) cd /home/ysc/hbase-0.92.2
3) vi conf/hadoop-metrics.properties (only the multicast address 239.2.11.71 works)
hbase.extendedperiod = 3600
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=239.2.11.71:8649
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=239.2.11.71:8649
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=239.2.11.71:8649
4) scp conf/hadoop-metrics.properties root@devcluster02:/home/ysc/hbase-0.92.2/conf/hadoop-metrics.properties
5) scp conf/hadoop-metrics.properties root@devcluster03:/home/ysc/hbase-0.92.2/conf/hadoop-metrics.properties
6) stop-hbase.sh
7) start-hbase.sh
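A quick way to confirm that each gmond daemon is up is to connect to its TCP port and look at the metric XML it dumps. This check is my own addition; it assumes port 8649 from the configuration above and that netcat (nc) is installed:

# each gmond answers a plain TCP connection on 8649 with an XML dump of its metrics
for h in devcluster01 devcluster02 devcluster03; do
    echo "== $h =="
    nc $h 8649 | head -5
done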
Part 8: Configuring Snappy compression for Hadoop

1. wget
2. tar -xzvf snappy-1.0.5.tar.gz
3. cd snappy-1.0.5
4. ./configure
5. make
6. make install
7. scp /usr/local/lib/libsnappy* devcluster01:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
scp /usr/local/lib/libsnappy* devcluster02:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
scp /usr/local/lib/libsnappy* devcluster03:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
8. vi /etc/profile and append:
export LD_LIBRARY_PATH=/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64
9. Edit mapred-site.xml:
<property> <name>mapred.output.compression.type</name> <value>BLOCK</value> <description>If the job outputs are to compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK. </description> </property>
<property> <name>mapred.output.compress</name> <value>true</value> <description>Should the job outputs be compressed? </description> </property>
<property> <name>mapred.compress.map.output</name> <value>true</value> <description>Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression. </description> </property>
<property> <name>mapred.map.output.compression.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> <description>If the map outputs are compressed, how should they be compressed? </description> </property>
<property> <name>mapred.output.compression.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> <description>If the job outputs are compressed, how should they be compressed? </description> </property>
Part 9: Configuring LZO compression for Hadoop

1. wget
2. tar -zxvf lzo-2.06.tar.gz
3. cd lzo-2.06
4. ./configure --enable-shared
5. make
6. make install
7. scp /usr/local/lib/liblzo2.* devcluster01:/lib/x86_64-linux-gnu
scp /usr/local/lib/liblzo2.* devcluster02:/lib/x86_64-linux-gnu
scp /usr/local/lib/liblzo2.* devcluster03:/lib/x86_64-linux-gnu
8. wget
9. tar -xzvf hadoop-gpl-compression-0.1.0-rc0.tar.gz
10. cd hadoop-gpl-compression-0.1.0
11. cp lib/native/Linux-amd64-64/* /home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
12. cp hadoop-gpl-compression-0.1.0.jar /home/ysc/hadoop-1.1.1/lib/ (the Hadoop version used by the cluster must match the version the compression library was built against)
13. scp -r /home/ysc/hadoop-1.1.1/lib devcluster02:/home/ysc/hadoop-1.1.1/
scp -r /home/ysc/hadoop-1.1.1/lib devcluster03:/home/ysc/hadoop-1.1.1/
14. vi /etc/profile and append:
export LD_LIBRARY_PATH=/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64
15. Edit core-site.xml:
<property> <name>io.compression.codecs</name> <value>com.hadoop.compression.lzo.LzoCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value> <description>A list of the compression codec classes that can be used for compression/decompression.</description> </property>
<property> <name>io.compression.codec.lzo.class</name> <value>com.hadoop.compression.lzo.LzoCodec</value> </property>
<property> <name>fs.trash.interval</name> <value>1440</value> <description>Number of minutes between trash checkpoints. If zero, the trash feature is disabled. </description> </property>
16. Edit mapred-site.xml:
<property> <name>mapred.output.compression.type</name> <value>BLOCK</value> <description>If the job outputs are to compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK. </description> </property>
<property> <name>mapred.output.compress</name> <value>true</value> <description>Should the job outputs be compressed? </description> </property>
<property> <name>mapred.compress.map.output</name> <value>true</value> <description>Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression. </description> </property>
<property> <name>mapred.map.output.compression.codec</name> <value>com.hadoop.compression.lzo.LzoCodec</value> <description>If the map outputs are compressed, how should they be compressed? </description> </property>
<property> <name>mapred.output.compression.codec</name> <value>com.hadoop.compression.lzo.LzoCodec</value> <description>If the job outputs are compressed, how should they be compressed? </description> </property>
Part 10: Configuring a ZooKeeper cluster to run HBase

1. ssh devcluster01
2. cd /home/ysc
3. wget
4. tar -zxvf zookeeper-3.4.5.tar.gz
5. cd zookeeper-3.4.5
6. cp conf/zoo_sample.cfg conf/zoo.cfg
7. vi conf/zoo.cfg
Change: dataDir=/home/ysc/zookeeper
Add:
server.1=devcluster01:2888:3888
server.2=devcluster02:2888:3888
server.3=devcluster03:2888:3888
maxClientCnxns=100
8. scp -r zookeeper-3.4.5 devcluster01:/home/ysc
scp -r zookeeper-3.4.5 devcluster02:/home/ysc
scp -r zookeeper-3.4.5 devcluster03:/home/ysc
9. On each of the three machines:
ssh devcluster01
mkdir /home/ysc/zookeeper (note: dataDir is ZooKeeper's data directory and must be created by hand)
echo 1 > /home/ysc/zookeeper/myid
ssh devcluster02
mkdir /home/ysc/zookeeper
echo 2 > /home/ysc/zookeeper/myid
ssh devcluster03
mkdir /home/ysc/zookeeper
echo 3 > /home/ysc/zookeeper/myid
10. On each of the three machines:
cd /home/ysc/zookeeper-3.4.5
bin/zkServer.sh start
bin/zkCli.sh -server devcluster01:2181
bin/zkServer.sh status
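With all three servers started, the ensemble should report one leader and two followers. The following check is an addition of mine, assuming passwordless SSH from Part 4 and the install path used above:

# ask every node for its role in the ensemble (expect one "leader" and two "follower")
for h in devcluster01 devcluster02 devcluster03; do
    echo "== $h =="
    ssh $h /home/ysc/zookeeper-3.4.5/bin/zkServer.sh status
done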
Part 11: Configuring an HBase cluster to run nutch-2.1 (region servers can crash because of memory problems)

1. nutch-2.1 uses gora-0.2.1, and gora-0.2.1 uses hbase-0.90.4; hbase-0.90.4 is not compatible with hadoop-1.1.1, and hbase-0.94.4 is not compatible with gora-0.2.1, but hbase-0.92.2 works. HBase also requires the system clocks to be synchronized, with a skew of no more than 30 s:
sudo apt-get install ntp
sudo ntpdate -u 210.72.145.44
2. HBase is a database and uses many file handles at the same time. The default limit of 1024 on most Linux systems is not enough. The nproc limit for the hbase user also needs to be raised; if it is too low, OutOfMemoryError occurs under load.
vi /etc/security/limits.conf and add:
ysc soft nproc 32000
ysc hard nproc 32000
ysc soft nofile 32768
ysc hard nofile 32768
vi /etc/pam.d/common-session and add:
session required pam_limits.so
3. Log in to the master, download and unpack HBase:
ssh devcluster01
cd /home/ysc
wget
tar -zxvf hbase-0.92.2.tar.gz
cd hbase-0.92.2
4. Edit hbase-env.sh:
vi conf/hbase-env.sh and append:
export JAVA_HOME=/home/ysc/jdk1.7.0_05
export HBASE_MANAGES_ZK=false
export HBASE_HEAPSIZE=10000
# replace the default garbage collector, which causes more waiting in multi-threaded environments
export HBASE_OPTS="-server -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
5. Edit hbase-site.xml:
vi conf/hbase-site.xml
<property> <name>hbase.rootdir</name> <value>hdfs://devcluster01:9000/hbase</value> </property>
<property> <name>hbase.cluster.distributed</name> <value>true</value> </property>
<property> <name>hbase.zookeeper.quorum</name> <value>devcluster01,devcluster02,devcluster03</value> </property>
<property> <name>hfile.block.cache.size</name> <value>0.25</value> <description> Percentage of maximum heap (-Xmx setting) to allocate to block cache used by HFile/StoreFile. Default of 0.25 means allocate 25%. Set to 0 to disable but it's not recommended. </description> </property>
<property> <name>hbase.regionserver.global.memstore.upperLimit</name> <value>0.4</value> <description>Maximum size of all memstores in a region server before new updates are blocked and flushes are forced. Defaults to 40% of heap </description> </property>
<property> <name>hbase.regionserver.global.memstore.lowerLimit</name> <value>0.35</value> <description>When memstores are being forced to flush to make room in memory, keep flushing until we hit this mark. Defaults to 35% of heap. This value equal to hbase.regionserver.global.memstore.upperLimit causes the minimum possible flushing to occur when updates are blocked due to memstore limiting. </description> </property>
<property> <name>hbase.hregion.majorcompaction</name> <value>0</value> <description>The time (in miliseconds) between 'major' compactions of all HStoreFiles in a region. Default: 1 day. Set to 0 to disable automated major compactions.
</description> </property>
6. Edit regionservers:
vi conf/regionservers
devcluster01
devcluster02
devcluster03
7. Because HBase runs on top of Hadoop, the hadoop*.jar used by Hadoop and the one shipped with HBase must match. Replace the hadoop*.jar in HBase's lib directory with the one from the Hadoop installation to avoid version conflicts:
cp /home/ysc/hadoop-1.1.1/hadoop-core-1.1.1.jar /home/ysc/hbase-0.92.2/lib
rm /home/ysc/hbase-0.92.2/lib/hadoop-core-1.0.3.jar
8. Copy the files to the regionservers:
scp -r /home/ysc/hbase-0.92.2 devcluster01:/home/ysc
scp -r /home/ysc/hbase-0.92.2 devcluster02:/home/ysc
scp -r /home/ysc/hbase-0.92.2 devcluster03:/home/ysc
9. Start Hadoop and create the HBase directory:
hadoop fs -mkdir /hbase
10. Manage the HBase cluster:
Start the initial HBase cluster: bin/start-hbase.sh
Stop the HBase cluster: bin/stop-hbase.sh
Start additional backup masters (up to 9 backups, 10 in total):
bin/local-master-backup.sh start 1
bin/local-master-backup.sh start 2 3
Start more regionservers (up to 99 extra, 100 in total):
bin/local-regionservers.sh start 1
bin/local-regionservers.sh start 2 3 4 5
Stop a backup master:
cat /tmp/hbase-ysc-1-master.pid | xargs kill -9
Stop a single regionserver:
bin/local-regionservers.sh stop 1
Use the HBase shell:
bin/hbase shell
11. Web UI
12. To run nutch2.1, method one:
cp conf/hbase-site.xml /home/ysc/nutch-2.1/conf
cd /home/ysc/nutch-2.1
ant
cd runtime/deploy
unzip -d apache-nutch-2.1 apache-nutch-2.1.job
rm apache-nutch-2.1.job
cd apache-nutch-2.1
rm lib/hbase-0.90.4.jar
cp /home/ysc/hbase-0.92.2/hbase-0.92.2.jar lib
zip -r ../apache-nutch-2.1.job ./*
cd ..
rm -r apache-nutch-2.1
13. To run nutch2.1, method two:
cp conf/hbase-site.xml /home/ysc/nutch-2.1/conf
cd /home/ysc/nutch-2.1
cp /home/ysc/hbase-0.92.2/hbase-0.92.2.jar lib
ant
cd runtime/deploy
zip -d apache-nutch-2.1.job lib/hbase-0.90.4.jar
Enable snappy compression:
1. vi conf/gora-hbase-mapping.xml and add the attribute compression="SNAPPY" to the family elements
2. mkdir /home/ysc/hbase-0.92.2/lib/native/Linux-amd64-64
3. cp /home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/* /home/ysc/hbase-0.92.2/lib/native/Linux-amd64-64
4. vi /home/ysc/hbase-0.92.2/conf/hbase-site.xml and add:
<property> <name>hbase.regionserver.codecs</name> <value>snappy</value> </property>
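To confirm that the cluster (and, after the change above, the snappy codec) is usable, a short smoke test from the HBase shell helps. This is my own addition and assumes the hbase-0.92.2 install directory used above:

# create a test table, write and read one cell, then drop the table again
cd /home/ysc/hbase-0.92.2
bin/hbase shell <<'EOF'
create 'smoketest', 'cf'
put 'smoketest', 'row1', 'cf:greeting', 'hello hbase'
scan 'smoketest'
disable 'smoketest'
drop 'smoketest'
EOF
# verify that the native snappy library can be loaded (only relevant after enabling snappy)
bin/hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/snappy-test snappy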
Part 12: Configuring an Accumulo cluster to run nutch-2.1 (there is a bug in Gora here)

1. wget
2. tar -xzvf accumulo-1.4.2-dist.tar.gz
3. cd accumulo-1.4.2
4. cp conf/examples/3GB/standalone/* conf
5. vi conf/accumulo-env.sh
export HADOOP_HOME=/home/ysc/cluster3
export ZOOKEEPER_HOME=/home/ysc/zookeeper-3.4.5
export JAVA_HOME=/home/jdk1.7.0_01
export ACCUMULO_HOME=/home/ysc/accumulo-1.4.2
6. vi conf/slaves
devcluster01
devcluster02
devcluster03
7. vi conf/masters
devcluster01
8. vi conf/accumulo-site.xml
<property> <name>instance.zookeeper.host</name> <value>host6:2181,host8:2181</value> <description>comma separated list of zookeeper servers</description> </property>
<property> <name>logger.dir.walog</name> <value>walogs</value> <description>The directory used to store write-ahead logs on the local filesystem. It is possible to specify a comma-separated list of directories.</description> </property>
<property> <name>instance.secret</name> <value>ysc</value> <description>A secret unique to a given instance that all servers must know in order to communicate with one another. Change it before initialization. To change it later use ./bin/accumulo org.apache.accumulo.server.util.ChangeSecret [oldpasswd] [newpasswd], and then update this file. </description> </property>
<property> <name>tserver.memory.maps.max</name> <value>3G</value> </property>
<property> <name>tserver.cache.data.size</name> <value>50M</value> </property>
<property> <name>tserver.cache.index.size</name> <value>512M</value> </property>
<property> <name>trace.password</name> <!-- change this to the root user's password, and/or change the user below --> <value>ysc</value> </property>
<property> <name>trace.user</name> <value>root</value> </property>
9. bin/accumulo init
10. bin/start-all.sh
11. bin/stop-all.sh
12. Web UI:
Changes to nutch2.1:
1. cd /home/ysc/nutch-2.1
2. vi conf/gora.properties and add:
gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore
gora.datastore.accumulo.mock=false
gora.datastore.accumulo.instance=accumulo
gora.datastore.accumulo.zookeepers=host6,host8
gora.datastore.accumulo.user=root
gora.datastore.accumulo.password=ysc
3. vi conf/nutch-site.xml and add:
<property> <name>storage.data.store.class</name> <value>org.apache.gora.accumulo.store.AccumuloStore</value> </property>
4. vi ivy/ivy.xml and add:
<dependency org="org.apache.gora" name="gora-accumulo" rev="0.2.1" conf="*->default" />
5. Upgrade accumulo:
cp /home/ysc/accumulo-1.4.2/lib/accumulo-core-1.4.2.jar /home/ysc/nutch-2.1/lib
cp /home/ysc/accumulo-1.4.2/lib/accumulo-start-1.4.2.jar /home/ysc/nutch-2.1/lib
cp /home/ysc/accumulo-1.4.2/lib/cloudtrace-1.4.2.jar /home/ysc/nutch-2.1/lib
6. ant
7. cd runtime/deploy
8. Remove the old jars:
zip -d apache-nutch-2.1.job lib/accumulo-core-1.4.0.jar
zip -d apache-nutch-2.1.job lib/accumulo-start-1.4.0.jar
zip -d apache-nutch-2.1.job lib/cloudtrace-1.4.2.jar
Part 13: Configuring a Cassandra cluster to run nutch-2.1 (Cassandra uses a decentralized architecture)

1. vi /etc/hosts (note: log in to every machine and make localhost resolve to the real address)
192.168.1.1 localhost
2. wget
3. tar -xzvf apache-cassandra-1.2.0-bin.tar.gz
4. cd apache-cassandra-1.2.0
5. vi conf/cassandra-env.sh and add:
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="800M"
6. vi conf/log4j-server.properties and change:
log4j.appender.R.File=/home/ysc/cassandra/system.log
7. vi conf/cassandra.yaml and change:
cluster_name: 'Cassandra Cluster'
data_file_directories:
    - /home/ysc/cassandra/data
commitlog_directory: /home/ysc/cassandra/commitlog
saved_caches_directory: /home/ysc/cassandra/saved_caches
- seeds: "192.168.1.1"
listen_address: 192.168.1.1
rpc_address: 192.168.1.1
thrift_framed_transport_size_in_mb: 1023
thrift_max_message_length_in_mb: 1024
8. vi bin/stop-server and add:
user=`whoami`
pgrep -u $user -f cassandra | xargs kill -9
9. Copy cassandra to the other nodes:
cd ..
scp -r apache-cassandra-1.2.0 devcluster02:/home/ysc
scp -r apache-cassandra-1.2.0 devcluster03:/home/ysc
On devcluster02 and devcluster03 respectively, edit conf/cassandra.yaml:
listen_address: 192.168.1.2
rpc_address: 192.168.1.2
and
listen_address: 192.168.1.3
rpc_address: 192.168.1.3
10. On each of the three nodes start Cassandra with either:
bin/cassandra
or
bin/cassandra -f
The -f flag keeps Cassandra running in the foreground, which helps with debugging and watching the logs; in a production environment it is not needed (Cassandra then runs as a daemon).
11. bin/nodetool -host devcluster01 ring
bin/nodetool -host devcluster01 info
12. bin/stop-server
13. bin/cassandra-cli
Changes to nutch2.1:
1. cd /home/ysc/nutch-2.1
2. vi conf/gora.properties and add:
gora.cassandrastore.servers=host2:9160,host6:9160,host8:9160
3. vi conf/nutch-site.xml and add:
<property> <name>storage.data.store.class</name> <value>org.apache.gora.cassandra.store.CassandraStore</value> </property>
4. vi ivy/ivy.xml and add:
<dependency org="org.apache.gora" name="gora-cassandra" rev="0.2.1" conf="*->default" />
5. Upgrade cassandra:
cp /home/ysc/apache-cassandra-1.2.0/lib/apache-cassandra-1.2.0.jar /home/ysc/nutch-2.1/lib
cp /home/ysc/apache-cassandra-1.2.0/lib/apache-cassandra-thrift-1.2.0.jar /home/ysc/nutch-2.1/lib
cp /home/ysc/apache-cassandra-1.2.0/lib/jline-1.0.jar /home/ysc/nutch-2.1/lib
6. ant
7. cd runtime/deploy
8. Remove the old jars:
zip -d apache-nutch-2.1.job lib/cassandra-thrift-1.1.2.jar
zip -d apache-nutch-2.1.job lib/jline-0.9.1.jar
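After the first crawl job runs, Gora should have created its keyspace in Cassandra. A quick way to check, added here purely as an illustration and assuming the install path and the devcluster01 node from above, is to list the keyspaces from the CLI:

# list the keyspaces; the one created by gora-cassandra should appear after the first crawl
cd /home/ysc/apache-cassandra-1.2.0
echo "show keyspaces;" | bin/cassandra-cli -h devcluster01 -p 9160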
Part 14: Configuring a standalone MySQL server to run nutch-2.1

1. apt-get install mysql-server mysql-client
2. vi /etc/mysql/my.cnf
Change: bind-address = 221.194.43.2
Under [client] add: default-character-set=utf8
Under [mysqld] add: default-character-set=utf8
3. mysql -uroot -pysc
SHOW VARIABLES LIKE '%character%';
4. service mysql restart
5. mysql -uroot -pysc
GRANT ALL PRIVILEGES ON *.* TO root@"%" IDENTIFIED BY "ysc";
6. vi conf/gora-sql-mapping.xml and change the field lengths:
<primarykey column="id" length="333"/>
<field name="content" column="content" />
<field name="text" column="text" length="19892"/>
7. After starting nutch, log in to mysql and run:
ALTER TABLE webpage MODIFY COLUMN content MEDIUMBLOB;
ALTER TABLE webpage MODIFY COLUMN text MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN title MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN reprUrl MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN baseUrl MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN typ MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN inlinks MEDIUMBLOB;
ALTER TABLE webpage MODIFY COLUMN outlinks MEDIUMBLOB;
Changes to nutch2.1:
1. cd /home/ysc/nutch-2.1
2. vi conf/gora.properties and add:
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://host2:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=ysc
3. vi conf/nutch-site.xml and add:
<property> <name>storage.data.store.class</name> <value>org.apache.gora.sql.store.SqlStore</value> </property>
<property> <name>encodingdetector.charset.min.confidence</name> <value>1</value> <description>A integer between 0-100 indicating minimum confidence value for charset auto-detection. Any negative value disables auto-detection. </description> </property>
4. vi ivy/ivy.xml and add:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
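Once a crawl has run, the webpage table should exist in the database named in the JDBC URL (nutch, created automatically thanks to createDatabaseIfNotExist=true). A quick check of my own, assuming the root/ysc credentials configured above and that the command is run against the MySQL host:

# confirm the schema exists, the charset is utf8 and rows are being written
mysql -uroot -pysc -e "USE nutch; SHOW TABLES; SHOW CREATE TABLE webpage\G SELECT COUNT(*) FROM webpage;"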
Part 15: Using DataFileAvroStore as the data store for nutch2.1

1. cd /home/ysc/nutch-2.1
2. vi conf/gora.properties and add:
gora.datafileavrostore.output.path=datafileavrostore
gora.datafileavrostore.input.path=datafileavrostore
3. vi conf/nutch-site.xml and add:
<property> <name>storage.data.store.class</name> <value>org.apache.gora.avro.store.DataFileAvroStore</value> </property>
<property> <name>encodingdetector.charset.min.confidence</name> <value>1</value> <description>A integer between 0-100 indicating minimum confidence value for charset auto-detection. Any negative value disables auto-detection. </description> </property>
Part 16: Using AvroStore as the data store for nutch2.1

1. cd /home/ysc/nutch-2.1
2. vi conf/gora.properties and add:
gora.avrostore.codec.type=BINARY
gora.avrostore.input.path=avrostore
gora.avrostore.output.path=avrostore
3. vi conf/nutch-site.xml and add:
<property> <name>storage.data.store.class</name> <value>org.apache.gora.avro.store.AvroStore</value> </property>
<property> <name>encodingdetector.charset.min.confidence</name> <value>1</value> <description>A integer between 0-100 indicating minimum confidence value for charset auto-detection. Any negative value disables auto-detection. </description> </property>
Part 17: Configuring SOLR

Configure tomcat:
1. wget
2. tar -xzvf apache-tomcat-7.0.35.tar.gz
3. cd apache-tomcat-7.0.35
4. vi conf/server.xml and add URIEncoding="UTF-8":
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" URIEncoding="UTF-8"/>
5. mkdir conf/Catalina
6. mkdir conf/Catalina/localhost
7. vi conf/Catalina/localhost/solr.xml and add:
<Context path="/solr">
  <Environment name="solr/home" type="java.lang.String" value="/home/ysc/solr/configuration/" override="false"/>
</Context>
8. cd ..
Download SOLR:
1. wget
2. tar -xzvf solr-4.1.0.tgz
Copy the resources:
1. mkdir /home/ysc/solr
2. cp -r solr-4.1.0/example/solr /home/ysc/solr/configuration
3. unzip solr-4.1.0/example/webapps/solr.war -d /home/ysc/apache-tomcat-7.0.35/webapps/solr
Configure nutch:
1. Copy the schema:
cp /home/ysc/nutch-1.6/conf/schema-solr4.xml /home/ysc/solr/configuration/collection1/conf/schema.xml
2. vi /home/ysc/solr/configuration/collection1/conf/schema.xml and add under <fields>:
<field name="_version_" type="long" indexed="true" stored="true"/>
Configure Chinese word segmentation:
1. wget
2. unzip mmseg4j-1.9.1.v20130120-SNAPSHOT.zip
3. cp mmseg4j-1.9.1-SNAPSHOT/dist/* /home/ysc/apache-tomcat-7.0.35/webapps/solr/WEB-INF/lib
4. unzip mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT.jar -d mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT
5. mkdir /home/ysc/dic
6. cp mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT/data/* /home/ysc/dic
7. vi /home/ysc/solr/configuration/collection1/conf/schema.xml and replace
<tokenizer class="solr.WhitespaceTokenizerFactory"/> and <tokenizer class="solr.StandardTokenizerFactory"/>
with
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/home/ysc/dic"/>
Configure the Tomcat native library:
1. wget
2. tar -xzvf apr-1.4.6.tar.gz
3. cd apr-1.4.6
4. ./configure
5. make
6. make install

1. wget
2. tar -xzvf apr-util-1.5.1.tar.gz
3. cd apr-util-1.5.1
4. ./configure --with-apr=/usr/local/apr
5. make
6. make install

1. wget
2. tar -zxvf tomcat-native-1.1.24-src.tar.gz
3. cd tomcat-native-1.1.24-src/jni/native
4. ./configure --with-apr=/usr/local/apr \
   --with-java-home=/home/ysc/jdk1.7.0_01 \
   --with-ssl=no \
   --prefix=/home/ysc/apache-tomcat-7.0.35
5. make
6. make install
7. vi /etc/profile and add:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/ysc/apache-tomcat-7.0.35/lib:/usr/local/apr/lib
8. source /etc/profile
Start tomcat:
cd apache-tomcat-7.0.35
bin/catalina.sh start
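To check that the Solr webapp deployed correctly under Tomcat, a query against the default collection1 core can be issued once Tomcat is up. This is an illustrative addition of mine; it assumes the port 8080 connector and the /solr context path configured above:

# ping the core and run an empty match-all query (expect HTTP 200 and a JSON response)
curl "http://localhost:8080/solr/collection1/admin/ping"
curl "http://localhost:8080/solr/collection1/select?q=*:*&rows=5&wt=json&indent=true"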
Part 18: Nagios monitoring

Server side:
1. apt-get install apache2 nagios3 nagios-nrpe-plugin
Enter the password: nagiosadmin
2. apt-get install nagios3-doc
3. vi /etc/nagios3/conf.d/hostgroups_nagios2.cfg
define hostgroup {
        hostgroup_name  nagios-servers
        alias           nagios servers
        members         devcluster01,devcluster02,devcluster03
}
4. cp /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster01_nagios2.cfg
vi /etc/nagios3/conf.d/devcluster01_nagios2.cfg and substitute:
g/localhost/s//devcluster01/g
g/127.0.0.1/s//192.168.1.1/g
5. cp /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster02_nagios2.cfg
vi /etc/nagios3/conf.d/devcluster02_nagios2.cfg and substitute:
g/localhost/s//devcluster02/g
g/127.0.0.1/s//192.168.1.2/g
6. cp /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster03_nagios2.cfg
vi /etc/nagios3/conf.d/devcluster03_nagios2.cfg and substitute:
g/localhost/s//devcluster03/g
g/127.0.0.1/s//192.168.1.3/g
7. vi /etc/nagios3/conf.d/services_nagios2.cfg
Change hostgroup_name to nagios-servers and add:
# check that web services are running
define service {
        hostgroup_name          nagios-servers
        service_description     HTTP
        check_command           check_http
        use                     generic-service
        notification_interval   0 ; set > 0 if you want to be renotified
}
# check that ssh services are running
define service {
        hostgroup_name          nagios-servers
        service_description     SSH
        check_command           check_ssh
        use                     generic-service
        notification_interval   0 ; set > 0 if you want to be renotified
}
8. vi /etc/nagios3/conf.d/extinfo_nagios2.cfg
Change hostgroup_name to nagios-servers and add:
define hostextinfo{
        hostgroup_name   nagios-servers
        notes            nagios-servers
#       notes_url
        icon_image       base/debian.png
        icon_image_alt   Debian GNU/Linux
        vrml_image       debian.png
        statusmap_image  base/debian.gd2
}
9. sudo /etc/init.d/nagios3 restart
10. Open the Nagios web page. Username: nagiosadmin, password: nagiosadmin
Monitored (client) side:
1. apt-get install nagios-nrpe-server
2. vi /etc/nagios/nrpe.cfg and substitute:
g/127.0.0.1/s//192.168.1.1/g
3. sudo /etc/init.d/nagios-nrpe-server restart
Part 19: Configuring Splunk

1. wget
2. tar -zxvf splunk-5.0.2-149561-Linux-x86_64.tgz
3. cd splunk
4. bin/splunk start --answer-yes --no-prompt --accept-license
5. Open the Splunk web page. Username: admin, password: changeme
6. Add data -> From a UDP port -> UDP port: 1688 -> Source type: from list, log4j -> Save
7. Configure hadoop:
vi /home/ysc/hadoop-1.1.1/conf/log4j.properties
Change:
log4j.rootLogger=${hadoop.root.logger}, EventCounter, SYSLOG
Add:
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.facility=local1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n
log4j.appender.SYSLOG.SyslogHost=host6:1688
log4j.appender.SYSLOG.threshold=INFO
log4j.appender.SYSLOG.Header=true
log4j.appender.SYSLOG.FacilityPrinting=true
8. Configure hbase:
vi /home/ysc/hbase-0.92.2/conf/log4j.properties
Change:
log4j.rootLogger=${hbase.root.logger},SYSLOG
Add:
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.facility=local1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n
log4j.appender.SYSLOG.SyslogHost=host6:1688
log4j.appender.SYSLOG.threshold=INFO
log4j.appender.SYSLOG.Header=true
log4j.appender.SYSLOG.FacilityPrinting=true
9. Configure nutch:
vi /home/lanke/ysc/nutch-2.1-hbase/conf/log4j.properties
Change:
log4j.rootLogger=INFO,DRFA,SYSLOG
Add:
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.facility=local1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n
log4j.appender.SYSLOG.SyslogHost=host6:1688
log4j.appender.SYSLOG.threshold=INFO
log4j.appender.SYSLOG.Header=true
log4j.appender.SYSLOG.FacilityPrinting=true
10. Start hadoop and hbase:
start-all.sh
start-hbase.sh
Part 20: Configuring Pig

1. wget
2. tar -xzvf pig-0.11.0.tar.gz
3. cd pig-0.11.0
4. vi /etc/profile and add:
export PIG_HOME=/home/ysc/pig-0.11.0
export PATH=$PIG_HOME/bin:$PATH
5. source /etc/profile
6. cp conf/log4j.properties.template conf/log4j.properties
7. vi conf/log4j.properties
8. pig
Part 21: Configuring Hive

1. wget
2. tar -xzvf hive-0.10.0.tar.gz
3. cd hive-0.10.0
4. vi /etc/profile and add:
export HIVE_HOME=/home/ysc/hive-0.10.0
export PATH=$HIVE_HOME/bin:$PATH
5. source /etc/profile
6. cp conf/hive-log4j.properties.template conf/hive-log4j.properties
7. vi conf/hive-log4j.properties and replace:
log4j.appender.EventCounter=org.apache.hadoop.metrics.jvm.EventCounter
with:
log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
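A short smoke test confirms that Hive can talk to the Hadoop cluster and create tables. This sketch is my own addition and only assumes that hive is on the PATH as set above:

# create a throwaway table, list tables, then drop it again
hive -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT, name STRING); SHOW TABLES; DROP TABLE smoke_test;"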
Part 22: Configuring a Hadoop 2.x cluster

1. wget
2. tar -xzvf hadoop-2.0.2-alpha.tar.gz
3. cd hadoop-2.0.2-alpha
4. vi etc/hadoop/hadoop-env.sh and append:
export JAVA_HOME=/home/ysc/jdk1.7.0_05
export HADOOP_HEAPSIZE=2000
5. vi etc/hadoop/core-site.xml
<property> <name>fs.defaultFS</name> <value>hdfs://devcluster01:9000</value> <description> Where to find the Hadoop Filesystem through the network. Note 9000 is not the default port. (This is slightly changed from previous versions which didn't have "hdfs") </description> </property>
<property> <name>io.file.buffer.size</name> <value>131072</value> <description>The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.</description> </property>
6. vi etc/hadoop/mapred-site.xml
<property> <name>mapreduce.framework.name</name> <value>yarn</value> </property>
<property> <name>mapred.job.reduce.input.buffer.percent</name> <value>1</value> <description>The percentage of memory- relative to the maximum heap size- to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin. </description> </property>
<property> <name>mapred.job.shuffle.input.buffer.percent</name> <value>1</value> <description>The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle. </description> </property>
<property> <name>mapred.inmem.merge.threshold</name> <value>0</value> <description>The threshold, in terms of the number of files for the in-memory merge process. When we accumulate threshold number of files we initiate the in-memory merge and spill to disk. A value of 0 or less than 0 indicates we want to DON'T have any threshold and instead depend only on the ramfs's memory consumption to trigger the merge. </description> </property>
<property> <name>io.sort.factor</name> <value>100</value> <description>The number of streams to merge at once while sorting files. This determines the number of open file handles.</description> </property>
<property> <name>io.sort.mb</name> <value>240</value> <description>The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.</description> </property> <property> <name>mapred.map.output.compression.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> <description>If the map outputs are compressed, how should they be compressed? </description> </property>
<property> <name>mapred.output.compression.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> <description>If the job outputs are compressed, how should they be compressed? </description> </property> <property> <name>mapred.output.compression.type</name> <value>BLOCK</value> <description>If the job outputs are to compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK. </description> </property> <property> <name>mapred.child.java.opts</name> <value>-Xmx2000m</value> </property>
<property> <name>mapred.output.compress</name> <value>true</value> <description>Should the job outputs be compressed? </description> </property>
<property> <name>mapred.compress.map.output</name> <value>true</value> <description>Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression. </description> </property>
<property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>5</value> </property>
<property> <name>mapred.map.tasks</name> <value>15</value> </property>
<property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>5</value> <description> define mapred.map tasks to be number of slave hosts.the best number is the number of slave hosts plus the core numbers of per host </description> </property>
<property> <name>mapred.reduce.tasks</name> <value>15</value> <description> define mapred.reduce tasks to be number of slave hosts.the best number is the number of slave hosts plus the core numbers of per host </description> </property> <property> <name>mapred.system.dir</name> <value>/home/ysc/mapreduce/system</value> </property>
<property> <name>mapred.local.dir</name> <value>/home/ysc/mapreduce/local</value> </property>
<property> <name>mapreduce.job.counters.max</name> <value>12000</value> <description>Limit on the number of counters allowed per job. </description> </property>
7. vi etc/hadoop/yarn-site.xml
<property> <name>yarn.resourcemanager.resource-tracker.address</name> <value>devcluster01:8031</value> </property>
<property> <name>yarn.resourcemanager.address</name> <value>devcluster01:8032</value> </property>
<property> <name>yarn.resourcemanager.scheduler.address</name> <value>devcluster01:8030</value> </property>
<property> <name>yarn.resourcemanager.admin.address</name> <value>devcluster01:8033</value> </property>
<property> <name>yarn.resourcemanager.webapp.address</name> <value>devcluster01:8088</value> </property>
<property> <description>Classpath for typical applications.</description> <name>yarn.application.classpath</name> <value> $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*, $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*, $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*, $YARN_HOME/*,$YARN_HOME/lib/* </value> </property>
<property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce.shuffle</value> </property>
<property> <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name> <value>org.apache.hadoop.mapred.ShuffleHandler</value> </property>
<property> <name>yarn.nodemanager.local-dirs</name> <value>/home/ysc/h2/data/1/yarn/local,/home/ysc/h2/data/2/yarn/local,/home/ysc/h2/data/3/yarn/local</value> </property>
<property> <name>yarn.nodemanager.log-dirs</name> <value>/home/ysc/h2/data/1/yarn/logs,/home/ysc/h2/data/2/yarn/logs,/home/ysc/h2/data/3/yarn/logs</value> </property>
<property> <description>Where to aggregate logs</description> <name>yarn.nodemanager.remote-app-log-dir</name> <value>/home/ysc/h2/var/log/hadoop-yarn/apps</value> </property>
<property> <name>mapreduce.jobhistory.address</name> <value>devcluster01:10020</value> </property>
<property> <name>mapreduce.jobhistory.webapp.address</name> <value>devcluster01:19888</value> </property>
8. vi etc/hadoop/hdfs-site.xml
<property> <name>dfs.permissions.superusergroup</name> <value>root</value> </property>
<property> <name>dfs.name.dir</name> <value>/home/ysc/dfs/filesystem/name</value> </property>
<property> <name>dfs.data.dir</name> <value>/home/ysc/dfs/filesystem/data</value> </property>
<property> <name>dfs.replication</name> <value>3</value> </property>
<property> <name>dfs.block.size</name> <value>6710886400</value> <description>The default block size for new files.</description> </property>
9. Start hadoop:
bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh
10. Open the management web pages.

Source: http://user.qzone.qq.com/281032878
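As a final check that both HDFS and YARN came up, the stock example job can be run. This is an illustrative addition of mine, assuming the hadoop-2.0.2-alpha directory used above (the exact examples jar name under share/hadoop/mapreduce may differ between releases):

# confirm the datanodes and nodemanagers registered
bin/hdfs dfsadmin -report
bin/yarn node -list
# run the bundled pi estimator as an end-to-end MapReduce test
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100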