Created on 05-11-2017 11:54 AM - edited 09-16-2022 04:35 AM
Hi All,
I have a 3-node Cloudera cluster running Cloudera 5.9. I want to build a web crawler, so I want to install Apache Nutch.
Can anyone please guide me on how to install it on an existing Hadoop cluster (Hadoop version 2.6.0)?
I have downloaded the source tarball from http://www.apache.org/dyn/closer.lua/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
I extracted it, but when I look inside, I see only these files:
[hdfs@X.X.X.X bin]$ pwd
/var/lib/hadoop-hdfs/nutch/apache-nutch-2.3.1/src/bin
[hdfs@X.X.X.X bin]$ ll
total 20
-rwxr-xr-x 1 hdfs hadoop 5453 Jan 10 2016 crawl
-rwxr-xr-x 1 hdfs hadoop 8801 Jan 10 2016 nutch
[hdfs@X.X.X.X apache-nutch-2.3.1]$ ll
total 488
-rw-r--r-- 1 hdfs hadoop  46132 Jan 10 2016 build.xml
-rw-r--r-- 1 hdfs hadoop  82375 Jan 10 2016 CHANGES.txt
drwxr-xr-x 2 hdfs hadoop   4096 May 11 13:23 conf
-rw-r--r-- 1 hdfs hadoop   4903 Jan 10 2016 default.properties
drwxr-xr-x 3 hdfs hadoop   4096 Jan 10 2016 docs
drwxr-xr-x 2 hdfs hadoop   4096 May 11 13:23 ivy
drwxr-xr-x 3 hdfs hadoop   4096 Jan 10 2016 lib
-rw-r--r-- 1 hdfs hadoop 329066 Jan 10 2016 LICENSE.txt
-rw-r--r-- 1 hdfs hadoop    429 Jan 10 2016 NOTICE.txt
drwxr-xr-x 9 hdfs hadoop   4096 Jan 10 2016 src
Thanks,
Shilpa
Created 05-12-2017 02:13 PM
Nutch is installed.
For this I had to download Apache Ant and build the code from source. Make sure $JAVA_HOME is set correctly before building.
[hdfs@X.X.X.X apache-nutch-2.3.1]$ ant runtime
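For anyone repeating the build, here is a rough sketch of the checks worth doing before running the command above. The helper name check_java_home is made up for this example and is not part of Nutch or Ant. As far as I know, the build writes its output into a runtime directory with local and deploy subdirectories, but verify that against your own tree.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check the environment, then build Nutch from the source tree.
# check_java_home is a hypothetical helper, not part of Nutch or Ant.
check_java_home() {
  # Succeeds only if the given directory contains an executable bin/java
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if ! check_java_home "${JAVA_HOME:-}"; then
  echo "JAVA_HOME does not point at a Java installation" >&2
elif ! command -v ant >/dev/null 2>&1; then
  echo "ant not found on PATH" >&2
elif [ ! -f build.xml ]; then
  echo "run this from the extracted apache-nutch-2.3.1 directory" >&2
else
  ant runtime   # builds into ./runtime
  ls runtime    # assumption: shows 'local' and 'deploy' subdirectories
fi
```

The script only prints a message and does nothing else when a prerequisite is missing, so it is safe to run outside the source directory.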
Since I had to set it up with MongoDB, I made these changes in $NUTCH_HOME/conf/nutch-site.xml:
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>
Ensure the gora-mongodb dependency is available in $NUTCH_HOME/ivy/ivy.xml by uncommenting the following line in that file:
$ vim $NUTCH_HOME/ivy/ivy.xml
...
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.5" conf="*->default" />
...
</dependency>
Also ensure that MongoStore is set as the default datastore in $NUTCH_HOME/conf/gora.properties, and fill in the connection details for your MongoDB instance there.
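As an illustration of the gora.properties step, a minimal sketch follows. The property names are the ones I remember from the sample gora.properties shipped with Nutch 2.x, so please check them against your copy. The server address localhost:27017 and the database name nutch are placeholders, and the temp-directory fallback for NUTCH_HOME is only there so the snippet can be tried without touching a real install.

```shell
#!/usr/bin/env bash
# Sketch: append MongoDB settings to gora.properties.
# Assumptions: NUTCH_HOME falls back to a scratch directory if unset;
# the host:port and database name below are placeholders.
NUTCH_HOME="${NUTCH_HOME:-$(mktemp -d)}"
GORA_PROPS="$NUTCH_HOME/conf/gora.properties"
mkdir -p "$NUTCH_HOME/conf"

cat >> "$GORA_PROPS" <<'EOF'
# MongoDB as the default Gora datastore
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
EOF

echo "Updated $GORA_PROPS"
```

Because the configuration is bundled at build time, it is probably safest to rebuild with ant runtime after changing files under conf or ivy.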
Thanks,
Shilpa
Created 05-19-2017 12:18 PM
Although Nutch is installed, it is NOT running on Hadoop. It is only installed locally on the VM.
Can anyone help me run Nutch on top of the existing Hadoop cluster?
Created 08-15-2017 11:54 PM
1. hadoop fs -put <url folder> <target>
2. hadoop jar <deployment-jar> <classname> other_params
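The two steps above could look roughly like the sketch below for the inject phase. Everything specific in it is an assumption on my part: the job file path runtime/deploy/apache-nutch-2.3.1.job, the class org.apache.nutch.crawl.InjectorJob, and the seed folder name urls. The run wrapper and the DRY_RUN switch are invented for this example; with the default DRY_RUN=1 the script only prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch: inject seed URLs through the Nutch job file on an existing Hadoop cluster.
# Assumptions: the job file path, class name and seed folder are placeholders
# to check against your own build; DRY_RUN=1 (default) only prints commands.
NUTCH_HOME="${NUTCH_HOME:-/path/to/apache-nutch-2.3.1}"
JOB_FILE="$NUTCH_HOME/runtime/deploy/apache-nutch-2.3.1.job"
SEED_DIR="${SEED_DIR:-urls}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print the command in dry-run mode, otherwise execute it
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# 1. Copy the local seed folder into HDFS
run hadoop fs -put "$SEED_DIR" "$SEED_DIR"

# 2. Submit the inject step as a Hadoop job
run hadoop jar "$JOB_FILE" org.apache.nutch.crawl.InjectorJob "$SEED_DIR"
```

Set DRY_RUN=0 once the printed commands look right for your cluster. The later crawl phases would follow the same hadoop jar pattern with their own class names.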