Install and run Apache Nutch on existing Hadoop cluster

Expert Contributor

Hi all,

I have a 3-node Cloudera cluster running Cloudera 5.9. I want to build a web crawler, so I want to install Apache Nutch.

Can anyone please guide me on how to install it on an existing Hadoop cluster (Hadoop version 2.6.0)?

I downloaded the tarball from http://www.apache.org/dyn/closer.lua/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz

and extracted it, but when I look inside, I see only these files:

 

 

[hdfs@X.X.X.X bin]$ pwd
/var/lib/hadoop-hdfs/nutch/apache-nutch-2.3.1/src/bin
[hdfs@X.X.X.X bin]$ ll
total 20
-rwxr-xr-x 1 hdfs hadoop 5453 Jan 10 2016 crawl
-rwxr-xr-x 1 hdfs hadoop 8801 Jan 10 2016 nutch

[hdfs@X.X.X.X apache-nutch-2.3.1]$ ll
total 488
-rw-r--r-- 1 hdfs hadoop 46132 Jan 10 2016 build.xml
-rw-r--r-- 1 hdfs hadoop 82375 Jan 10 2016 CHANGES.txt
drwxr-xr-x 2 hdfs hadoop 4096 May 11 13:23 conf
-rw-r--r-- 1 hdfs hadoop 4903 Jan 10 2016 default.properties
drwxr-xr-x 3 hdfs hadoop 4096 Jan 10 2016 docs
drwxr-xr-x 2 hdfs hadoop 4096 May 11 13:23 ivy
drwxr-xr-x 3 hdfs hadoop 4096 Jan 10 2016 lib
-rw-r--r-- 1 hdfs hadoop 329066 Jan 10 2016 LICENSE.txt
-rw-r--r-- 1 hdfs hadoop 429 Jan 10 2016 NOTICE.txt
drwxr-xr-x 9 hdfs hadoop 4096 Jan 10 2016 src

Thanks,

Shilpa

1 ACCEPTED SOLUTION

Expert Contributor

Nutch is installed.

 

For this I had to download Ant and build the code. Make sure $JAVA_HOME is set correctly.

[hdfs@X.X.X.X apache-nutch-2.3.1]$ ant runtime
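
The build step can be wrapped in a small pre-flight check so that a wrong $JAVA_HOME fails early rather than halfway through the Ant run. This is only a sketch: the function name `build_nutch` and the `NUTCH_SRC` variable are made up for illustration, and it assumes `ant` is already on the PATH.

```shell
# Sketch: build Nutch from the extracted source tree with Ant.
# build_nutch and NUTCH_SRC are hypothetical names, not part of Nutch.
build_nutch() {
    src_dir="${1:-${NUTCH_SRC:-.}}"

    # Ant needs a working Java installation; stop early if JAVA_HOME is wrong.
    if [ -z "${JAVA_HOME:-}" ] || [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "JAVA_HOME is not set to a valid Java installation" >&2
        return 1
    fi

    # The source tarball ships build.xml at the top level (see the listing above).
    if [ ! -f "$src_dir/build.xml" ]; then
        echo "build.xml not found in $src_dir" >&2
        return 1
    fi

    # 'ant runtime' builds the code and creates the runtime/ directory
    # (runtime/local and runtime/deploy) under the source tree.
    (cd "$src_dir" && ant runtime)
}
```

It would be called as `build_nutch /var/lib/hadoop-hdfs/nutch/apache-nutch-2.3.1`, using the path from the question.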

Because I had to set it up with MongoDB, I made these changes in $NUTCH_HOME/conf/nutch-site.xml:

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>

Make sure the gora-mongodb dependency is available in $NUTCH_HOME/ivy/ivy.xml by uncommenting the line below in that file:

$ vim $NUTCH_HOME/ivy/ivy.xml
...
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.5" conf="*->default" />
...
</dependencies>

 

Also ensure that MongoStore is set as the default datastore in $NUTCH_HOME/conf/gora.properties, and fill in the connection details for your MongoDB instance.
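
For reference, the gora.properties entries for MongoDB usually look like the ones below. This is a sketch based on the property names used in the Nutch 2.x MongoDB setup guide. The server address `localhost:27017` and database name `nutch` are placeholder assumptions to replace with your own values, and `write_gora_mongo_props` is a hypothetical helper name.

```shell
# Sketch: append MongoDB datastore settings to a gora.properties file.
# write_gora_mongo_props is a hypothetical helper, not part of Nutch.
write_gora_mongo_props() {
    props_file="$1"
    # Host, port and database name are placeholders; adjust to your MongoDB.
    cat >> "$props_file" <<'EOF'
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
EOF
}
```

It would be called as `write_gora_mongo_props "$NUTCH_HOME/conf/gora.properties"`. As far as I know the conf directory is copied into runtime/ at build time, so `ant runtime` likely needs to be run again after editing it.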

 

Thanks,

Shilpa

 


3 REPLIES 3

Expert Contributor

Although Nutch is installed, it is NOT running on Hadoop. It is only installed locally on the VM.

Can anyone help me run Nutch on top of the existing Hadoop cluster?

New Contributor

1. Copy the seed URL folder to HDFS: hadoop fs -put <url folder> <target>

2. Submit the Nutch job to the cluster: hadoop jar <deployment-jar> <classname> other_params
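
The two steps above can be sketched for the Nutch 2.3.1 inject step as follows. This is a guess at concrete values, not a verified recipe. It assumes `ant runtime` has produced `runtime/deploy/apache-nutch-2.3.1.job`, that the job class is `org.apache.nutch.crawl.InjectorJob`, and that `urls` is an acceptable HDFS target name. The helper `nutch_deploy_commands` is hypothetical and only prints the commands so they can be reviewed before running.

```shell
# Sketch: print the two commands for injecting seed URLs on the cluster.
# nutch_deploy_commands is a hypothetical helper; the HDFS target "urls",
# the .job file name and the class name are assumptions to verify locally.
nutch_deploy_commands() {
    nutch_src="$1"   # extracted and built Nutch source tree
    seed_dir="$2"    # local folder containing the seed URL file(s)
    crawl_id="$3"    # identifier for this crawl

    job_file="$nutch_src/runtime/deploy/apache-nutch-2.3.1.job"

    # Step 1: copy the seed folder to HDFS.
    echo "hadoop fs -put $seed_dir urls"
    # Step 2: submit the job file to Hadoop with the inject class.
    echo "hadoop jar $job_file org.apache.nutch.crawl.InjectorJob urls -crawlId $crawl_id"
}
```

For example, `nutch_deploy_commands /var/lib/hadoop-hdfs/nutch/apache-nutch-2.3.1 seeds crawl1` prints both commands. The `bin/nutch` and `bin/crawl` scripts under runtime/deploy are, as far as I know, wrappers around the same `hadoop jar` call, so running them from that directory on a node with the `hadoop` command available may be simpler.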