nutch web crawling using hbase in hortonworks

hadoopsmi — Mon, 29 Feb 2016 12:36:16 GMT

I want to crawl web URLs using Nutch and store the data in HBase. Can anyone suggest how to do this, with an example? I am new to Nutch.

Re: nutch web crawling using hbase in hortonworks

nsabharwal — Mon, 29 Feb 2016 12:57:47 GMT

@sivasaravanakumar k Off topic: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_search/index.html

Nutch: http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial

You can use the same approach for a multi-node cluster.
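Since the question is specifically about storing crawl data in HBase, here is a minimal sketch of the configuration that Nutch 2.x (the Gora-based line) generally needs for that. It only writes example files into a scratch directory; the directory name `nutch-conf-sketch` and the agent name `MyTestCrawler` are assumptions for illustration, and the contents would need to be merged by hand into the `conf/` directory of a real Nutch 2.x source tree before building it.

```shell
#!/bin/sh
# Sketch only: generates example config for a Nutch 2.x + HBase setup.
# CONF_DIR is a hypothetical scratch location, not a path Nutch reads by itself.
CONF_DIR="${CONF_DIR:-./nutch-conf-sketch}"
mkdir -p "$CONF_DIR"

# nutch-site.xml: select the Gora HBase store as the storage backend.
# http.agent.name must also be set, otherwise the fetcher refuses to run.
cat > "$CONF_DIR/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>
EOF

# gora.properties: make HBase the default Gora datastore.
echo "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore" \
  > "$CONF_DIR/gora.properties"

echo "Example config written to $CONF_DIR"
```

In a source checkout, the `gora-hbase` dependency in `ivy/ivy.xml` usually also has to be enabled before running `ant runtime`, and HBase must be running and reachable from the machine that runs Nutch.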

Re: nutch web crawling using hbase in hortonworks

nsabharwal — Mon, 29 Feb 2016 13:00:07 GMT

@sivasaravanakumar k FYI: Nutch is not part of the HDP stack.

Re: nutch web crawling using hbase in hortonworks

hadoopsmi — Mon, 29 Feb 2016 14:39:59 GMT

I got this error message:

[root@sandbox ~]# bin/nutch fetch 1456727546-2019589981

Exception in thread "main" java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_local522155708_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:205)
    at org.apache.nutch.fetcher.FetcherJob.fetch(FetcherJob.java:251)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:314)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.fetcher.FetcherJob.main(FetcherJob.java:322)

Re: nutch web crawling using hbase in hortonworks

nsabharwal — Mon, 29 Feb 2016 15:49:29 GMT

@sivasaravanakumar k http://nutch.apache.org/

Recommended: Apache Hadoop 2.5.2

I highly recommend taking a look at this: http://stackoverflow.com/questions/4269632/an-alternative-web-crawler-to-nutch

Nutch tutorial: http://cs.boisestate.edu/~amit/research/nutch/Nutch-Hadoop-Cluster-Howto.html
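The stack trace posted earlier only says that the local fetch job failed. In a local Nutch runtime the underlying cause is normally written to `logs/hadoop.log`. As a rough sketch, one crawl round on Nutch 2.x, with a look at that log on failure, could be scripted as follows. `NUTCH_BIN`, `SEED_DIR`, the seed URL and the `-topN` value are assumptions for illustration; the script is meant to be run from `runtime/local` of a built Nutch 2.x tree and skips the crawl if the launcher is not found.

```shell
#!/bin/sh
# Sketch of a single Nutch 2.x crawl round (inject, generate, fetch, parse, updatedb).
# NUTCH_BIN and SEED_DIR are hypothetical defaults; adjust to the local layout.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"
SEED_DIR="${SEED_DIR:-urls}"

# Seed list: one URL per line.
mkdir -p "$SEED_DIR"
echo "http://nutch.apache.org/" > "$SEED_DIR/seed.txt"

run_round() {
  # Each step must succeed before the next one runs.
  "$NUTCH_BIN" inject "$SEED_DIR" &&
  "$NUTCH_BIN" generate -topN 10 &&
  "$NUTCH_BIN" fetch -all &&
  "$NUTCH_BIN" parse -all &&
  "$NUTCH_BIN" updatedb -all
}

if [ -x "$NUTCH_BIN" ]; then
  if ! run_round; then
    echo "Crawl round failed; last lines of logs/hadoop.log:" >&2
    # The real exception behind "job failed" is usually logged here.
    if [ -f logs/hadoop.log ]; then
      tail -n 50 logs/hadoop.log >&2
    fi
  fi
else
  echo "Nutch launcher not found at $NUTCH_BIN; skipping crawl round"
fi
```

Using `fetch -all` instead of a specific batch ID avoids having to copy the ID printed by the generate step. If the round succeeds with the HBase store configured, the crawled pages typically end up in an HBase table named `webpage`, which can be inspected from the HBase shell.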