Searching for a Crawler compatible with HDP

New Contributor

Hi there,

I just started with Hadoop for a project where I have to crawl and scrape quite a lot of pages every day. I tried setting up the newest Nutch version with HDP 2.3.2, but dependency conflicts make it quite hard to get Nutch working with HDP. Is there an alternative to Nutch for crawling web pages that is compatible with HDP? It would be nice if someone could share their experience with the alternatives, or suggest a good replacement for Nutch.

Thanks in advance

7 REPLIES
Re: Searching for a Crawler compatible with HDP

HDP Search only includes Solr, but I have used Nutch -> Solr -> HDP/HDFS before and it worked. @Sebastian Droeppelmann what dependency issues did you run into?

Re: Searching for a Crawler compatible with HDP

New Contributor

The problem is that I want to save the whole page so I can post-process it. So I think the way to go is Nutch -> HBase if I want to keep the page content plus the crawler metadata; correct me if I'm wrong. The error I initially ran into was

java.lang.ClassNotFoundException: org.apache.gora.hbase.store.HBaseStore
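A ClassNotFoundException like this usually means that no jar on the job's classpath contains the class. One way to check is to scan a library directory for the class entry. The following is a minimal sketch, not part of Nutch or Gora; the function name `jars_containing` and the directory you pass in are hypothetical, and it only looks at top-level entries of each jar (not jars nested inside a `.job` file).

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, lib_dir):
    """Return the file names of jars under lib_dir that contain class_name.

    class_name is a fully qualified Java class name, for example
    "org.apache.gora.hbase.store.HBaseStore".
    """
    # Inside a jar, a class is stored as a path with slashes and a .class suffix.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that have a .jar suffix but are not valid archives.
            continue
    return hits
```

An empty result for the directory that ends up in the job would be consistent with the error above.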

I was able to fix this by building the software against the complete Hortonworks stack and, for testing, including all the libs in the Nutch job. After that the problem changed to:

java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HTableDescriptor.addFamily(Lorg/apache/hadoop/hbase/HColumnDescriptor;)V

which I think comes from an incompatibility between gora-0.6.1 and HBase 1.1.2.2.3.2.0-2950. From reading a lot about Nutch builds, I get the feeling that things are quite shaky unless you use the exact versions.
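For reading the signature in that NoSuchMethodError: the part in parentheses is a JVM method descriptor, and the trailing `V` is the return type (void). The JVM resolves a method by name plus the full descriptor, return type included, so a caller compiled against a version where `addFamily` returned void will fail if the class on the classpath declares it with a different return type. The sketch below only decodes the descriptor; the helper names are made up for illustration.

```python
# JVM single-letter codes for primitive types and void.
PRIMITIVES = {
    "V": "void", "Z": "boolean", "B": "byte", "C": "char", "S": "short",
    "I": "int", "J": "long", "F": "float", "D": "double",
}


def _read_type(desc, i):
    """Decode one type starting at desc[i]; return (java_name, next_index)."""
    dims = 0
    while desc[i] == "[":  # each leading '[' adds one array dimension
        dims += 1
        i += 1
    if desc[i] == "L":  # object type: L<binary/name>;
        end = desc.index(";", i)
        name = desc[i + 1:end].replace("/", ".")
        i = end + 1
    else:
        name = PRIMITIVES[desc[i]]
        i += 1
    return name + "[]" * dims, i


def parse_descriptor(desc):
    """Split a JVM method descriptor into (parameter types, return type)."""
    if not desc.startswith("("):
        raise ValueError("not a method descriptor: " + desc)
    close = desc.index(")")
    params = []
    i = 1
    while i < close:
        type_name, i = _read_type(desc, i)
        params.append(type_name)
    return_type, _ = _read_type(desc, close + 1)
    return params, return_type


params, ret = parse_descriptor("(Lorg/apache/hadoop/hbase/HColumnDescriptor;)V")
print(params, ret)
# prints: ['org.apache.hadoop.hbase.HColumnDescriptor'] void
```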

I imagine it would be much easier if I could save the whole page via Solr.

Re: Searching for a Crawler compatible with HDP

Mentor
@Sebastian Droeppelmann

The error on org.apache.hadoop.hbase.HTableDescriptor signifies that the calling code uses deprecated HBase APIs. Convert your HBase code to the 1.x API; you need the hbase-client dependency at version 1.1.2.

Re: Searching for a Crawler compatible with HDP

New Contributor

Did you make any progress? I see that the Nutch 2.3.1 release supports Apache Spark 1.4.1 as a backend (supported by Gora). Does this look like a feasible approach?

Re: Searching for a Crawler compatible with HDP

New Contributor

If you want to use Nutch with HDP, you should find an HDP version that ships HBase 0.98.

Nutch can run on HBase through Gora 0.6.1, but Gora 0.6.1 does not support HBase 1.x yet.

You can modify the Gora code or wait for the Gora 0.7 release.
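To check which upstream HBase version an HDP component string refers to, the string can be split into its parts. This is a rough sketch that assumes the layout seen earlier in this thread (`1.1.2.2.3.2.0-2950`): three components for the upstream version, four for the HDP release, and a build number after the hyphen. The function names are hypothetical, and the 0.98 check simply encodes the compatibility statement above rather than anything verified against Gora itself.

```python
def split_hdp_version(version):
    """Split e.g. '1.1.2.2.3.2.0-2950' into (upstream, hdp_release, build).

    Assumed layout: <3-part upstream version>.<4-part HDP release>-<build>.
    """
    numbers, _, build = version.partition("-")
    parts = numbers.split(".")
    if len(parts) != 7:
        raise ValueError("unexpected version layout: " + version)
    upstream = tuple(int(p) for p in parts[:3])
    hdp_release = ".".join(parts[3:])
    return upstream, hdp_release, build


def works_with_gora_061(hbase_version):
    """True if the upstream HBase line is 0.98, per the reply above."""
    upstream, _, _ = split_hdp_version(hbase_version)
    return upstream[:2] == (0, 98)


print(split_hdp_version("1.1.2.2.3.2.0-2950"))
# prints: ((1, 1, 2), '2.3.2.0', '2950')
print(works_with_gora_061("1.1.2.2.3.2.0-2950"))
# prints: False
```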

Re: Searching for a Crawler compatible with HDP

New Contributor

You can index raw HTML content into Solr or Elasticsearch with the latest version of Nutch by using the -addBinaryContent parameter:

https://issues.apache.org/jira/browse/NUTCH-1785

https://issues.apache.org/jira/browse/NUTCH-2254
