I just started with Hadoop for a project where I have to crawl and scrape a large number of pages every day. I tried setting up the newest Nutch version with HDP 2.3.2, but dependency conflicts make it quite hard to get Nutch working with HDP. Is there an alternative to Nutch for crawling web pages that is compatible with HDP? It would be nice if someone could share their experience with the available alternatives, or suggest a good replacement for Nutch.
Thanks in advance
The problem is that I want to save the whole page so I can post-process it. So I think the way to go is Nutch -> HBase if I want to keep the page content plus the crawler metadata; correct me if I'm wrong. The error I initially ran into was:
I was able to fix this by building the software against the complete Hortonworks stack and, for testing, including all the libs in the Nutch job. After that the problem changed to:
which I think comes from some incompatibility between gora-0.6.1 and hbase 1.1.2.2.3.2.0-2950. From reading a lot about the Nutch builds, I got the feeling that things are quite shaky if you don't use the exact versions.
It would be way easier, I imagine, if I could save the whole site via Solr.
The error indicates that you are using deprecated HBase APIs. Convert your HBase code to the 1.x API. You need the hbase-client dependency at version 1.1.2.
Did you make any progress? I see that the Nutch 2.3.1 release supports Apache Spark 1.4.1 as a backend (supported by Gora). Does this look like a feasible approach?
If you want to use Nutch with HDP, you should find an HDP version that ships HBase 0.98.
Nutch can run on HBase through Gora 0.6.1, but Gora 0.6.1 does not support HBase 1.x yet.
You can modify the Gora code or wait for the Gora 0.7 release.
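To make the version check above a bit more concrete, here is a minimal Python sketch that splits an HDP-style component version into its upstream part and tests it against the "HBase 0.98 only" rule from this answer. The layout it assumes (three upstream components, then a four-part HDP release and a build number) and the function names `split_hdp_version` and `gora_061_supports` are my own assumptions for illustration, not something Nutch, Gora, or HDP provides.

```python
import re

# Assumed layout of an HDP component version (an assumption, not an official spec):
#   <upstream x.y.z>.<HDP release a.b.c.d>-<build>
_HDP_VERSION = re.compile(r"^(\d+\.\d+\.\d+)\.(\d+\.\d+\.\d+\.\d+)-(\d+)$")


def split_hdp_version(version):
    """Return (upstream, hdp_release, build) for an HDP-style version string."""
    match = _HDP_VERSION.match(version)
    if match is None:
        raise ValueError(f"not an HDP-style version: {version!r}")
    return match.group(1), match.group(2), match.group(3)


def gora_061_supports(hbase_version):
    """True if the upstream HBase line is 0.98, the rule stated in this thread."""
    upstream, _, _ = split_hdp_version(hbase_version)
    major, minor = upstream.split(".")[:2]
    return (major, minor) == ("0", "98")
```

Under these assumptions, passing a string in the style of `1.1.2.2.3.2.0-2950` to `split_hdp_version` yields an upstream version of `1.1.2`, and `gora_061_supports` returns `False` for it, which matches the incompatibility described in the question.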