<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Running a web scraper on Hadoop in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111936#M38380</link>
    <description>&lt;P&gt;You can take a look at the Apache Nutch project: &lt;A href="https://nutch.apache.org/" target="_blank"&gt;https://nutch.apache.org/&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 23 Aug 2016 20:33:07 GMT</pubDate>
    <dc:creator>Enis</dc:creator>
    <dc:date>2016-08-23T20:33:07Z</dc:date>
    <item>
      <title>Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111933#M38377</link>
      <description>&lt;P&gt;For a use case, I am looking to scrape the prices and additional information of around 25,000 items from a specific website. The names of these items are kept in a separate list, and the scraped prices and additional information then have to be joined back to that list of item names.&lt;/P&gt;&lt;P&gt;How can this best be implemented on Hadoop? I thought about running Scrapy [1] on PySpark and then writing a script that joins the prices to the item names. Is this possible?&lt;/P&gt;&lt;P&gt;I realize Hadoop is not strictly needed for such a small job, but I want to get to know the Hadoop ecosystem better (I'm a Hadoop beginner).&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;Nicolas&lt;/P&gt;&lt;P&gt;[1] &lt;A href="http://scrapy.org/" target="_blank"&gt;http://scrapy.org/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 19 Aug 2016 19:59:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111933#M38377</guid>
      <dc:creator>depaepe_nicolas</dc:creator>
      <dc:date>2016-08-19T19:59:15Z</dc:date>
    </item>
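The join step the question describes (item names on one list, scraped prices on another) can be sketched in plain Python; the item names and scraped values below are made-up placeholders, not data from the thread:

```python
# Hypothetical data: the real list would hold ~25,000 item names.
item_names = ["widget-a", "widget-b", "widget-c"]

# Results as a scraper might emit them: name -> scraped fields.
scraped = {
    "widget-a": {"price": 9.99, "stock": "in stock"},
    "widget-c": {"price": 4.50, "stock": "sold out"},
}

# Left join: keep every item name, attach scraped fields when present.
joined = [
    {"name": name, **scraped.get(name, {"price": None, "stock": None})}
    for name in item_names
]

for row in joined:
    print(row)
```

The same left-join shape carries over directly to a Spark DataFrame join or a Hive `LEFT OUTER JOIN` if the job is later moved onto the cluster.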
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111934#M38378</link>
      <description>&lt;P&gt;Scrapy is great for gathering the data. From there you can push the data into Kafka using Python's kafka package, or write it to HDFS using Pydoop:&lt;/P&gt;&lt;H4&gt;&lt;A href="https://community.hortonworks.com/repos/30886/python-hadoop-api-pydoop.html"&gt;Python Hadoop API (PYDOOP)&lt;/A&gt;&lt;/H4&gt;&lt;P&gt;You can then create a Hive table over the data and write SQL queries to process it further.&lt;/P&gt;&lt;P&gt;You should get the HDP sandbox and run it in VirtualBox to try out the different routes and see what works best for you. There are a ton of tutorials on here to help you begin exploring Hadoop and get up to speed with all the tools.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Aug 2016 00:19:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111934#M38378</guid>
      <dc:creator>vnv</dc:creator>
      <dc:date>2016-08-23T00:19:57Z</dc:date>
    </item>
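The "scrape, land the records, query with Hive" flow in the reply above can be sketched with the standard library alone. Newline-delimited JSON is a layout Hive can read via a JSON SerDe; the file name and records here are invented, and on a real cluster the plain `open()` would be replaced by a Pydoop HDFS file handle (or an `hdfs dfs -put` after writing locally):

```python
import json
import os
import tempfile

# Hypothetical scraped records; a Scrapy pipeline would yield these.
records = [
    {"name": "widget-a", "price": 9.99},
    {"name": "widget-b", "price": 4.50},
]

# Write newline-delimited JSON. On a cluster, pydoop.hdfs.open()
# would replace the plain open() here, and the target would be a
# directory backing a Hive external table.
out_path = os.path.join(tempfile.mkdtemp(), "items.jsonl")
with open(out_path, "w") as out:
    for rec in records:
        out.write(json.dumps(rec) + "\n")
```

One record per line keeps the file splittable, so Hive or Spark can process it in parallel later.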
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111935#M38379</link>
      <description>&lt;P&gt;And would you recommend running Scrapy in a PySpark environment?&lt;/P&gt;&lt;P&gt;I will have a look at the Pydoop API, thanks for the recommendation.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Aug 2016 14:01:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111935#M38379</guid>
      <dc:creator>depaepe_nicolas</dc:creator>
      <dc:date>2016-08-23T14:01:58Z</dc:date>
    </item>
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111936#M38380</link>
      <description>&lt;P&gt;You can take a look at the Apache Nutch project: &lt;A href="https://nutch.apache.org/" target="_blank"&gt;https://nutch.apache.org/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Aug 2016 20:33:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111936#M38380</guid>
      <dc:creator>Enis</dc:creator>
      <dc:date>2016-08-23T20:33:07Z</dc:date>
    </item>
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111937#M38381</link>
      <description>&lt;P&gt;Well, there isn't a right or wrong answer here. If you want to perform the joins and processing with Spark, then PySpark would work very well. However, if you just want to run your "regular" Python app, then running it directly on your Linux HDP machine is also a good approach. PySpark is just a way to leverage Spark through Python; if you aren't planning to use Spark in your app, it is overkill.&lt;/P&gt;&lt;P&gt;If you want to make progress and stick with your Python approach, stick with what you have. That said, I looked at the Apache Nutch link below from Enis and it looks very cool! I recommend trying the Nutch-crawler-to-Solr tutorial; Solr will let you search and build dashboards on top of the data you crawl.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Aug 2016 21:38:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111937#M38381</guid>
      <dc:creator>vnv</dc:creator>
      <dc:date>2016-08-23T21:38:56Z</dc:date>
    </item>
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111938#M38382</link>
      <description>&lt;P&gt;I agree with Vasilis and Enis; both Scrapy and Nutch would be great projects to check out.&lt;/P&gt;&lt;P&gt;Parsing the HTML (i.e., extracting the price, item name, date, etc.) is one of the more challenging parts of web crawling. These projects have some built-in functionality for it, and you may also want to check out &lt;A href="https://docs.python.org/3/library/html.parser.html"&gt;html.parser&lt;/A&gt; or &lt;A href="http://lxml.de/"&gt;lxml&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;NOTE: If you want to ship external Python packages to all nodes in your cluster via PySpark, you will need to reference the .py or .zip file in your code, e.g. &lt;STRONG&gt;sc.addPyFile("xx.zip")&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;Once you have parsed the data, the records can be sent to HDFS, stored in HBase, Hive, etc.&lt;/P&gt;</description>
      <pubDate>Tue, 30 Aug 2016 01:35:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111938#M38382</guid>
      <dc:creator>dzaratsian</dc:creator>
      <dc:date>2016-08-30T01:35:38Z</dc:date>
    </item>
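The stdlib html.parser route mentioned above, as a minimal sketch: the HTML fragment and the "price" class name are invented for illustration, since real product pages differ per site:

```python
from html.parser import HTMLParser

# Hypothetical product page fragment; real markup will differ per site.
HTML = '<div><span class="price">9.99</span><span class="name">widget-a</span></div>'

class PriceParser(HTMLParser):
    """Collect the text of every element whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        self._in_price = dict(attrs).get("class") == "price"

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data)
            self._in_price = False

parser = PriceParser()
parser.feed(HTML)
print(parser.prices)  # → ['9.99']
```

For messier real-world markup, lxml (or BeautifulSoup on top of it) tolerates broken HTML better than a hand-rolled HTMLParser subclass, which is why the post suggests it as an alternative.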
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111939#M38383</link>
      <description>&lt;P&gt;I decided to stick with regular Python; there wasn't really a need for Spark. As I had to get results, I didn't even use Scrapy or Nutch, but I will certainly have a look at them. They look very interesting!&lt;/P&gt;</description>
      <pubDate>Tue, 30 Aug 2016 13:52:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111939#M38383</guid>
      <dc:creator>depaepe_nicolas</dc:creator>
      <dc:date>2016-08-30T13:52:23Z</dc:date>
    </item>
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111940#M38384</link>
      <description>&lt;P&gt;I started out using lxml, but now I am using Python's Selenium package.&lt;/P&gt;&lt;P&gt;That note might help me out in the future, thanks for that!&lt;/P&gt;</description>
      <pubDate>Tue, 30 Aug 2016 13:55:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111940#M38384</guid>
      <dc:creator>depaepe_nicolas</dc:creator>
      <dc:date>2016-08-30T13:55:21Z</dc:date>
    </item>
  </channel>
</rss>