
Running a web scraper on Hadoop


For a use case, I want to scrape the prices and additional information of around 25,000 items from a specific website. The names of these items are kept in a separate list; the resulting prices and additional information then have to be joined back to that list of item names.

How can this best be implemented in Hadoop? I thought about using Scrapy [1] on PySpark and then writing a script to join the prices with the item names. Is this possible?

I suppose Hadoop is not strictly needed for such a small job, but I want to get to know the Hadoop ecosystem better (I'm a Hadoop beginner).

Thanks!

Nicolas

[1] http://scrapy.org/


7 REPLIES

Guru

Scrapy is great for gathering the data. From there you can push the data into Kafka using Python's kafka package, or write it to HDFS using Pydoop.

Python Hadoop API (PYDOOP)
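
As a rough sketch of both options (the broker address, topic name, record fields, and HDFS path below are made-up placeholders, not anything from your setup):

    import json

    from kafka import KafkaProducer   # kafka-python package
    import pydoop.hdfs as hdfs        # Pydoop's HDFS API

    # Pretend these records came out of the scraper (field names are invented)
    records = [
        {"item": "example-item-1", "price": 19.99},
        {"item": "example-item-2", "price": 4.50},
    ]

    # Option 1: push each record to a Kafka topic as JSON
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for rec in records:
        producer.send("item-prices", json.dumps(rec).encode("utf-8"))
    producer.flush()

    # Option 2: write the records to a file on HDFS as JSON lines
    with hdfs.open("/user/nicolas/scraped/prices.jsonl", "wt") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")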

From there you can create a Hive table and write SQL queries to further process the data.
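
For instance, assuming the scraped records ended up on HDFS as JSON lines (the path below is the made-up one from the sketch above) and a Spark build with Hive support, registering them as a Hive table from PySpark could look roughly like this:

    from pyspark.sql import SparkSession

    # Hive support lets saveAsTable() and spark.sql() talk to the Hive metastore
    spark = (SparkSession.builder
             .appName("scraped-prices")
             .enableHiveSupport()
             .getOrCreate())

    # Read the JSON-lines file and register it as a Hive table
    prices = spark.read.json("/user/nicolas/scraped/prices.jsonl")
    prices.write.mode("overwrite").saveAsTable("scraped_prices")

    # From here on it is plain SQL
    spark.sql("SELECT item, price FROM scraped_prices ORDER BY price DESC LIMIT 10").show()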

You should get the HDP sandbox and create a VirtualBox environment to try out different routes and see what works best for you. There are a ton of tutorials on here to help you begin to explore Hadoop and get up to speed with all the tools.


And would you recommend running Scrapy in a PySpark environment?

I will have a look at the Pydoop API, thanks for the recommendation.

Guru

Well, there isn't a right or wrong answer there. If you want to perform the joins and processing with Spark, then PySpark would be a very good fit. However, if you just want to run your "regular" Python app, then running it directly on your Linux HDP machine is also a good approach. PySpark is just a way to leverage Spark through Python; if you aren't planning to use Spark in your app, it's overkill.
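
To make the Spark option concrete, here is a minimal sketch of the join from the original question (the paths, the CSV header, and the "item" join column are all assumptions for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("price-join").getOrCreate()

    # The original list of ~25,000 item names (assumed to be a CSV with a header)
    items = spark.read.csv("/user/nicolas/item_names.csv", header=True)

    # The scraped results, e.g. a JSON-lines file with "item" and "price" fields
    prices = spark.read.json("/user/nicolas/scraped/prices.jsonl")

    # A left join keeps items for which the scraper found no price
    enriched = items.join(prices, on="item", how="left")
    enriched.write.mode("overwrite").csv("/user/nicolas/items_with_prices")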

If you are trying to make progress and want to stick with your Python approach, stay with what you have. That said, I looked at the Apache Nutch link below from Enis and it looks very cool! I recommend trying the Nutch crawler-to-Solr tutorial; Solr will let you search and build dashboards on top of the data you crawl.


I decided to stick with regular Python; there wasn't really a need for Spark. Since I had to get results, I didn't even use Scrapy or Nutch, but I will certainly have a look at them. They look very interesting!

Guru

You can take a look at the Apache Nutch project: https://nutch.apache.org/


I agree with Vasilis and Enis; both Scrapy and Nutch would be great projects to check out.

Parsing the HTML (i.e. extracting prices, item names, dates, etc.) is one of the more challenging parts of web crawling. Both projects have some built-in functionality for this, and you may also want to check out html.parser or lxml.
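
As a tiny illustration with lxml (the markup, class names, and fields below are invented; a real page will look different):

    from lxml import html

    # Invented example markup; in practice this comes from the HTTP response
    page = """
    <html><body>
      <div class="product">
        <span class="name">Example item</span>
        <span class="price">19.99 EUR</span>
      </div>
    </body></html>
    """

    tree = html.fromstring(page)
    name = tree.xpath('//span[@class="name"]/text()')[0].strip()
    price = tree.xpath('//span[@class="price"]/text()')[0].strip()
    print(name, price)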

NOTE: If you want to ship external Python packages to all nodes in your cluster via PySpark, you will need to reference the .py or .zip file in your code, for example with sc.addPyFile("xx.zip").
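
For example (the zip name, the module inside it, and the input/output paths are placeholders), shipping a parsing helper and using it in a PySpark job might look like this:

    from pyspark import SparkContext

    sc = SparkContext(appName="scrape-parse")

    # Distribute the zipped package to every executor
    sc.addPyFile("parsers.zip")

    def parse_page(raw_html):
        # Import inside the function so it resolves on the executors,
        # after addPyFile has shipped the zip
        from myparsers import extract_price   # hypothetical module inside parsers.zip
        return extract_price(raw_html)

    # wholeTextFiles yields (path, file contents) pairs, one per crawled page
    pages = sc.wholeTextFiles("/user/nicolas/raw_pages")
    prices = pages.mapValues(parse_page)
    prices.saveAsTextFile("/user/nicolas/parsed_prices")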

Once you have parsed the data, the records can be written to HDFS, stored in HBase or Hive, etc.


I started with lxml, but now I am using the Selenium package for Python.
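
A minimal sketch of that approach (the URL and CSS selector are placeholders, and a matching browser driver such as geckodriver needs to be installed):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()   # needs geckodriver on the PATH
    try:
        driver.get("https://example.com/items/example-item")   # placeholder URL
        price = driver.find_element(By.CSS_SELECTOR, ".price").text
        print(price)
    finally:
        driver.quit()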

That note might help me out in the future, thanks for that!