
Running a web scraper on Hadoop


For a use case, I want to scrape the prices and additional information of around 25,000 items from a specific website. The names of these items are kept in a separate list; the resulting prices and additional information then have to be joined back to that list of item names.

How can this best be implemented in Hadoop? I thought about using Scrapy [1] on PySpark and then writing a script to join the prices with the item names. Is this possible?

I suppose Hadoop is not strictly needed for such a small job, but I want to get to know the Hadoop ecosystem better (I'm a Hadoop beginner).

Thanks!

Nicolas

[1] http://scrapy.org/


7 REPLIES

Guru

Scrapy is great for gathering the data. From there you can push the data into Kafka using Python's kafka package, or write it to HDFS using Pydoop.

Python Hadoop API (PYDOOP)
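
As a rough sketch of both options (the broker address, topic name, record fields, and HDFS path below are made-up placeholders, not anything from your setup):

    import json

    from kafka import KafkaProducer   # kafka-python package
    import pydoop.hdfs as hdfs        # Pydoop's HDFS API

    # Pretend these records came out of the scraper (field names are invented)
    records = [
        {"item": "example-item-1", "price": 19.99},
        {"item": "example-item-2", "price": 4.50},
    ]

    # Option 1: push each record to a Kafka topic as JSON
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for rec in records:
        producer.send("item-prices", json.dumps(rec).encode("utf-8"))
    producer.flush()

    # Option 2: write the records to a file on HDFS as JSON lines
    with hdfs.open("/user/nicolas/scraped/prices.jsonl", "wt") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")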

From there you can create a Hive table and write SQL queries to further process the data.
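
For instance, assuming the scraped records ended up on HDFS as JSON lines (the path below is the made-up one from the sketch above) and a Spark build with Hive support, registering them as a Hive table from PySpark could look roughly like this:

    from pyspark.sql import SparkSession

    # Hive support lets saveAsTable() and spark.sql() talk to the Hive metastore
    spark = (SparkSession.builder
             .appName("scraped-prices")
             .enableHiveSupport()
             .getOrCreate())

    # Read the JSON-lines file and register it as a Hive table
    prices = spark.read.json("/user/nicolas/scraped/prices.jsonl")
    prices.write.mode("overwrite").saveAsTable("scraped_prices")

    # From here on it is plain SQL
    spark.sql("SELECT item, price FROM scraped_prices ORDER BY price DESC LIMIT 10").show()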

You should get the HDP sandbox and create a VirtualBox environment to try out different routes and see what works best for you. There are a ton of tutorials on here to help you begin to explore Hadoop and get up to speed with all the tools.


And would you recommend running Scrapy in a PySpark environment?

I will have a look at the Pydoop API, thanks for the recommendation.

Guru

Well, there isn't a right or wrong answer there. If you want to perform the joins and processing with Spark, then PySpark would be a very good fit. However, if you just want to run your "regular" Python app, then running it directly on your Linux HDP machine is also a good approach. PySpark is just a way to leverage Spark through Python; if you aren't planning to use Spark in your app, it's overkill.
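
To make the Spark option concrete, here is a minimal sketch of the join from the original question (the paths, the CSV header, and the "item" join column are all assumptions for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("price-join").getOrCreate()

    # The original list of ~25,000 item names (assumed to be a CSV with a header)
    items = spark.read.csv("/user/nicolas/item_names.csv", header=True)

    # The scraped results, e.g. a JSON-lines file with "item" and "price" fields
    prices = spark.read.json("/user/nicolas/scraped/prices.jsonl")

    # A left join keeps items for which the scraper found no price
    enriched = items.join(prices, on="item", how="left")
    enriched.write.mode("overwrite").csv("/user/nicolas/items_with_prices")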

If you are trying to make progress and want to stick with your Python approach, stay with what you have. That said, I looked at the Apache Nutch link below from Enis and it looks very cool! I recommend trying the Nutch crawler-to-Solr tutorial; Solr will let you search and build dashboards on top of the data you crawl.


I decided to stick with regular Python; there wasn't really a need for Spark. Since I had to get results, I didn't even use Scrapy or Nutch, but I will certainly have a look at them. They look very interesting!

Guru

You can take a look at the Apache Nutch project: https://nutch.apache.org/


I agree with Vasilis and Enis; both Scrapy and Nutch would be great projects to check out.

Parsing the HTML (i.e. extracting prices, item names, dates, etc.) is one of the more challenging parts of web crawling. Both projects have some built-in functionality for this, and you may also want to check out html.parser or lxml.
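
As a tiny illustration with lxml (the markup, class names, and fields below are invented; a real page will look different):

    from lxml import html

    # Invented example markup; in practice this comes from the HTTP response
    page = """
    <html><body>
      <div class="product">
        <span class="name">Example item</span>
        <span class="price">19.99 EUR</span>
      </div>
    </body></html>
    """

    tree = html.fromstring(page)
    name = tree.xpath('//span[@class="name"]/text()')[0].strip()
    price = tree.xpath('//span[@class="price"]/text()')[0].strip()
    print(name, price)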

NOTE: If you want to ship external Python packages to all nodes in your cluster via PySpark, you will need to reference the .py or .zip file in your code, for example with sc.addPyFile("xx.zip").
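
For example (the zip name, the module inside it, and the input/output paths are placeholders), shipping a parsing helper and using it in a PySpark job might look like this:

    from pyspark import SparkContext

    sc = SparkContext(appName="scrape-parse")

    # Distribute the zipped package to every executor
    sc.addPyFile("parsers.zip")

    def parse_page(raw_html):
        # Import inside the function so it resolves on the executors,
        # after addPyFile has shipped the zip
        from myparsers import extract_price   # hypothetical module inside parsers.zip
        return extract_price(raw_html)

    # wholeTextFiles yields (path, file contents) pairs, one per crawled page
    pages = sc.wholeTextFiles("/user/nicolas/raw_pages")
    prices = pages.mapValues(parse_page)
    prices.saveAsTextFile("/user/nicolas/parsed_prices")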

Once you have parsed the data, the records can be written to HDFS, stored in HBase or Hive, etc.


I started with lxml, but now I am using the Selenium package for Python.
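
A minimal sketch of that approach (the URL and CSS selector are placeholders, and a matching browser driver such as geckodriver needs to be installed):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()   # needs geckodriver on the PATH
    try:
        driver.get("https://example.com/items/example-item")   # placeholder URL
        price = driver.find_element(By.CSS_SELECTOR, ".price").text
        print(price)
    finally:
        driver.quit()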

That note might help me out in the future, thanks for that!