<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Running a web scraper on Hadoop in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111936#M38380</link>
    <description>&lt;P&gt;You can take a look at the Apache Nutch project: &lt;A href="https://nutch.apache.org/" target="_blank"&gt;https://nutch.apache.org/&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 23 Aug 2016 20:33:07 GMT</pubDate>
    <dc:creator>Enis</dc:creator>
    <dc:date>2016-08-23T20:33:07Z</dc:date>
    <item>
      <title>Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111933#M38377</link>
      <description>&lt;P&gt;For a use case, I am looking to scrape the prices and additional information of around 25,000 items from a specific website. The names of these items are kept in a separate list, and the scraped prices and additional information then have to be joined back to that list of item names.&lt;/P&gt;&lt;P&gt;How can this best be implemented on Hadoop? I thought about running Scrapy [1] on PySpark and then writing a script that joins the prices to the item names. Is this possible?&lt;/P&gt;&lt;P&gt;I realize Hadoop is not strictly needed for such a small job, but I want to get to know the Hadoop ecosystem better (I'm a Hadoop beginner).&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;Nicolas&lt;/P&gt;&lt;P&gt;[1] &lt;A href="http://scrapy.org/" target="_blank"&gt;http://scrapy.org/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 19 Aug 2016 19:59:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111933#M38377</guid>
      <dc:creator>depaepe_nicolas</dc:creator>
      <dc:date>2016-08-19T19:59:15Z</dc:date>
    </item>
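The join step the question describes (item names on one list, scraped prices on another) can be sketched in plain Python; the item names and scraped values below are made-up placeholders, not data from the thread:

```python
# Hypothetical data: the real list would hold ~25,000 item names.
item_names = ["widget-a", "widget-b", "widget-c"]

# Results as a scraper might emit them: name -> scraped fields.
scraped = {
    "widget-a": {"price": 9.99, "stock": "in stock"},
    "widget-c": {"price": 4.50, "stock": "sold out"},
}

# Left join: keep every item name, attach scraped fields when present.
joined = [
    {"name": name, **scraped.get(name, {"price": None, "stock": None})}
    for name in item_names
]

for row in joined:
    print(row)
```

The same left-join shape carries over directly to a Spark DataFrame join or a Hive `LEFT OUTER JOIN` if the job is later moved onto the cluster.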
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111934#M38378</link>
      <description>&lt;P&gt;Scrapy is great for gathering the data. From there you can push the data into Kafka using Python's kafka package, or write it to HDFS using Pydoop:&lt;/P&gt;&lt;H4&gt;&lt;A href="https://community.hortonworks.com/repos/30886/python-hadoop-api-pydoop.html"&gt;Python Hadoop API (PYDOOP)&lt;/A&gt;&lt;/H4&gt;&lt;P&gt;You can then create a Hive table over the data and write SQL queries to process it further.&lt;/P&gt;&lt;P&gt;You should get the HDP sandbox and run it in VirtualBox to try out the different routes and see what works best for you. There are a ton of tutorials on here to help you begin exploring Hadoop and get up to speed with all the tools.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Aug 2016 00:19:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111934#M38378</guid>
      <dc:creator>vnv</dc:creator>
      <dc:date>2016-08-23T00:19:57Z</dc:date>
    </item>
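The "scrape, land the records, query with Hive" flow in the reply above can be sketched with the standard library alone. Newline-delimited JSON is a layout Hive can read via a JSON SerDe; the file name and records here are invented, and on a real cluster the plain `open()` would be replaced by a Pydoop HDFS file handle (or an `hdfs dfs -put` after writing locally):

```python
import json
import os
import tempfile

# Hypothetical scraped records; a Scrapy pipeline would yield these.
records = [
    {"name": "widget-a", "price": 9.99},
    {"name": "widget-b", "price": 4.50},
]

# Write newline-delimited JSON. On a cluster, pydoop.hdfs.open()
# would replace the plain open() here, and the target would be a
# directory backing a Hive external table.
out_path = os.path.join(tempfile.mkdtemp(), "items.jsonl")
with open(out_path, "w") as out:
    for rec in records:
        out.write(json.dumps(rec) + "\n")
```

One record per line keeps the file splittable, so Hive or Spark can process it in parallel later.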
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111935#M38379</link>
      <description>&lt;P&gt;And would you recommend running Scrapy in a PySpark environment?&lt;/P&gt;&lt;P&gt;I will have a look at the Pydoop API, thanks for the recommendation.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Aug 2016 14:01:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111935#M38379</guid>
      <dc:creator>depaepe_nicolas</dc:creator>
      <dc:date>2016-08-23T14:01:58Z</dc:date>
    </item>
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111936#M38380</link>
      <description>&lt;P&gt;You can take a look at the Apache Nutch project: &lt;A href="https://nutch.apache.org/" target="_blank"&gt;https://nutch.apache.org/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Aug 2016 20:33:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111936#M38380</guid>
      <dc:creator>Enis</dc:creator>
      <dc:date>2016-08-23T20:33:07Z</dc:date>
    </item>
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111937#M38381</link>
      <description>&lt;P&gt;Well, there isn't a right or wrong answer here. If you want to perform the joins and processing with Spark, then PySpark would work very well. However, if you just want to run your "regular" Python app, then running it directly on your Linux HDP machine is also a good approach. PySpark is just a way to leverage Spark through Python; if you aren't planning to use Spark in your app, it is overkill.&lt;/P&gt;&lt;P&gt;If you want to make progress and stick with your Python approach, stick with what you have. That said, I looked at the Apache Nutch link below from Enis and it looks very cool! I recommend trying the Nutch-crawler-to-Solr tutorial; Solr will let you search and build dashboards on top of the data you crawl.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Aug 2016 21:38:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111937#M38381</guid>
      <dc:creator>vnv</dc:creator>
      <dc:date>2016-08-23T21:38:56Z</dc:date>
    </item>
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111938#M38382</link>
      <description>&lt;P&gt;I agree with Vasilis and Enis; both Scrapy and Nutch would be great projects to check out.&lt;/P&gt;&lt;P&gt;Parsing the HTML (i.e., extracting the price, item name, date, etc.) is one of the more challenging parts of web crawling. These projects have some built-in functionality for it, and you may also want to check out &lt;A href="https://docs.python.org/3/library/html.parser.html"&gt;html.parser&lt;/A&gt; or &lt;A href="http://lxml.de/"&gt;lxml&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;NOTE: If you want to ship external Python packages to all nodes in your cluster via PySpark, you will need to reference the .py or .zip file in your code, e.g. &lt;STRONG&gt;sc.addPyFile("xx.zip")&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;Once you have parsed the data, the records can be sent to HDFS, stored in HBase, Hive, etc.&lt;/P&gt;</description>
      <pubDate>Tue, 30 Aug 2016 01:35:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111938#M38382</guid>
      <dc:creator>dzaratsian</dc:creator>
      <dc:date>2016-08-30T01:35:38Z</dc:date>
    </item>
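The stdlib html.parser route mentioned above, as a minimal sketch: the HTML fragment and the "price" class name are invented for illustration, since real product pages differ per site:

```python
from html.parser import HTMLParser

# Hypothetical product page fragment; real markup will differ per site.
HTML = '<div><span class="price">9.99</span><span class="name">widget-a</span></div>'

class PriceParser(HTMLParser):
    """Collect the text of every element whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        self._in_price = dict(attrs).get("class") == "price"

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data)
            self._in_price = False

parser = PriceParser()
parser.feed(HTML)
print(parser.prices)  # → ['9.99']
```

For messier real-world markup, lxml (or BeautifulSoup on top of it) tolerates broken HTML better than a hand-rolled HTMLParser subclass, which is why the post suggests it as an alternative.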
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111939#M38383</link>
      <description>&lt;P&gt;I decided to stick with regular Python; there wasn't really a need for Spark. As I had to get results, I didn't even use Scrapy or Nutch, but I will certainly have a look at them. They look very interesting!&lt;/P&gt;</description>
      <pubDate>Tue, 30 Aug 2016 13:52:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111939#M38383</guid>
      <dc:creator>depaepe_nicolas</dc:creator>
      <dc:date>2016-08-30T13:52:23Z</dc:date>
    </item>
    <item>
      <title>Re: Running a web scraper on Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111940#M38384</link>
      <description>&lt;P&gt;I started out using lxml, but now I am using Python's Selenium package.&lt;/P&gt;&lt;P&gt;That note might help me out in the future, thanks for that!&lt;/P&gt;</description>
      <pubDate>Tue, 30 Aug 2016 13:55:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Running-a-web-scraper-on-Hadoop/m-p/111940#M38384</guid>
      <dc:creator>depaepe_nicolas</dc:creator>
      <dc:date>2016-08-30T13:55:21Z</dc:date>
    </item>
  </channel>
</rss>