<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Loading data to HDFS - Pig or Spark? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Loading-data-to-HDFS-Pig-or-Spark/m-p/130192#M31325</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11031/m2014227.html" nodeid="11031"&gt;@Johnny Fugers&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Have you considered using NiFi to load the data? You can read from many different sources, merge the content into files large enough for efficient HDFS storage, and write the data directly into HDFS.&lt;/P&gt;</description>
    <pubDate>Thu, 09 Jun 2016 21:08:52 GMT</pubDate>
    <dc:creator>emaxwell</dc:creator>
    <dc:date>2016-06-09T21:08:52Z</dc:date>
    <item>
      <title>Loading data to HDFS - Pig or Spark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Loading-data-to-HDFS-Pig-or-Spark/m-p/130191#M31324</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I want to load some .csv files into HDFS. I have already decided that, as the next step, I want to do some data transformation with Spark. My question is: is there any advantage to using Pig instead of Spark to load the data into HDFS?&lt;/P&gt;&lt;P&gt;Many thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jun 2016 21:04:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Loading-data-to-HDFS-Pig-or-Spark/m-p/130191#M31324</guid>
      <dc:creator>m2014227</dc:creator>
      <dc:date>2016-06-09T21:04:30Z</dc:date>
    </item>
    <item>
      <title>Re: Loading data to HDFS - Pig or Spark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Loading-data-to-HDFS-Pig-or-Spark/m-p/130192#M31325</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11031/m2014227.html" nodeid="11031"&gt;@Johnny Fugers&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Have you considered using NiFi to load the data? You can read from many different sources, merge the content into files large enough for efficient HDFS storage, and write the data directly into HDFS.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jun 2016 21:08:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Loading-data-to-HDFS-Pig-or-Spark/m-p/130192#M31325</guid>
      <dc:creator>emaxwell</dc:creator>
      <dc:date>2016-06-09T21:08:52Z</dc:date>
    </item>
    <item>
      <title>Re: Loading data to HDFS - Pig or Spark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Loading-data-to-HDFS-Pig-or-Spark/m-p/130193#M31326</link>
      <description>&lt;P&gt;In my opinion, there are two advantages to using Pig for data loading:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;You are more comfortable programming in Pig.&lt;/LI&gt;&lt;LI&gt;You have existing User Defined Functions (UDFs) that you want to use.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Outside of that, you should be able to do what you want with Spark, with faster execution times and arguably much more flexibility. One advantage of Spark is that you can use Java, Scala, or Python as your language of choice. You can also use SQL with Spark, which is something you can't do with Pig.&lt;/P&gt;&lt;P&gt;If you are starting from scratch, give Spark a try.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jun 2016 21:11:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Loading-data-to-HDFS-Pig-or-Spark/m-p/130193#M31326</guid>
      <dc:creator>myoung</dc:creator>
      <dc:date>2016-06-09T21:11:43Z</dc:date>
    </item>
    <item>
      <title>Re: Loading data to HDFS - Pig or Spark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Loading-data-to-HDFS-Pig-or-Spark/m-p/130194#M31327</link>
      <description>From what I have read, it is a good choice to use the same tool for all the steps inside Hadoop. If NiFi gives me those advantages, I will study it further &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; Basically, in your opinion I should use:
&lt;UL&gt;&lt;LI&gt;NiFi to load data into HDFS&lt;/LI&gt;&lt;LI&gt;Spark to do some data transformation (or maybe load data into Hive)
&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thanks! &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jun 2016 21:12:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Loading-data-to-HDFS-Pig-or-Spark/m-p/130194#M31327</guid>
      <dc:creator>m2014227</dc:creator>
      <dc:date>2016-06-09T21:12:33Z</dc:date>
    </item>
    <item>
      <title>Re: Loading data to HDFS - Pig or Spark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Loading-data-to-HDFS-Pig-or-Spark/m-p/130195#M31328</link>
      <description>&lt;P&gt;Each of the components within the Hadoop stack has advantages and disadvantages for various tasks. My recommendation is to use the best tool for each job while keeping the overall complexity to a minimum. If you try to use "the same tool for all the steps," as you read, you may find that your process works but is not optimal.&lt;/P&gt;&lt;P&gt;NiFi is an excellent approach as well, as emaxwell suggested.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jun 2016 21:24:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Loading-data-to-HDFS-Pig-or-Spark/m-p/130195#M31328</guid>
      <dc:creator>myoung</dc:creator>
      <dc:date>2016-06-09T21:24:13Z</dc:date>
    </item>
  </channel>
</rss>

