<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Load data to HDFS &amp; Data Transformation with Spark in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151175#M28545</link>
    <description>&lt;P&gt;As for ingestion: Pig is not really used for simple ingestion, and Sqoop is a great tool for importing data from an RDBMS, so "directly in(to) HDFS" seems like the logical answer. If your data is on an edge/ingestion node, you can easily script the load with the hadoop fs "put" command (&lt;A href="https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put" target="_blank"&gt;https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put&lt;/A&gt;), e.g. &lt;CODE&gt;hadoop fs -put localfile /user/hadoop/dir&lt;/CODE&gt;; that can be a simple &amp;amp; effective way to get your data loaded into HDFS.&lt;/P&gt;&lt;P&gt;As for whether Spark is a good option for data transformation (I'm going to side-step the "segmentation" term, as it means a lot of different things to a lot of different people &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;), I'd say this is really a matter of style, experience, and the results of POC testing based on your data &amp;amp; processing profile. So, yes, Spark could be an effective transformation engine.&lt;/P&gt;</description>
    <pubDate>Tue, 17 May 2016 06:17:03 GMT</pubDate>
    <dc:creator>LesterMartin</dc:creator>
    <dc:date>2016-05-17T06:17:03Z</dc:date>
    <item>
      <title>Load data to HDFS &amp; Data Transformation with Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151174#M28544</link>
      <description>&lt;P&gt;Hello experts,&lt;/P&gt;&lt;P&gt;I have two simple questions:&lt;/P&gt;&lt;P&gt;In your opinion, what is the best way to load data into HDFS (my source data are txt files)? Pig, Sqoop, directly into HDFS, etc.?

Second question: Is Spark a good option for doing some data transformation and segmentation?

Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 21 Apr 2016 13:30:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151174#M28544</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-04-21T13:30:16Z</dc:date>
    </item>
    <item>
      <title>Re: Load data to HDFS &amp; Data Transformation with Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151175#M28545</link>
      <description>&lt;P&gt;As for ingestion: Pig is not really used for simple ingestion, and Sqoop is a great tool for importing data from an RDBMS, so "directly in(to) HDFS" seems like the logical answer. If your data is on an edge/ingestion node, you can easily script the load with the hadoop fs "put" command (&lt;A href="https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put" target="_blank"&gt;https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put&lt;/A&gt;), e.g. &lt;CODE&gt;hadoop fs -put localfile /user/hadoop/dir&lt;/CODE&gt;; that can be a simple &amp;amp; effective way to get your data loaded into HDFS.&lt;/P&gt;&lt;P&gt;As for whether Spark is a good option for data transformation (I'm going to side-step the "segmentation" term, as it means a lot of different things to a lot of different people &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;), I'd say this is really a matter of style, experience, and the results of POC testing based on your data &amp;amp; processing profile. So, yes, Spark could be an effective transformation engine.&lt;/P&gt;</description>
      <pubDate>Tue, 17 May 2016 06:17:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151175#M28545</guid>
      <dc:creator>LesterMartin</dc:creator>
      <dc:date>2016-05-17T06:17:03Z</dc:date>
    </item>
    <item>
      <title>Re: Load data to HDFS &amp; Data Transformation with Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151176#M28546</link>
      <description>&lt;P&gt;Hi Lester, many thanks for your attention &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; I was thinking of using Sqoop to get my data into the correct format, but I think it will be better, in terms of simplicity and speed, to put the files directly on HDFS.

When I talk about segmentation, I was thinking of cluster analysis: basically dividing the data into smaller data sets. However, I think I can do that in Hive.

Many thanks!!!&lt;/P&gt;</description>
      <pubDate>Tue, 17 May 2016 16:04:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151176#M28546</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-05-17T16:04:58Z</dc:date>
    </item>
    <item>
      <title>Re: Load data to HDFS &amp; Data Transformation with Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151177#M28547</link>
      <description>&lt;P&gt;Other very good ways to load data into HDFS are Flume and NiFi. "hadoop fs put" is good, but it has some limitations, or a lack of flexibility, that might make it difficult to use in a production environment.&lt;/P&gt;&lt;P&gt;If you look at the documentation of the Flume HDFS sink, for instance (&lt;A href="http://flume.apache.org/FlumeUserGuide.html#hdfs-sink"&gt;http://flume.apache.org/FlumeUserGuide.html#hdfs-sink&lt;/A&gt;), you'll see that Flume lets you define how to rotate the files, how to name the files, etc. Other options can be defined for the source (your local text files) or for the channel. "hadoop fs put" is more basic and doesn't offer those possibilities.&lt;/P&gt;</description>
      <pubDate>Thu, 19 May 2016 14:09:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151177#M28547</guid>
      <dc:creator>sluangsay</dc:creator>
      <dc:date>2016-05-19T14:09:48Z</dc:date>
    </item>
  </channel>
</rss>