<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Load data to HDFS &amp; Data Transformation with Spark in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151175#M28545</link>
    <description>&lt;P&gt;As for ingestion: Pig is not really used for simple ingestion, and Sqoop is a great tool for importing data from an RDBMS, so "directly in(to) HDFS" seems like the logical answer. If your data is on an edge/ingestion node, you can easily script the load with the hadoop fs "put" command (&lt;A href="https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put" target="_blank"&gt;https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put&lt;/A&gt;), e.g. &lt;CODE&gt;hadoop fs -put localfile /user/hadoop/dir&lt;/CODE&gt;; that can be a simple &amp;amp; effective way to get your data loaded into HDFS.&lt;/P&gt;&lt;P&gt;As for whether Spark is a good option for data transformation (I'm going to side-step the "segmentation" term, as it means a lot of different things to a lot of different people &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;), I'd say this is really a matter of style, experience, and the results of POC testing based on your data &amp;amp; processing profile. So, yes, Spark could be an effective transformation engine.&lt;/P&gt;</description>
    <pubDate>Tue, 17 May 2016 06:17:03 GMT</pubDate>
    <dc:creator>LesterMartin</dc:creator>
    <dc:date>2016-05-17T06:17:03Z</dc:date>
    <item>
      <title>Load data to HDFS &amp; Data Transformation with Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151174#M28544</link>
      <description>&lt;P&gt;Hello experts,&lt;/P&gt;&lt;P&gt;I have two simple questions:&lt;/P&gt;&lt;P&gt;In your opinion, what is the best way to load data into HDFS (my source data are txt files)? Pig, Sqoop, directly into HDFS, etc.?

Second question: Is Spark a good option for doing some data transformation and segmentation?

Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 21 Apr 2016 13:30:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151174#M28544</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-04-21T13:30:16Z</dc:date>
    </item>
    <item>
      <title>Re: Load data to HDFS &amp; Data Transformation with Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151175#M28545</link>
      <description>&lt;P&gt;As for ingestion: Pig is not really used for simple ingestion, and Sqoop is a great tool for importing data from an RDBMS, so "directly in(to) HDFS" seems like the logical answer. If your data is on an edge/ingestion node, you can easily script the load with the hadoop fs "put" command (&lt;A href="https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put" target="_blank"&gt;https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put&lt;/A&gt;), e.g. &lt;CODE&gt;hadoop fs -put localfile /user/hadoop/dir&lt;/CODE&gt;; that can be a simple &amp;amp; effective way to get your data loaded into HDFS.&lt;/P&gt;&lt;P&gt;As for whether Spark is a good option for data transformation (I'm going to side-step the "segmentation" term, as it means a lot of different things to a lot of different people &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;), I'd say this is really a matter of style, experience, and the results of POC testing based on your data &amp;amp; processing profile. So, yes, Spark could be an effective transformation engine.&lt;/P&gt;</description>
      <pubDate>Tue, 17 May 2016 06:17:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151175#M28545</guid>
      <dc:creator>LesterMartin</dc:creator>
      <dc:date>2016-05-17T06:17:03Z</dc:date>
    </item>
    <item>
      <title>Re: Load data to HDFS &amp; Data Transformation with Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151176#M28546</link>
      <description>&lt;P&gt;Hi Lester, many thanks for your attention &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; I was thinking of using Sqoop to get my data into the correct format, but I think it will be better, in terms of simplicity and speed, to put the files directly on HDFS.

When I talk about segmentation, I was thinking of cluster analysis: basically dividing the data into smaller data sets. However, I think I can do that in Hive.

Many thanks!!!&lt;/P&gt;</description>
      <pubDate>Tue, 17 May 2016 16:04:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151176#M28546</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-05-17T16:04:58Z</dc:date>
    </item>
    <item>
      <title>Re: Load data to HDFS &amp; Data Transformation with Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151177#M28547</link>
      <description>&lt;P&gt;Other very good ways to load data into HDFS are Flume and NiFi. "hadoop fs put" is good, but it has some limitations, or a lack of flexibility, that might make it difficult to use in a production environment.&lt;/P&gt;&lt;P&gt;If you look at the documentation of the Flume HDFS sink, for instance (&lt;A href="http://flume.apache.org/FlumeUserGuide.html#hdfs-sink"&gt;http://flume.apache.org/FlumeUserGuide.html#hdfs-sink&lt;/A&gt;), you'll see that Flume lets you define how to rotate the files, how to name the files, etc. Other options can be defined for the source (your local text files) or for the channel. "hadoop fs put" is more basic and doesn't offer those possibilities.&lt;/P&gt;</description>
      <pubDate>Thu, 19 May 2016 14:09:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Load-data-to-HDFS-Data-Transformation-with-Spark/m-p/151177#M28547</guid>
      <dc:creator>sluangsay</dc:creator>
      <dc:date>2016-05-19T14:09:48Z</dc:date>
    </item>
  </channel>
</rss>