Support Questions

Find answers, ask questions, and share your expertise

Load data to HDFS & Data Transformation with Spark

Rising Star

Hello experts,

I have two simple questions:

In your opinion, what is the best way to load data into HDFS (my source data is txt files)? Pig, Sqoop, directly into HDFS, etc.? Second question: is Spark a good option for doing some data transformation and segmentation? Thanks!

1 ACCEPTED SOLUTION


As for ingestion: Pig is not really used for simple ingestion, and Sqoop is a great tool for importing data from an RDBMS, so "directly into HDFS" seems like the logical answer. If your data is on an edge/ingestion node, you can easily script the load using the hadoop fs "put" command (https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put), which can be a simple and effective way to get your data loaded into HDFS.
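As a rough sketch of that approach (all paths here are illustrative, not from the thread), an edge-node script could look like:

```shell
# Create a target directory in HDFS (illustrative path)
hadoop fs -mkdir -p /data/raw/mysource

# Copy all local .txt files from the edge node into HDFS
hadoop fs -put /home/user/landing/*.txt /data/raw/mysource/

# Verify the files arrived
hadoop fs -ls /data/raw/mysource
```

This is only useful when the files already sit on a node with HDFS client access; for continuous or remote ingestion, other tools (see the later reply) are a better fit.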

As for whether Spark is a good option for data transformation (I'm going to side-step the term "segmentation", as it means a lot of different things to a lot of different people 😉), I'd say this is really a matter of style, experience, and the results of POC testing based on your data and processing profile. So yes, Spark can be an effective transformation engine.
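To make that concrete, here is a minimal PySpark sketch of reading raw text files from HDFS and writing out a transformed, partitioned copy. The paths, delimiter, and column layout are assumptions for illustration only; nothing in the thread specifies the file format.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.appName("txt-transform").getOrCreate()

# Read the raw text files from HDFS (illustrative path)
lines = spark.read.text("hdfs:///data/raw/mysource/*.txt")

# Assume pipe-delimited records: id|category|amount
parts = split(col("value"), r"\|")
records = lines.select(
    parts.getItem(0).alias("id"),
    parts.getItem(1).alias("category"),
    parts.getItem(2).cast("double").alias("amount"),
)

# One simple form of "segmentation": write one subdirectory per category
records.write.partitionBy("category").parquet("hdfs:///data/curated/mysource")
```

Whether this beats doing the same split in Hive is exactly the kind of thing the POC testing mentioned above would settle.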


3 REPLIES 3

(See the accepted solution above.)

Rising Star

Hi Lester, many thanks for your attention 🙂 I was thinking of using Sqoop to get my data into the correct format, but I think it will be better, in terms of simplicity and speed, to put the files directly into HDFS. When I talk about segmentation, I was thinking of cluster analysis: basically dividing the data into smaller data sets. However, I think I can do that in Hive. Many thanks!!!

Super Collaborator

Other very good ways to load data into HDFS are Flume and NiFi. "hadoop fs put" is good, but it has some limitations, or a lack of flexibility, that might make it difficult to use in a production environment.

If you look at the documentation of the Flume HDFS sink, for instance (http://flume.apache.org/FlumeUserGuide.html#hdfs-sink), you'll see that Flume lets you define how to rotate the files, how to write the file names, etc. Other options can be defined for the source (your local text files) or for the channel. "hadoop fs put" is more basic and doesn't offer those possibilities.
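For illustration, a minimal Flume agent configuration using that HDFS sink might look like the fragment below. The agent/component names, directories, and roll thresholds are assumptions, not values from the thread; see the linked user guide for the full option list.

```
# Name the components of the agent (names are illustrative)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Watch a local directory for new text files
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/landing/txt
a1.sources.r1.channels = c1

# Buffer events in memory
a1.channels.c1.type = memory

# Write to HDFS, rolling files every 10 minutes or 128 MB
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /data/raw/flume/%Y/%m/%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

The roll settings (`rollInterval`, `rollSize`, `rollCount`) are exactly the kind of file-rotation control that a plain "hadoop fs put" doesn't give you.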