Load data to HDFS & Data Transformation with Spark

Hello experts,

I have two simple questions:

First, in your opinion, what is the best way to load data into HDFS (my source data are txt files)? Pig, Sqoop, directly into HDFS, etc. Second, is Spark a good option for doing some data transformation and segmentation? Thanks!

1 ACCEPTED SOLUTION

Re: Load data to HDFS & Data Transformation with Spark

As for ingestion, Pig is not really used for simple ingestion, and Sqoop is a great tool for importing data from an RDBMS, so "directly in(to) HDFS" seems like the logical answer. If your data is on an edge/ingestion node, scripting the hadoop fs "put" command, https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put, can be a simple & effective way to get your data loaded into HDFS.
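As a rough illustration of scripting "put" from an edge node, here is a minimal sketch in Python. It assumes the hadoop client is on the PATH of that node; the function names, the "*.txt" pattern and the target directory are hypothetical, and the default dry-run mode only prints the commands instead of running them.

```python
import glob
import os
import subprocess


def build_put_commands(local_dir, hdfs_dir, pattern="*.txt"):
    """Return the hadoop fs commands that would load matching local files into HDFS."""
    files = sorted(glob.glob(os.path.join(local_dir, pattern)))
    if not files:
        return []
    return [
        # Create the target directory (and any parents) if it does not exist yet.
        ["hadoop", "fs", "-mkdir", "-p", hdfs_dir],
        # Copy all matching local files into the target directory in one call.
        ["hadoop", "fs", "-put"] + files + [hdfs_dir],
    ]


def load_to_hdfs(local_dir, hdfs_dir, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    commands = build_put_commands(local_dir, hdfs_dir)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling load_to_hdfs("/path/to/local/files", "/data/raw") would print the two commands; passing dry_run=False would actually run them.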

As for whether Spark is a good option for data transformation (I'm going to side-step the "segmentation" term, as it means a lot of different things to a lot of different people ;-), I'd say this is really a matter of style, experience and the results of proof-of-concept (POC) testing based on your data & processing profile. So, yes, Spark could be an effective transformation engine.
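To give a feel for what a simple Spark transformation over text files in HDFS might look like, here is a small sketch. It assumes PySpark 2.0 or later is installed (the SparkSession entry point does not exist in older releases), and that the input is tab-separated with three fields per line; the function names, field count and paths are hypothetical and would need adjusting to the real data.

```python
def parse_line(line, sep="\t", expected_fields=3):
    """Split one text line into trimmed fields; return None for malformed lines."""
    fields = [f.strip() for f in line.rstrip("\n").split(sep)]
    if len(fields) != expected_fields or not all(fields):
        return None
    return fields


def run_with_spark(input_path, output_path):
    """Read raw text from HDFS, drop malformed lines, and write the rest as CSV text."""
    # Imported lazily so parse_line can be tried out without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("txt-transform").getOrCreate()
    lines = spark.sparkContext.textFile(input_path)
    cleaned = lines.map(parse_line).filter(lambda fields: fields is not None)
    cleaned.map(lambda fields: ",".join(fields)).saveAsTextFile(output_path)
    spark.stop()
```

Whether this beats doing the same in Hive or Pig is exactly the kind of thing a POC on your own data would show.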

3 REPLIES

Re: Load data to HDFS & Data Transformation with Spark

Hi Lester, many thanks for your attention :) I was thinking of using Sqoop to get my data into the correct format, but I think it will be better, in terms of simplicity and speed, to put the files directly on HDFS. When I talked about segmentation, I was thinking of cluster analysis, basically dividing the data into smaller data sets. However, I think I can do that in Hive. Many thanks!!!

Re: Load data to HDFS & Data Transformation with Spark

Other very good ways to load data into HDFS are Flume and NiFi. "hadoop fs -put" is good, but its limitations and lack of flexibility might make it difficult to use in a production environment.

If you look at the documentation of the Flume HDFS sink, for instance ( http://flume.apache.org/FlumeUserGuide.html#hdfs-sink ), you'll see that Flume lets you define how files are rotated, how file names are written, and so on. Other options can be defined for the source (your local text files) or for the channel. "hadoop fs -put" is more basic and doesn't offer those possibilities.
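To make the rotation and file-name options a bit more concrete, here is a small sketch that renders a Flume agent configuration as text. The property keys (spooldir source, memory channel, and the HDFS sink's hdfs.path, hdfs.filePrefix, hdfs.fileType and hdfs.roll* settings) come from the Flume user guide; the agent and component names, the directories and the chosen values are purely illustrative assumptions.

```python
def flume_agent_config(spool_dir, hdfs_path, agent="a1"):
    """Render a minimal Flume agent config: spooling-directory source -> HDFS sink."""
    props = {
        f"{agent}.sources": "src1",
        f"{agent}.channels": "ch1",
        f"{agent}.sinks": "sink1",
        # Source: pick up text files dropped into a local directory.
        f"{agent}.sources.src1.type": "spooldir",
        f"{agent}.sources.src1.spoolDir": spool_dir,
        f"{agent}.sources.src1.channels": "ch1",
        # Channel: in-memory buffer between source and sink.
        f"{agent}.channels.ch1.type": "memory",
        # Sink: write plain text to HDFS with a custom file prefix.
        f"{agent}.sinks.sink1.type": "hdfs",
        f"{agent}.sinks.sink1.channel": "ch1",
        f"{agent}.sinks.sink1.hdfs.path": hdfs_path,
        f"{agent}.sinks.sink1.hdfs.filePrefix": "events",
        f"{agent}.sinks.sink1.hdfs.fileType": "DataStream",
        # Rotation: roll every 300 seconds; 0 disables size- and count-based rolling.
        f"{agent}.sinks.sink1.hdfs.rollInterval": "300",
        f"{agent}.sinks.sink1.hdfs.rollSize": "0",
        f"{agent}.sinks.sink1.hdfs.rollCount": "0",
    }
    return "\n".join(f"{key} = {value}" for key, value in props.items()) + "\n"
```

Printing flume_agent_config("/var/incoming", "hdfs://namenode/data/raw") gives text that could be saved as a properties file and passed to a Flume agent; the host name and paths here are placeholders.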
