Created on 05-16-2016 09:05 PM - edited 09-16-2022 03:20 AM
Hello experts,
I've two simple questions:
In your opinion which is the best way to load data to HDFS (My source data are txt files)? Pig, Sqoop, directly in HDFS, etc. Second question is: Is a good option use Spark to do some data transformation, segmentation? Thanks!
Created 05-16-2016 11:17 PM
As for ingestion, Pig is not really used for simple ingestion and Sqoop is a great tool for importing data from a RDBMS, so the "directly in(to) HDFS" seems like the logical answer and if your data is on an edge/ingestion node where you could easily script just using the hadoop fs "put" command, https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put, can be a simple, novel & effective way to get your data loaded into HDFS.
As for if Spark is a good option for data transformation (I'm going to side-step the "segmentation" term as it means a lot of different things to a lot of different people 😉 I'd say this is really a matter of style, experience and results of POC testing based on your data & processing profile. So, yes, Spark could be an effective transformation engine.
Created 05-16-2016 11:17 PM
As for ingestion, Pig is not really used for simple ingestion and Sqoop is a great tool for importing data from a RDBMS, so the "directly in(to) HDFS" seems like the logical answer and if your data is on an edge/ingestion node where you could easily script just using the hadoop fs "put" command, https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put, can be a simple, novel & effective way to get your data loaded into HDFS.
As for if Spark is a good option for data transformation (I'm going to side-step the "segmentation" term as it means a lot of different things to a lot of different people 😉 I'd say this is really a matter of style, experience and results of POC testing based on your data & processing profile. So, yes, Spark could be an effective transformation engine.
Created 05-17-2016 09:04 AM
Hi Lester, many thanks for your attention 🙂 I was thinking use Sqoop to get the correct format of my data but I think it will be better in terms of simplicity and speed put the files directly on HDFS. When I talk about segmentation, I was thiking in clusters analysis, basically divide the date into more smaller data sets. However, I think I can do that in Hive. Many thanks!!!
Created 05-19-2016 07:09 AM
Other very good ways to load data into HDFS is using Flume or Nifi. "Hadoop fs put" is good but it has some limitation or lack of flexibility that might make it difficult to use it in a production environment.
If you look at the documentation of the Flume HDFS sink for instance ( http://flume.apache.org/FlumeUserGuide.html#hdfs-sink ), you'll see that Flume lets you define how to rotate the files, how to write the file names etc. Other options can be defined for the source (your local text files) or for the channel. "Hadoop fs put" is more basic and doesn't offer those possibilities.