Storing data in HDFS - What's next?

Rising Star

Hi experts,

I was used to the usual data warehousing process: Source Data -> ETL. Now I'm using Hadoop and I'm a bit confused... I have loaded the data into HDFS, but now I would like to understand the data better and apply some segmentations (by profile, for example). I'd like to use Flume, Spark, Impala and Hive, but I can't work out what the role of each one is, or when I should apply each of them. Does anyone have an idea of the usual Big Data process before applying any kind of analytics? Many thanks!!!

1 ACCEPTED SOLUTION

Expert Contributor
(The accepted solution is only visible to registered community members.)
4 REPLIES


Rising Star

Hi Kirk, thank you for your brilliant response. So the data cleansing step happens in Hive and Impala, and only then do we use Spark for analysis. Thanks! 🙂
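
For illustration, a minimal sketch of that flow. The table and column names (raw_customers, customers_clean, customer_id, country, age) are invented for the example, and the cleansing query is submitted through Spark SQL here only to keep everything in one script; the same statement could be run directly in Hive or Impala.

```python
from pyspark.sql import SparkSession

# Hive support lets Spark see the same tables that Hive and Impala use.
spark = (SparkSession.builder
         .appName("hive-cleansing-example")
         .enableHiveSupport()
         .getOrCreate())

# Cleansing expressed as a Hive-style query: drop rows without an id
# and normalise the country column into a new, cleansed table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS customers_clean AS
    SELECT customer_id,
           TRIM(UPPER(country)) AS country,
           CAST(age AS INT)     AS age
    FROM   raw_customers
    WHERE  customer_id IS NOT NULL
""")

# The cleansed table is now queryable from Impala/Hive for exploration
# and from Spark for the later analysis step.
spark.table("customers_clean").show(5)
```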


Hi @Pedro Alves

You can also use Spark for data cleansing and transformation. The advantage is that you use the same tool for data preparation, discovery, and analysis/ML.
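
As a rough sketch of what "same tool for preparation and analysis" can look like in PySpark; the HDFS path and the column names are assumptions made up for the example.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-prep-and-analysis").getOrCreate()

# Preparation: read the raw files from HDFS and cleanse them in Spark itself.
raw = spark.read.option("header", True).csv("hdfs:///data/raw/customers/")
clean = (raw
         .dropna(subset=["customer_id"])                     # drop incomplete rows
         .withColumn("country", F.upper(F.trim("country")))  # normalise text
         .withColumn("age", F.col("age").cast("int")))

# Discovery/analysis with the same API: a simple profile-style segmentation.
(clean.groupBy("country")
      .agg(F.count("*").alias("customers"), F.avg("age").alias("avg_age"))
      .orderBy(F.desc("customers"))
      .show())
```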

Rising Star

Hi Abdelkrim, thanks for your response. In this case I don't have much knowledge about the source data, so what I'm thinking is:

-> Put the data in HDFS
-> Get to know the data with Hive and Impala (simple queries and some new tables for segmentation)
-> Apply some analysis with Spark to identify patterns in the data

In your opinion, is this a good plan? 🙂 Thanks!
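
For the Spark step of that plan, a minimal pattern-finding sketch using MLlib's k-means over one of the segmentation tables; the table name customer_segments, the feature columns, and k=3 are only assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = (SparkSession.builder
         .appName("pattern-analysis")
         .enableHiveSupport()
         .getOrCreate())

# Read one of the segmentation tables prepared earlier with Hive/Impala.
profiles = spark.table("customer_segments").na.drop(subset=["age", "total_spend"])

# Let Spark look for groups (patterns) across the numeric attributes.
assembler = VectorAssembler(inputCols=["age", "total_spend"], outputCol="features")
data = assembler.transform(profiles)

model = KMeans(k=3, seed=42, featuresCol="features").fit(data)

# One row per discovered segment, with the number of customers in each.
model.transform(data).groupBy("prediction").count().show()
```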