Created 04-27-2016 05:28 PM
Hi experts,
I was used to the usual data warehousing process: Source Data -> ETL. Now I'm using Hadoop and I'm a bit confused... I have inserted the data into HDFS and would now like to understand the data better and apply some segmentations (by profile, for example). I'd like to use Flume, Spark, Impala and Hive, but I can't quite work out the role of each tool or when I should apply each of them. Does anyone have an idea of the usual Big Data process before applying any kind of analytics? Many thanks!!!
Created 04-27-2016 08:30 PM
Hi Kirk, thank you for your brilliant response. So, the data cleansing happens with Hive and Impala, and only then do we use Spark for analysis. Thanks! 🙂
Created 04-27-2016 09:01 PM
Hi @Pedro Alves
You can also use Spark for data cleansing and transformation. The advantage is that you use the same tool for data preparation, discovery and analysis/ML.
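To make that concrete, here is a minimal PySpark sketch (Spark 1.x style, using a HiveContext). The table name raw.customers, the column names and the segmentation rule are made-up examples, and it assumes the raw files in HDFS have already been exposed as a Hive external table:

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext(appName="customer-prep")
sqlContext = HiveContext(sc)

# Raw data already sitting in HDFS, assumed to be exposed as a Hive external table
raw = sqlContext.table("raw.customers")

# Cleansing: drop rows with a missing key and obviously bad values
clean = raw.dropna(subset=["customer_id"]) \
           .filter(F.col("age").between(0, 120))

# Transformation: derive a simple profile segment from age
segmented = clean.withColumn(
    "segment",
    F.when(F.col("age") < 30, "young")
     .when(F.col("age") < 60, "adult")
     .otherwise("senior"))

# Persist as a Hive table so it stays queryable from Hive
# (Impala needs an INVALIDATE METADATA before it sees the new table)
segmented.write.saveAsTable("analytics.customers_segmented")

So the cleansing, the transformation and the later analysis can all stay in the same Spark job if you prefer.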
Created 04-28-2016 08:52 AM
Hi Abdelkrim, thanks for your response. In this case I don't have much knowledge about the source data, so what I'm thinking is:
-> Put the data in HDFS
-> Get to know the data with Hive and Impala (simple queries and creating some new tables for segmentation)
-> Apply some analysis with Spark to identify patterns in the data
In your opinion, is this a good plan? :) Thanks!
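Just to check that I understand the flow, here is roughly how I picture the last two steps in code; the database/table names, the columns and the choice of k=4 clusters are purely illustrative assumptions, not something from my real data:

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

sc = SparkContext(appName="profile-patterns")
sqlContext = HiveContext(sc)

# Step 2: exploration/segmentation in SQL (the same query could run in Hive or Impala)
sqlContext.sql("""
    CREATE TABLE IF NOT EXISTS analytics.customer_profiles AS
    SELECT customer_id, age, total_spend, visits
    FROM raw.customers
    WHERE customer_id IS NOT NULL
""")

# Step 3: Spark for pattern discovery, e.g. clustering the profiles
profiles = sqlContext.table("analytics.customer_profiles")
features = VectorAssembler(
    inputCols=["age", "total_spend", "visits"],
    outputCol="features").transform(profiles)

model = KMeans(k=4, featuresCol="features", predictionCol="cluster").fit(features)
clustered = model.transform(features)

# Quick look at how the profiles split across the discovered clusters
clustered.groupBy("cluster").count().show()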