Support Questions

prodgers125 · ‎04-27-2016

Hi experts,

I was used to the usual data warehousing process: Source Data - ETL. Now I'm using Hadoop and I'm a bit confused... I have loaded the data into HDFS, but now I would like to understand the data better and apply some segmentations (by profile, for example). I would like to use Flume, Spark, Impala and Hive, but I can't work out how the role of each one fits together or when I should apply each of them. Does anyone have an idea of the usual process for Big Data before applying any kind of analytics? Many thanks!!!

khaslbeck · ‎04-27-2016

This is a common process that many go through, and there are many ways to skin the cat here. I prefer the methodology below.

1. Bring in the data with minimal transformation: the "E" and "L". Depending on the workload, this could be Sqoop for simple batch, or NiFi for a more modern streaming approach with better control over flow, bi-directional transfer and back pressure.

2. Decide on a transformation strategy and store a higher-level or "enriched" data set, typically in Hive or HBase. Between Atlas and NiFi you should now have some data lineage. Other formatting might take place here, such as moving to native datatypes (dates vs. timestamps). A partitioning strategy would likely be applied here too. Running a data cleansing strategy at this phase is also a good idea, as is computing feature vectors.

3. Use Zeppelin + Spark to analyze the data.
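To make step 2 a little more concrete, here is a minimal sketch of what "cleanse, convert to native datatypes, and pick a partition key" can look like. It is plain Python standing in for whatever would actually run in Hive or Spark, and the field names (`customer_id`, `signup`, `country`), the sample rows and the year-month partition key are all assumptions made up for illustration, not something from the original data.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw records, as they might land in HDFS after step 1 (all strings).
RAW_ROWS = [
    {"customer_id": "17", "signup": "2016-04-01 09:15:00", "country": " pt "},
    {"customer_id": "",   "signup": "2016-04-02 11:00:00", "country": "ES"},  # missing key
    {"customer_id": "42", "signup": "not a date",          "country": "PT"},  # bad timestamp
    {"customer_id": "58", "signup": "2016-03-28 18:40:00", "country": "es"},
]


def cleanse(row):
    """Return a typed, normalized record, or None if the row is unusable."""
    try:
        customer_id = int(row["customer_id"])
        # Convert the string into a native timestamp type.
        signup = datetime.strptime(row["signup"], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None  # drop rows with a missing/non-numeric key or a bad timestamp
    return {
        "customer_id": customer_id,
        "signup": signup,
        "country": row["country"].strip().upper(),
    }


def enrich_and_partition(rows):
    """Group cleansed records under an assumed year-month partition key."""
    partitions = defaultdict(list)
    for row in rows:
        record = cleanse(row)
        if record is None:
            continue
        partitions[record["signup"].strftime("%Y-%m")].append(record)
    return dict(partitions)


if __name__ == "__main__":
    for key, records in sorted(enrich_and_partition(RAW_ROWS).items()):
        print(key, [r["customer_id"] for r in records])
```

On a real cluster the same three ideas (reject or repair bad rows, cast to native types, write by partition) would normally be expressed as Hive SQL or Spark transformations rather than a Python loop.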

prodgers125 · ‎04-27-2016

Hi Kirk, thank you for your brilliant response. So the data cleansing strategy happens with Hive and Impala, and only then do we use Spark for the analysis. Thanks! 🙂

ahadjidj · ‎04-27-2016

Hi @Pedro Alves

You can also use Spark for data cleansing and transformation. The advantage is that you use the same tool for data preparation, discovery and analysis/ML.

prodgers125 · ‎04-28-2016

Hi Abdelkrim, thanks for your response. In this case I don't know much about the source data, so what I'm thinking is:

-> Put the data in HDFS
-> Get to know the data with Hive and Impala (simple queries, and create some new tables for segmentation)
-> Apply some analysis with Spark to identify patterns in the data

In your opinion, is this a good plan? :) Thanks!
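As a rough illustration of the middle step in that plan (simple queries, then a new table for segmentation), here is a small sketch. It uses Python's built-in `sqlite3` purely as a stand-in for Hive or Impala so that it runs anywhere; the `customers` table, its columns, the sample rows and the 500 spend threshold are all hypothetical. The profiling query and the `CREATE TABLE ... AS SELECT` with a `CASE` expression are ordinary SQL, and a similar statement could be written in Hive or Impala, though types and storage options would differ there.

```python
import sqlite3

# In-memory database standing in for a Hive/Impala table over the HDFS data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, age INTEGER, total_spend REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 23, 120.0), (2, 35, 980.5), (3, 41, 45.0), (4, 29, 1500.0), (5, 52, None)],
)

# "Know the data": row count, non-null count for one column, and value ranges.
profile = conn.execute(
    "SELECT COUNT(*), COUNT(total_spend), MIN(age), MAX(age) FROM customers"
).fetchone()

# "Create some new tables for segmentation": derive a segment per customer.
conn.execute(
    """
    CREATE TABLE customer_segments AS
    SELECT customer_id,
           CASE
               WHEN total_spend IS NULL THEN 'unknown'
               WHEN total_spend >= 500 THEN 'high_value'
               ELSE 'standard'
           END AS segment
    FROM customers
    """
)

# Size of each segment, as a dict of segment name to row count.
segment_counts = dict(
    conn.execute(
        "SELECT segment, COUNT(*) FROM customer_segments GROUP BY segment"
    ).fetchall()
)

print(profile)
print(segment_counts)
```

The segmented table is then a natural input for the Spark analysis in the last step of the plan.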

Storage data in HDFS - What's next?
