
Methodology to apply Data Mining in Big Data

Rising Star

Hello, I'm doing my master's thesis and I will apply some data mining techniques to big data. I am building a Gantt chart to organize my methodology, and I have some doubts about how to target the data in Hadoop before applying the mining functions. These are the options:

1) Divide the large amount of data in Hadoop into 3 data sets (training, test, and validation) and then use some data mining tools to analyze the data.

2) Choose one data set from my data and then use that data set in a data mining tool.

Which is the normal process for dividing the data in a big data project? Targeting the data inside Hadoop?

My plan is:

1) Store the data in HDFS
2) Store the data in Hive (with some data transformation)
3) Analyze the data with Spark and target the data
4) Load the data set returned in the previous step into an analytical tool

Do you think this is a good plan? Many thanks!


4 REPLIES

Master Guru

Sounds good to me:

- Load the data into HDFS (potentially use Pig to fix some formatting issues).

- Load the data into Hive for some pre-analysis and data understanding (perhaps together with Eclipse Data Tools or Ambari views).

- Split the data into test and training datasets (using some randomization function in Hive/Pig, or directly in Spark; see the sketch below).

- Run the analysis in Spark MLlib (good choice).

- Investigate the results with visualization tools (Zeppelin works nicely) and rerun the modelling as needed.

- Export the final results into Hive or an external database and use a decent BI tool like Tableau (or, if you want it free, BIRT/Pentaho) to visualize the results.

Sounds like a very good basic workflow.
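
To make the split and modelling steps more concrete, here is a minimal PySpark sketch of a random train/test/validation split followed by a simple MLlib model. The Hive table name events, the feature columns f1/f2/f3, and the label column are hypothetical placeholders, not anything from this thread.

```python
# Minimal sketch: random split + a simple MLlib classifier.
# Table and column names (events, f1/f2/f3, label) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = (SparkSession.builder
         .appName("thesis-mining-workflow")
         .enableHiveSupport()
         .getOrCreate())

# Read the cleansed data that was loaded into Hive.
df = spark.table("events")

# MLlib expects a single vector column of features.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = assembler.transform(df).select("features", "label")

# Randomized 70/20/10 split into training, test and validation sets.
train, test, validation = data.randomSplit([0.7, 0.2, 0.1], seed=42)

# Fit a simple classifier on the training set.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

# Evaluate on the held-out test set (area under the ROC curve).
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")
```

The randomSplit call is one way to do the "randomization function" step directly in Spark instead of in Hive or Pig.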

Rising Star

Just fantastic, Benjamin! 🙂 Many thanks!!!

Contributor

Pedro,

You are thinking correctly. The best way to leverage Hadoop is to store all raw data in HDFS. The goal is to keep the raw data as long as possible (cheap storage) in its original format. Then materialize views of the data in Hive (transformations, cleansing, and aggregations) for SQL workloads. Now you are ready to analyze the data with any enterprise tool that has an ODBC/JDBC interface to connect to Hive (Excel, MicroStrategy, etc.). Spark is also a perfect tool for bringing in Hive data for analysis. Try using the Zeppelin notebook to make it really easy. Any output can be written from Spark back to Hive for consumption by any of the tools mentioned above.
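
For illustration, here is a minimal sketch of that Hive → Spark → Hive round trip. The table names raw_events and event_summary and the columns used are hypothetical placeholders, not anything from this thread.

```python
# Sketch: read a Hive table in Spark, analyze it, write the results back to Hive.
# Table and column names (raw_events, event_summary, category, value) are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("hive-roundtrip")
         .enableHiveSupport()
         .getOrCreate())

# Read the materialized, cleansed data from Hive.
raw = spark.table("raw_events")

# Stand-in for the real analysis: a simple aggregation per category.
summary = (raw.groupBy("category")
              .agg(F.count("*").alias("events"),
                   F.avg("value").alias("avg_value")))

# Write the output back to Hive so any ODBC/JDBC tool can consume it.
summary.write.mode("overwrite").saveAsTable("event_summary")
```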

I hope this helps.

Eric

Rising Star

Thanks, Eric 🙂 I think I will have some "trouble" analyzing and segmenting the data in the Spark step, because I will need to create some rules to make that division.
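
In case it helps, here is a hedged sketch of how such segmentation rules could be expressed directly as Spark column expressions. The toy data and the columns country/amount are made up purely for illustration.

```python
# Sketch: rule-based segmentation with Spark column expressions.
# The toy data and columns (country, amount) are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("segmentation-rules").getOrCreate()

# Toy data standing in for the real Hive-backed DataFrame.
df = spark.createDataFrame(
    [("PT", 1500.0), ("PT", 200.0), ("ES", 3000.0)],
    ["country", "amount"],
)

# A segmentation rule is just a boolean column expression.
rule_high_value = (F.col("amount") > 1000) & (F.col("country") == "PT")

segment_a = df.filter(rule_high_value)    # rows matching the rule
segment_b = df.filter(~rule_high_value)   # everything else
segment_a.show()
```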