Created on 05-16-2016 12:30 PM - edited 09-16-2022 03:19 AM
Hello, I'm doing my master's thesis and will apply some data mining techniques to big data. I am building a Gantt chart to organize my methodology, and I have some doubts about how to target the data in Hadoop before applying the mining functions. The options are:
1) Divide the large amount of data in Hadoop into 3 data sets (training, test and validation) and then use some data mining tools to analyze them.
2) Choose one data set from my data and then use that data set in a data mining tool.
Which is the normal process to divide the data in a big data project? Targeting the data in Hadoop?
My plan is:
1) Store the data in HDFS
2) Store the data in Hive (with some data transformation)
3) Analyze the data with Spark and target the data
4) Load the data set returned in the previous step into an analytical software
Do you think this is a good plan? Many thanks!
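For step 3, this PySpark sketch is roughly what I have in mind for the split (the database, table and column names are just placeholders for my data):

from pyspark.sql import SparkSession

# Session with Hive support so the transformed table from step 2 is visible
spark = SparkSession.builder.appName("thesis-split").enableHiveSupport().getOrCreate()

# Read the transformed data from Hive (hypothetical database/table name)
df = spark.table("thesis_db.cleaned_events")

# Randomly split into training, validation and test sets (60/20/20)
train, validation, test = df.randomSplit([0.6, 0.2, 0.2], seed=42)

# Persist the splits back to Hive so the analytical software in step 4 can read them
train.write.mode("overwrite").saveAsTable("thesis_db.train_set")
validation.write.mode("overwrite").saveAsTable("thesis_db.validation_set")
test.write.mode("overwrite").saveAsTable("thesis_db.test_set")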
Created 05-16-2016 12:46 PM
Sounds good to me:
- Load data into HDFS (potentially use Pig to fix some formatting issues)
- Load data into Hive for some pre-analysis and data understanding (perhaps together with Eclipse Data Tools or Ambari views)
- Split data into test and training datasets (using some randomization function in Hive/Pig, or directly in Spark)
- Run analysis in Spark MLlib (good choice); see the sketch after this list
- Investigate results with visualization tools (Zeppelin works nicely); rerun the modelling as needed
- Export final results into Hive or an external database and use a decent BI tool like Tableau (or, if you want it free, BIRT/Pentaho) to visualize the results
Sounds like a very good basic workflow.
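As a rough sketch of the split + MLlib steps (the table, feature and label column names below are made up, and logistic regression is just one example model):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mllib-sketch").enableHiveSupport().getOrCreate()

# Pull the cleaned data out of Hive (hypothetical table and column names)
df = spark.table("thesis_db.cleaned_events")

# Random 70/30 split into training and test sets
train, test = df.randomSplit([0.7, 0.3], seed=42)

# Assemble the numeric feature columns into the single vector column MLlib expects
assembler = VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(assembler.transform(train))

# Score the held-out test set and report area under the ROC curve
predictions = model.transform(assembler.transform(test))
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print("Test AUC: %.3f" % auc)

You can run something like this interactively in a Zeppelin notebook and iterate on the model from there.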
Created 05-16-2016 12:51 PM
Just fantastic Benjamin! 🙂 Many thanks!!!
Created 05-16-2016 01:07 PM
Pedro,
You are thinking correctly. The best way to leverage Hadoop is to store all raw data in HDFS. The goal is to keep the raw data as long as possible (cheap storage) in its original format. Then materialize views of the data in Hive (transformations, cleansing and aggregations) for SQL workloads. Now you are ready to analyze the data with any enterprise tool that has an ODBC/JDBC interface to connect to Hive (Excel, MicroStrategy, etc.). Spark is also a perfect tool to bring in Hive data for analysis. Try using the Zeppelin Notebook to make it really easy. Any output can be written from Spark back to Hive for consumption by any tool, as mentioned above.
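As a small illustration (the database and table names are made up), reading a materialized Hive view in Spark and writing results back could look roughly like this:

from pyspark.sql import SparkSession

# Hive-enabled session so Spark can read the views materialized in Hive
spark = SparkSession.builder.appName("hive-roundtrip").enableHiveSupport().getOrCreate()

# Read a cleansed/aggregated Hive view (hypothetical names) for analysis
sales = spark.sql("SELECT region, SUM(amount) AS total FROM analytics.sales_clean GROUP BY region")

# Write any Spark output back to Hive so ODBC/JDBC tools can consume it
sales.write.mode("overwrite").saveAsTable("analytics.sales_by_region")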
I hope this helps.
Eric
Created 05-16-2016 01:15 PM
Thanks, Eric 🙂 I think I will have some "trouble" analyzing and segmenting the data in the Spark step, because I will need to create some rules to make that division.
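For example, I imagine expressing those rules as DataFrame filters; something like this sketch (column names, thresholds and segment labels are invented):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("segmentation-rules").enableHiveSupport().getOrCreate()
df = spark.table("thesis_db.cleaned_events")  # hypothetical table

# Each segmentation rule becomes a boolean column expression
high_value = (F.col("total_spent") > 1000) & (F.col("visits") >= 10)
churn_risk = F.col("days_since_last_visit") > 90

# Tag every row with a segment label; rows matching no rule fall into "other"
segmented = df.withColumn(
    "segment",
    F.when(high_value, "high_value").when(churn_risk, "churn_risk").otherwise("other"),
)
segmented.groupBy("segment").count().show()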