
Methodology to apply Data Mining in Big Data

Rising Star

Hello, I'm doing my master's thesis and I will apply some data mining techniques to big data. I am building a Gantt chart to organize my methodology, and I have some doubts about how to target the data in Hadoop before applying the mining functions. These are the options:

1) Divide the large amount of data in Hadoop into 3 data sets (training, test, and validation) and then use some data mining tools to analyze the data.

2) Choose one data set from my data and then use that data set in a data mining tool.

Which is the normal process for dividing the data in a big data project? Targeting the data inside Hadoop?

My plan is:

1) Store the data in HDFS
2) Store the data in Hive (with some data transformation)
3) Analyze the data with Spark and target the data
4) Load the data set returned in the previous step into an analytical tool

Do you think this is a good plan? Many thanks!


4 REPLIES

Master Guru

Sounds good to me:

- Load the data into HDFS (potentially use Pig to fix some formatting issues).

- Load the data into Hive for some pre-analysis and data understanding (perhaps together with Eclipse Data Tools or Ambari views).

- Split the data into test and training datasets (using some randomization function in Hive/Pig, or directly in Spark; see the sketch below).

- Run the analysis in Spark MLlib (good choice).

- Investigate the results with visualization tools (Zeppelin works nicely) and rerun the modelling as needed.

- Export the final results into Hive or an external database and use a decent BI tool like Tableau (or, if you want it free, BIRT/Pentaho) to visualize the results.

Sounds like a very good basic workflow.
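
To make the split and modelling steps more concrete, here is a minimal PySpark sketch of a random train/test/validation split followed by a simple MLlib model. The Hive table name events, the feature columns f1/f2/f3, and the label column are hypothetical placeholders, not anything from this thread.

```python
# Minimal sketch: random split + a simple MLlib classifier.
# Table and column names (events, f1/f2/f3, label) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = (SparkSession.builder
         .appName("thesis-mining-workflow")
         .enableHiveSupport()
         .getOrCreate())

# Read the cleansed data that was loaded into Hive.
df = spark.table("events")

# MLlib expects a single vector column of features.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = assembler.transform(df).select("features", "label")

# Randomized 70/20/10 split into training, test and validation sets.
train, test, validation = data.randomSplit([0.7, 0.2, 0.1], seed=42)

# Fit a simple classifier on the training set.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

# Evaluate on the held-out test set (area under the ROC curve).
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")
```

The randomSplit call is one way to do the "randomization function" step directly in Spark instead of in Hive or Pig.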

Rising Star

Just fantastic, Benjamin! 🙂 Many thanks!!!

Contributor

Pedro,

You are thinking correctly. The best way to leverage Hadoop is to store all raw data in HDFS. The goal is to keep the raw data as long as possible (cheap storage) in its original format. Then materialize views of the data in Hive (transformations, cleansing, and aggregations) for SQL workloads. Now you are ready to analyze the data with any enterprise tool that has an ODBC/JDBC interface to connect to Hive (Excel, MicroStrategy, etc.). Spark is also a perfect tool for bringing in Hive data for analysis. Try using the Zeppelin notebook to make it really easy. Any output can be written from Spark back to Hive for consumption by any of the tools mentioned above.
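
For illustration, here is a minimal sketch of that Hive → Spark → Hive round trip. The table names raw_events and event_summary and the columns used are hypothetical placeholders, not anything from this thread.

```python
# Sketch: read a Hive table in Spark, analyze it, write the results back to Hive.
# Table and column names (raw_events, event_summary, category, value) are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("hive-roundtrip")
         .enableHiveSupport()
         .getOrCreate())

# Read the materialized, cleansed data from Hive.
raw = spark.table("raw_events")

# Stand-in for the real analysis: a simple aggregation per category.
summary = (raw.groupBy("category")
              .agg(F.count("*").alias("events"),
                   F.avg("value").alias("avg_value")))

# Write the output back to Hive so any ODBC/JDBC tool can consume it.
summary.write.mode("overwrite").saveAsTable("event_summary")
```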

I hope this helps.

Eric

Rising Star

Thanks, Eric 🙂 I think I will have some "trouble" analyzing and segmenting the data in the Spark step, because I will need to create some rules to make that division.
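
In case it helps, here is a hedged sketch of how such segmentation rules could be expressed directly as Spark column expressions. The toy data and the columns country/amount are made up purely for illustration.

```python
# Sketch: rule-based segmentation with Spark column expressions.
# The toy data and columns (country, amount) are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("segmentation-rules").getOrCreate()

# Toy data standing in for the real Hive-backed DataFrame.
df = spark.createDataFrame(
    [("PT", 1500.0), ("PT", 200.0), ("ES", 3000.0)],
    ["country", "amount"],
)

# A segmentation rule is just a boolean column expression.
rule_high_value = (F.col("amount") > 1000) & (F.col("country") == "PT")

segment_a = df.filter(rule_high_value)    # rows matching the rule
segment_b = df.filter(~rule_high_value)   # everything else
segment_a.show()
```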