Created 10-07-2016 11:08 AM
I have a big project to analyse a client's history and records in order to predict and manage resources more efficiently, i.e. discovering patterns in the data to gain insight for future predictions. It has to deal with millions of records flowing in from various sources. My question is: what is the best tool / predictive analysis technique that would allow me to achieve this? And how should I deploy that tool so I can get the most out of the data I have?
Any help would be appreciated!
Created 10-07-2016 11:45 AM
If you are using HDP, all of the tools discussed below are deployed when you install the distribution.
Store your data
Definitely store your data in Hadoop. Spend some time thinking about how you will organize this from a file system perspective.
http://hortonworks.com/apache/hdfs/
Sqoop is a fast and effective way to pull your data from relational databases into Hadoop.
http://hortonworks.com/apache/sqoop/
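As a rough sketch (the JDBC URL, credentials, table name, and target directory below are placeholders, not values from your environment), a single sqoop import can land a table in HDFS:

```bash
# Import one table from a relational database into HDFS.
# Connection string, user, table, and target directory are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/clientdb \
  --username etl_user \
  -P \
  --table client_history \
  --target-dir /data/raw/client_history \
  --num-mappers 4
```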
Clean your data
You may need to clean or transform the data after it has landed in Hadoop, e.g. trimming leading and trailing whitespace or removing non-ASCII characters. Pig scripts can do this quickly and effectively. If you do have to clean the data, keep the raw data in one zone (HDFS directory) and write the cleaned data to a destination zone.
http://hortonworks.com/apache/pig/
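For illustration only (the paths, delimiter, and schema are assumptions), a cleanup script of the kind described above might look like this, reading from the raw zone and writing to a clean zone:

```pig
-- Load raw delimited data from the raw zone (schema is illustrative).
raw = LOAD '/data/raw/client_history' USING PigStorage(',')
      AS (client_id:chararray, event_date:chararray, notes:chararray);

-- Trim whitespace and strip non-ASCII (non-printable) characters from the text field.
cleaned = FOREACH raw GENERATE
    client_id,
    TRIM(event_date) AS event_date,
    REPLACE(TRIM(notes), '[^ -~]', '') AS notes;

-- Write the cleaned data to the destination zone; the raw zone stays untouched.
STORE cleaned INTO '/data/clean/client_history' USING PigStorage(',');
```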
Analyze and visualize your data
You most likely will want to use Spark to do your predictive analysis. Spark is deployed with HDP. It is an in-memory processing engine with libraries that make it easy to run SQL and machine learning / predictive analysis against your data. Being in-memory, analysis of GBs of data is very rapid. These libraries are accessed with Java, Scala or Python APIs. (There are also streaming and graph capabilities, but it looks like you will not need these for your analysis.)
https://hortonworks.com/apache/spark/
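As a minimal PySpark sketch of that workflow (the HDFS path, column names, and the choice of linear regression are assumptions for illustration, and it uses the Spark 2.x API):

```python
# Minimal PySpark sketch: read cleaned data, run a SQL aggregation, fit a simple model.
# Paths, column names, and the model choice are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("resource-forecast").getOrCreate()

# Load the cleaned zone as a DataFrame and expose it to SQL.
df = (spark.read.csv("/data/clean/client_history", inferSchema=True)
           .toDF("client_id", "event_date", "usage", "headcount"))
df.createOrReplaceTempView("history")

# SQL step: aggregate per client.
agg = spark.sql("""
    SELECT client_id, AVG(usage) AS avg_usage, MAX(headcount) AS max_headcount
    FROM history
    GROUP BY client_id
""")

# ML step: assemble features and fit a regression predicting average usage.
features = VectorAssembler(inputCols=["max_headcount"], outputCol="features").transform(agg)
model = LinearRegression(featuresCol="features", labelCol="avg_usage").fit(features)
print(model.coefficients, model.intercept)
```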
Zeppelin is an awesome UI for performing Spark analyses. It is a notebook-style UI -- it is browser based and composed of separate "paragraphs", which are areas where you perform separate steps of your analysis. Each paragraph is loaded with an interpreter. These interpreters let you write shell commands directly against the Linux box hosting the Zeppelin server, or run your predictive analysis using Spark's SQL and machine learning libraries. Zeppelin also has easy-to-use visualization capabilities.
https://hortonworks.com/apache/zeppelin/
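For example, three paragraphs in one note could use three different interpreters: %sh to inspect files on HDFS, %pyspark for the analysis code, and %sql for a query whose result Zeppelin charts automatically. The paths and names are placeholders, and the %pyspark/%sql paragraphs assume the Spark 2.x interpreter as in the sketch above:

```
%sh
hdfs dfs -ls /data/clean/client_history

%pyspark
df = (spark.read.csv("/data/clean/client_history", inferSchema=True)
           .toDF("client_id", "event_date", "usage", "headcount"))
df.createOrReplaceTempView("history")

%sql
SELECT client_id, AVG(usage) AS avg_usage
FROM history
GROUP BY client_id
```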
You may want to use Hive to perform complex SQL against your data. Hive is a SQL engine on Hadoop that is very effective at analyzing huge volumes of both structured and unstructured data. (Spark can reach limits at very large data sizes.) For example, you can analyze tweets where fields in the Hive table are JSON strings, or do complex joins across multiple tables. Hive is not as fast as Spark, but it is solid against any volume of data and complexity of query. Having said that, Hive performance has improved greatly in the past few years, largely through the Tez engine, the ORC file format, and in-memory LLAP. You can build Hive tables from Spark and query them from either engine, or build them through Hive and also analyze them in Spark.
http://hortonworks.com/apache/hive/
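A rough HiveQL sketch of the tweets example above (table, column, and JSON field names are assumptions): it stores the raw JSON string in an ORC-backed table and extracts fields with Hive's built-in get_json_object.

```sql
-- Illustrative table where each row's payload is a raw JSON string.
CREATE TABLE IF NOT EXISTS tweets (payload STRING)
STORED AS ORC;

-- Extract individual fields from the JSON string at query time.
SELECT get_json_object(payload, '$.user.screen_name') AS user_name,
       get_json_object(payload, '$.text')             AS tweet_text
FROM tweets
WHERE get_json_object(payload, '$.lang') = 'en';
```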
General
As mentioned, all of the above tools come out of the box with HDP (current version is 2.5). You can run your analysis from either a browser-based UI (Zeppelin, Ambari views) or from the command line of a server in the cluster (you may want to set up a specialized "edge node" to perform analysis from the command line).
Your Approach
It sounds like you are about to launch a very large project. Be sure to start small by working with small samples of your data to learn the technology and to understand how best to design the way you store and analyze the data.
You can get a quick start by downloading the Hortonworks Sandbox and following its tutorials.