We are analyzing several options for storing and processing time series. Data can arrive in batches or as streams, and serves as the input for building statistical models. The objective of this post is to share some initial findings, but also to hear from the community about better approaches to time series on big data, as instantiated on the HDP/HDF stack.
These are the different aspects we aim to address, along with some options we considered:
1) Data ingestion
NiFi + Kafka seem to be the natural options. Not many doubts here, since this combination supersedes older, narrower tools like Flume or Sqoop.
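To make the ingestion side concrete, here is a minimal sketch of how a single time-series reading might be serialized before being published to a Kafka topic. The schema (sensor_id, timestamp, value) and the JSON encoding are assumptions for illustration; in practice the flow would be assembled in NiFi with a PublishKafka processor rather than hand-written code.

```python
import json
from datetime import datetime, timezone

def to_kafka_message(sensor_id, value, ts=None):
    """Serialize one time-series reading as a UTF-8 JSON payload.

    Hypothetical schema: {sensor_id, timestamp, value}.
    Kafka carries raw bytes, so JSON keeps records self-describing.
    """
    ts = ts or datetime.now(timezone.utc)
    payload = {
        "sensor_id": sensor_id,
        "timestamp": ts.isoformat(),
        "value": value,
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")

msg = to_kafka_message("sensor-42", 21.5,
                       datetime(2017, 1, 1, tzinfo=timezone.utc))
```

A schema registry (e.g. Avro) would be the more robust choice at scale; JSON is used here only to keep the sketch readable.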
2) Storage and stream processing
There are a lot of options here, ranging from the most generic (the HDFS + Hive/SparkSQL combo), through enterprise search hybrids (e.g. Solr), to specialized time series databases (like Druid). But which one best balances big-data scale and productivity?
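For the generic HDFS + Hive route, time series are usually laid out with date-based partitions so queries can prune by time range. The following sketch derives such a partition path for a record; the base path and day-level granularity are assumptions, not a prescription.

```python
from datetime import datetime

def partition_path(base, record_ts):
    """Derive a Hive-style partition path (year/month/day) for a record.

    Date partitioning lets Hive/SparkSQL prune irrelevant data when a
    query filters on a time range. Granularity is an assumed choice.
    """
    return "{base}/year={y:04d}/month={m:02d}/day={d:02d}".format(
        base=base, y=record_ts.year, m=record_ts.month, d=record_ts.day)

p = partition_path("/data/sensors", datetime(2017, 3, 5))
```

A specialized store like Druid handles this time-based segmentation internally, which is part of its appeal for this workload.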
Depending on the choice above, we reckon a mixture of Druid and Spark Streaming could work, but more classical streaming engines like Storm or Flink are also worth considering. Too many options.
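Whichever engine is picked, the core streaming operation for time series is usually a windowed aggregation. This single-process sketch computes per-window averages over tumbling windows; it only illustrates the semantics that Spark Streaming, Storm, or Flink would apply over a distributed stream, and the 10-second window is an assumed parameter.

```python
from collections import defaultdict

def tumbling_window_avg(events, window_sec):
    """Average values per fixed (tumbling) time window.

    events: iterable of (epoch_seconds, value) pairs.
    Returns {window_start: average} for each non-empty window.
    """
    acc = defaultdict(lambda: [0.0, 0])  # window_start -> [sum, count]
    for ts, value in events:
        key = int(ts // window_sec) * window_sec
        acc[key][0] += value
        acc[key][1] += 1
    return {k: s / n for k, (s, n) in sorted(acc.items())}

avgs = tumbling_window_avg([(0, 1.0), (5, 3.0), (10, 10.0)], 10)
# window [0,10) -> 2.0, window [10,20) -> 10.0
```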
3) Visualization for analytics
Zeppelin is a nice starting point, but it is geared toward data scientists or expert users.
In order to deliver reports and interactive visualizations to end users, options could be:
- KnowAge (ex-SpagoBI) and Pentaho
4) Predictive analytics
Betting on Spark, it is tempting to think of MLlib as the default choice. However, AFAIK, it seems to have limited support for time series. A nice extension is Spark-TS (also discussed in the Hortonworks Community), but the project has looked rather quiet for almost a year now. On the other hand, it is not clear that SparkR can deliver massively parallel processing for the existing, well-documented R time series libraries. Also, model productionization is an item to consider.
I will greatly appreciate your feedback regarding the best options for the items described above, within the Hortonworks stack.