
Best options within HDP/HDF for handling time series


We are analyzing several options for storing and processing time series. Data can arrive in batch or as a stream, and serves as the input for building statistical models. The objective of this post is to share some initial findings, and also to hear from the community about better approaches to time series on big data, as they could be implemented on the HDP/HDF stack.

These are the aspects we want to cover, and some of the options we have considered:

0) Acquisition

NiFi + Kafka seem to be the natural options. Not many doubts here, since this combination surpasses older specialized tools like Flume or Sqoop.

1) Storage

There are a lot of options here, ranging from the most generic (the HDFS + Hive/SparkSQL combo), through enterprise search hybrids (e.g. Solr), to specialized time series databases (like Druid). But which one copes best with both big data volumes and productivity?

2) Processing

Depending on the storage choice, we reckon there could be a mixture of Druid and Spark Streaming, but more classical streaming approaches like Storm or Flink should also be considered. Too many options.
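To make the comparison a bit more concrete, here is a minimal sketch of the operation all of these engines would have to perform for us in some form: rolling raw events up into fixed (tumbling) time windows per series. It is plain Python with no cluster involved; the function name, the event layout `(epoch_seconds, series_id, value)` and the sample data are assumptions made for illustration, not anything taken from Druid, Spark, Storm or Flink.

```python
from collections import defaultdict


def tumbling_window_rollup(events, window_seconds):
    """Aggregate events into fixed, non-overlapping time windows per series.

    events: iterable of (epoch_seconds, series_id, value) tuples (assumed layout).
    Returns {(window_start, series_id): {"count", "sum", "mean"}}.
    """
    buckets = defaultdict(lambda: [0, 0.0])  # key -> [count, running sum]
    for ts, series_id, value in events:
        # Truncate the timestamp to the start of its window.
        window_start = ts - (ts % window_seconds)
        acc = buckets[(window_start, series_id)]
        acc[0] += 1
        acc[1] += value
    return {
        key: {"count": count, "sum": total, "mean": total / count}
        for key, (count, total) in sorted(buckets.items())
    }


# Hypothetical sample data: two sensors, one-minute windows.
events = [
    (0, "sensor-a", 1.0),
    (30, "sensor-a", 3.0),
    (61, "sensor-a", 5.0),
    (10, "sensor-b", 2.0),
]
rollup = tumbling_window_rollup(events, window_seconds=60)
for key, stats in rollup.items():
    print(key, stats)
```

Whichever engine is chosen, the question is mostly where this rollup runs (at ingestion, in the stream processor, or at query time) and how late or out-of-order events are handled, which this sketch ignores.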

3) Visualization for analytics

Zeppelin is a nice starting point, but is restricted to data scientists or expert users.

In order to deliver reports and interactive visualizations to end users, options could be:

- Knowage (formerly SpagoBI) and Pentaho

- Custom visualizations based on JavaScript libraries (D3, etc.)

- other?

4) Predictive analytics

Betting on Spark, it is tempting to think of MLlib as the natural choice. However, AFAIK, it has limited support for time series. A nice extension is Spark-TS (also discussed in the Hortonworks Community), but the project has looked rather quiet for almost a year. On the other hand, it is not clear that SparkR could deliver massively parallel processing for the existing, well-documented R time series libraries. Putting models into production is another item to consider.

I would greatly appreciate your feedback on the best options for the items described above, within the Hortonworks stack.
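For context, the workload we have in mind looks roughly like the sketch below: many independent series, each fitted with its own small model. It is plain Python and uses a hand-written least-squares AR(1) fit as a stand-in for a real time series library; the function names, the record layout `(series_id, timestamp, value)` and the sample data are all assumptions for illustration. On a cluster, the grouping step would become a grouped operation and each group would be fitted on a worker.

```python
from collections import defaultdict


def fit_ar1(values):
    """Least-squares fit of x[t] = c + phi * x[t-1]. Returns (c, phi)."""
    x_prev, x_next = values[:-1], values[1:]
    n = len(x_prev)
    if n < 2:
        raise ValueError("need at least 3 observations")
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    if var == 0:
        raise ValueError("lagged series is constant; phi is not identifiable")
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    phi = cov / var
    c = mean_next - phi * mean_prev
    return c, phi


def fit_per_series(records):
    """Group (series_id, timestamp, value) records and fit one model per series."""
    grouped = defaultdict(list)
    for series_id, ts, value in records:
        grouped[series_id].append((ts, value))
    models = {}
    for series_id, points in grouped.items():
        points.sort()  # order by timestamp; input may arrive out of order
        models[series_id] = fit_ar1([v for _, v in points])
    return models


# Hypothetical data: "a" follows x[t] = 1 + 0.5 * x[t-1], "b" doubles each step.
records = [
    ("a", 2, 1.5), ("a", 0, 0.0), ("a", 1, 1.0), ("a", 4, 1.875), ("a", 3, 1.75),
    ("b", 0, 1.0), ("b", 1, 2.0), ("b", 2, 4.0), ("b", 3, 8.0),
]
models = fit_per_series(records)
for series_id, (c, phi) in sorted(models.items()):
    print(series_id, round(c, 6), round(phi, 6))
```

This pattern parallelizes well across series but not within a single long series, which is part of why the choice of library matters here.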
