Our use case includes high-scale log ingestion (currently done via a custom log collector with HornetQ), writing protobuf sequence files to HDFS (we are still on HDFS 1.x...), M/R jobs scheduled via SOSScheduler that load a MySQL DB, and queries on that DB.
In parallel we index the Hadoop files into Elasticsearch and query it via a custom Java API.
We want to improve the rate at which we ingest the data, be able to stream it, and replace the heavy M/R jobs with another framework.
Another requirement is being able to query the data across many dimensions, in real time with low latency.
At this meetup (http://www.meetup.com/Big-Data-Israel/events/228152435/) I heard a lecture by Ankur Gupta from Hortonworks that described a solution for the same use case, but I don't have the slides and no way to get them.
It included Hive, HBase, and Phoenix with ORC files, plus Spark SQL.
The solution should enable rapid ingestion and preparation of the data with low-latency queries, without the need to write a lot of code as with M/R.
Can someone help with a possible solution?
You should look into Apache NiFi; it has built-in support for SequenceFiles, HBase, RDBMS, syslog, etc. First, I am not sure we support Hadoop 1.x with NiFi; support for the 1.x line has been dropped there as well. NiFi works with Storm, Spark Streaming, and Flink via its NAR architecture, so you can stream data in and out of NiFi. It's very easy to get started with NiFi (take a look at the available recipes), and it's scalable, so you can achieve high throughput. Additionally, just switching to the Hadoop 2.x line should give you a performance gain, as many enhancements have been made throughout.
@Ankur Gupta Do you have the slides mentioned perhaps?
We are currently implementing a log ingestion use case. We use Kafka together with Spark Streaming for transformations/analytics (checks, filters, short-window aggregations, and potentially lookups).
We use Flume for the ingestion into Kafka, but that is because we consume forwarded syslog data.
HDF might be a better solution if you are starting from scratch: read logs from a spool directory, transform them, and push them into Kafka.
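To make the "checks, filters, short-window aggregations" part concrete, here is a minimal sketch of the per-record logic such a Spark Streaming job might apply to each incoming syslog line. The pipe-delimited line format, class name, and field names are all hypothetical (not from this thread), and the Spark wiring itself is omitted; this is just the pure parse/filter step you would pass to a map/filter stage.

```java
import java.util.Optional;

// Hypothetical per-record logic for a streaming log pipeline:
// parse a "<level>|<user>|<message>" line, drop malformed or DEBUG
// records, and keep the rest for downstream aggregation by user.
public class LogRecordParser {

    public static final class LogEvent {
        public final String level;
        public final String user;
        public final String message;

        LogEvent(String level, String user, String message) {
            this.level = level;
            this.user = user;
            this.message = message;
        }

        @Override
        public String toString() {
            return level + "/" + user + ": " + message;
        }
    }

    // Empty result means "filtered out", mimicking a filter() step.
    public static Optional<LogEvent> parse(String line) {
        String[] parts = line.split("\\|", 3);
        if (parts.length != 3) {
            return Optional.empty();   // malformed record, drop it
        }
        if (parts[0].equals("DEBUG")) {
            return Optional.empty();   // filter out debug noise
        }
        return Optional.of(new LogEvent(parts[0], parts[1], parts[2]));
    }

    public static void main(String[] args) {
        System.out.println(parse("ERROR|alice|disk full"));
        System.out.println(parse("DEBUG|bob|trace info"));
    }
}
```

In the actual job this function would run inside the DStream transformation, with the resulting (user, event) pairs feeding the short-window aggregations.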
For your queries I would suggest Hive/ORC for any aggregation queries with filters, and HBase with Phoenix if you have lookup queries like "all logs for a user".
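A rough sketch of what that split could look like, with a hypothetical log schema (table and column names are illustrative, not from the thread):

```sql
-- Hive + ORC for scan-heavy aggregation queries with filters:
CREATE TABLE logs (
  event_time TIMESTAMP,
  user_id    STRING,
  level      STRING,
  message    STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

SELECT level, COUNT(*) AS cnt
FROM logs
WHERE event_date = '2016-03-01'
GROUP BY level;

-- HBase + Phoenix for key-based lookups, with the lookup column
-- leading the primary key:
CREATE TABLE user_logs (
  user_id    VARCHAR NOT NULL,
  event_time TIMESTAMP NOT NULL,
  message    VARCHAR,
  CONSTRAINT pk PRIMARY KEY (user_id, event_time)
);

-- "All logs for a user" becomes a range scan on the row key:
SELECT * FROM user_logs WHERE user_id = 'alice';
```

The point of the split is that the Hive query scans columnar ORC data efficiently, while the Phoenix query touches only the HBase rows for one user.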
What kind of volumes are we talking about?
@Artem Ervits thanks for the answers!
I am talking about volumes of about 500 million transactions a day.
1. Is HDF already in production environments at this scale?
2. Can one plug a custom processor into NiFi to do the Spark Streaming work, for example?
3. Is there a dataflow template for NiFi for the architecture of log ingestion + transformation + Hive + HBase (I guess the Phoenix and query side is not implemented in the dataflow)? I can't find this kind of template in https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates
Thanks for the answers! The volume can reach ~500,000,000 transactions per day.
Can HDF cope with those volumes? How does it compare to "traditional" tools in the Hadoop ecosystem?
What other processors are planned for NiFi? Is there a pluggable processor one can use to do the Spark Streaming work, for example?
I think for this you need an HDF expert.
I am pretty sure it can handle 500M t/d. We are currently implementing Flume/Kafka/Spark for around 4B t/d, and HDF should be significantly better. I suppose the bottlenecks will rather happen in what you intend to do with the data afterwards. HBase should cope with these volumes as well, and so should Hive, depending on how you use it.
I doubt that you can run Spark Streaming inside HDF itself; I would rather use Kafka as a sink for HDF and then use Storm/Spark as a consumer for all heavy analytics and transformations. But again, Simon would know more.
2. Yes: Spark, Flink, Apex, Storm; see https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
3. Use a combination of the template examples; PutHBase and GetHBase were just added in 0.4. Put and Get for Phoenix is, I think, on the roadmap, but you can use JDBC in the meantime.