I'm new to the Hadoop world and I've been tasked with researching solutions to ingest data from our current JMS queues into our Hadoop cluster.
So far on my quest to become a data ingestion expert... I've scoured the web going through books and tutorials for a couple of weeks now. I've managed to write a simple Java service which listens to one of our queues and writes the incoming messages to an HBase HTable.
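For reference, here is a minimal sketch of the row-key side of such a listener (the queue name, salt bucket count, and key layout are all assumptions, not necessarily your design). Salting the key spreads sequentially arriving messages across HBase regions and avoids write hotspotting on a single region server:

```java
public class RowKeys {
    // Number of salt buckets; a small fixed count is typical (assumption).
    static final int SALT_BUCKETS = 16;

    // Builds a row key of the form <salt>|<queue>|<timestamp>|<messageId>.
    // The zero-padded 13-digit timestamp keeps keys time-sorted within a
    // salt bucket, so date-range "replay" scans stay cheap.
    static String rowKey(String queue, long timestampMs, String messageId) {
        int salt = Math.floorMod(messageId.hashCode(), SALT_BUCKETS);
        return String.format("%02d|%s|%013d|%s", salt, queue, timestampMs, messageId);
    }

    public static void main(String[] args) {
        // Inside a javax.jms.MessageListener#onMessage, this key would
        // become the row of the HBase Put, e.g. new Put(Bytes.toBytes(key)).
        System.out.println(rowKey("ordersQueue", 1700000000000L, "msg-42"));
    }
}
```

Whether you salt at all depends on your read pattern: salting spreads writes evenly, but a date-range read then has to fan out across all buckets.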
After completing this proof of concept I have a couple of questions I would like to ask the community/Hadoop/HBase/data ingestion experts. Before I ask, let me describe a little bit of my scenario/scope.
My "proof of concept" Java service works, but I don't know if this solution is the best for my scenario, especially in a production environment.
Appreciate any input, thanks!
According to this article, "an HBase table becomes less efficient once any cell in the table exceeds 100 KB of data". The article also explains how to work around this with the HBase MOB (medium object) feature.
However, to answer your question, understanding the use case (including the behavior of the consumers of the data) is necessary. Depending on the needs, you might instead go JMS->Kafka->Avro Hive table or JMS->JSON partitioned Hive table via Spark Streaming (and the spark-jms-receiver package).
Again, it really depends on the big picture.
Thanks @clukasik for the input. I'll look into MOB.
Big picture, we want to capture/retain these messages in their raw form in Hadoop so that we can "replay" selected messages or desired date ranges of messages back into our system or a model. HTable-wise, let's assume the worst-case scenario that all messages are 20 MB each. So 20 MB x 30,000 x 365 = 219,000,000 MB (219 terabytes), and each year that would at a minimum double. This might be trivial, but I'm trying to figure out whether an HTable or Hive would be able to handle a table of this size, especially with row cells that are 20 MB (being able to pull a couple hundred raw messages or so easily on the fly).
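For what it's worth, here is that back-of-the-envelope calculation with a 365-day year (the 20 MB message size and 30,000 messages/day are your figures; the rest is just arithmetic):

```java
public class SizingEstimate {
    // Raw yearly volume in MB, before HDFS replication (3x by default)
    // and before any compression, both of which matter at this scale.
    static long yearlyMb(long msgSizeMb, long msgsPerDay) {
        return msgSizeMb * msgsPerDay * 365L;
    }

    public static void main(String[] args) {
        long totalMb = yearlyMb(20L, 30_000L);              // 20 MB x 30,000/day
        System.out.println(totalMb + " MB/year");           // 219,000,000 MB
        System.out.println(totalMb / 1_000_000.0 + " TB/year");
        // With default 3x HDFS replication that is roughly 657 TB of raw disk.
    }
}
```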
You could save the JSON files directly to HDFS and put a Hive external table on top to run SQL queries against the files. With such "small" file sizes, that may not be ideal in terms of HDFS storage, but it could serve simply as a staging approach: you could then migrate the data from the Hive external table into Hive ORC tables.
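A sketch of that staging-then-ORC flow in HiveQL (the table names, columns, and HDFS path are illustrative assumptions, and the JsonSerDe requires the hive-hcatalog jar on the classpath):

```sql
-- Staging: external table over the raw JSON files already landed on HDFS.
CREATE EXTERNAL TABLE raw_messages (msg_id STRING, ts BIGINT, body STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/staging/messages';

-- Periodically compact the staged data into an ORC-backed table,
-- which is far friendlier to HDFS and to query performance.
CREATE TABLE messages_orc STORED AS ORC
AS SELECT * FROM raw_messages;
```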
I like NiFi for JSON ingestion straight to Phoenix, so you get SQL queries and HBase access.
You can also easily and quickly funnel the messages as JSON straight to HDFS with no delay, giving you a simple lambda architecture. You can also funnel the data to other stores. Use a Phoenix upsert if you are overwriting and only care about the current value.
You can funnel to HDFS (fast) and process the data a bit more slowly with another NiFi flow in near real time.
I don't like 20 MB chunks in a single column or cell, ever.
You can also speed up ingest by putting something like Alluxio in front of HDFS. You can also push the stream right to Spark for processing.
Have you considered MQTT or Kafka instead of JMS?