Support Questions
Find answers, ask questions, and share your expertise

Need advice, Saving messages from JMS Queue to Hadoop Hbase a good solution?

New Contributor

I'm new to the Hadoop world and I've been tasked to research solutions to ingest data from our current JMS Queues into our Hadoop cluster.

So far on my quest to becoming a data ingestion expert... I've scoured the web going through books and tutorials for a couple of weeks now. I've managed to write a simple Java Service which listens to one of our Queues and simply write the incoming messages to an HBase HTable.

After completing this proof of concept I have a couple of questions I would like ask the community/Hadoop/Hbase/data ingestion experts. Before I ask let me describe a little bit of my scenario/scope.

  • We receive approximately 30,000 messages per day from our JMS Queue
  • These messages are JSON objects which can range anywhere from 1 MB to 20 MB each
  • Needs to be near real time
  • We would like to continuously save these messages into Hadoop for future analytics and historical reference for years to come
  • We don't need to parse the incoming messages, just store them. (Current line of thinking is to write another Service which will parse these messages and save them into proper schema later. Reason = No bottlenecking during message ingestion)

With my "proof of concept" Java Service, it works, but I don't know if this solution is the best for my case scenario especially in a Production environment.

  1. Is this a good approach/solution for my case scenario?
  2. If not, what other technologies would be a good fit for what I'm trying to do?
  3. Is using HBase for this overkill?
  4. Saving up to 20 MB in a single cell a good idea? Especially if we plan to continuously append messages to this table with no purging?

Appreciate any input, thanks!


Re: Need advice, Saving messages from JMS Queue to Hadoop Hbase a good solution?

Expert Contributor

According to this article "an HBase table becomes less efficient once any cell in the table exceeds 100 KB of data". The article also explains how to work around this.

However, to answer your question, understanding the use case (including the behavior of the consumers of the data) is necessary. Depending on the needs, you might instead go JMS->Kafka->Avro Hive table or JMS->JSON partitioned Hive table via Spark Streaming (and the spark-jms-receiver package).

Again, it really depends on the big picture.

Re: Need advice, Saving messages from JMS Queue to Hadoop Hbase a good solution?

New Contributor

Thanks @clukasik for the input. I'll look into MOB.

Big picture we want to capture/retain these messages in their raw form into Hadoop so that we can "replay" selected messages or desired date ranges of messages back into our system or a model. HTable wise, let’s assume worst case scenario that all messages are 20 MB each. So 20 MB x 30,000 x 356 = 213,600,000 MB (213.6 Terabytes). In which each year would at a minimum would double. This might be trivial, but I'm trying to figure out if a HTable or Hive would be able to handle a table of this size especially with row cells that are 20 MB (Being able to pull a couple 100 raw messages or so easily on the fly).

Re: Need advice, Saving messages from JMS Queue to Hadoop Hbase a good solution?

@Joey Wilkinson

Have you considered using Hortonworks DataFlow or Apache NiFi? This example is JMS to HDFS, but you can adapt it: HCC Article.

You could save the JSON files directly to HDFS and put a Hive external table on top to run SQL queries against the files. With such "small" file sizes, that may not be ideal in terms of HDFS storage. That could simply be a staging approach. You could migrate the data from Hive external tables to Hive Orc tables.

Re: Need advice, Saving messages from JMS Queue to Hadoop Hbase a good solution?

Super Guru

I like NIFI for JSON ingestion right to Phoenix so you can do SQL queries and HBase access:

You can also easily and quickly funnel the messages as JSON straight to HDFS with no delay. So you get a simple lambda. You can also funnel the data to other stores. Phoenix upsert if you are overwriting and only care about current.

You can funnel to HDFS (fast) and process a bit slower with another process in NIFI real-time.

I don't like 20 megabyte chunks in a column or cell anytime.

You can also speed up ingest by putting something like Alluxio in front of HDFS. You can also push the stream right to Spark for processing.

Have you considered MQTT or Kafka instead of JMS?