In this section, we focus on converting JSON to Avro to Apache ORC and on the storage options in Apache Hive 3. I use two styles of storage for one of the tables, rainbow: I store ORC files behind an external table, and I also use the Streaming API to write into an ACID table.
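As a rough illustration of the two storage styles, the HiveQL below sketches one external ORC table and one transactional (ACID) table. The column list, the column types, the `rainbowacid` table name, and the `/rainbow` location are assumptions made for this sketch; the article does not show the real schema.

```sql
-- Style 1: external table over ORC files written to HDFS.
-- Column names are taken from fields mentioned in this article; the types are assumed.
CREATE EXTERNAL TABLE IF NOT EXISTS rainbow (
  ts        STRING,
  tempf     DOUBLE,
  altitude  DOUBLE,
  memory    DOUBLE,
  ipaddress STRING,
  host      STRING,
  diskfree  STRING
)
STORED AS ORC
LOCATION '/rainbow';  -- assumed HDFS directory holding the ORC files

-- Style 2: managed transactional table that a streaming writer can target.
-- 'rainbowacid' is a hypothetical name.
CREATE TABLE IF NOT EXISTS rainbowacid (
  ts        STRING,
  tempf     DOUBLE,
  altitude  DOUBLE,
  memory    DOUBLE,
  ipaddress STRING,
  host      STRING,
  diskfree  STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```

Dropping the external table leaves the ORC files in place, while the ACID table is fully managed by Hive, which is one practical difference to weigh when choosing between the two.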
NiFi - SQL - On Streams - Calcite
WHERE CAST(memory AS FLOAT) > 0
WHERE CAST(tempf AS FLOAT) > 65
I check the flows as they are ingested in real time and filter them on conditions such as memory or temperature. This makes for powerful yet simple event processing, and it is very handy when you want to filter out standard conditions where no anomaly has occurred.
IoT Data Storage Options
For time series data, we are blessed with many options in HDP 3.x. I start with the simplest choice: a plain Apache Hive 3.x table. From there we face some tough decisions about which engine to use. Hive has the best, most complete SQL and lots of interfaces, so it is my default choice for where and how to store my data. If the data arrives at more than a few thousand rows a second and carries a timestamp, then we have to think about the architecture. Apache Druid has a lot of amazing abilities with time series data like what's coming out of these IoT devices. Since we can join Hive and Druid data and put Hive tables on top of Druid, we really should consider using the Druid storage handler.
SELECT ts AS `__time`, CAST(tempf AS STRING) s_tempf,
ipaddress, CAST(altitude AS STRING) s_altitude, host, diskfree,
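To show how a Druid-backed Hive table can be created, here is a minimal sketch of a CREATE TABLE AS SELECT that uses the Druid storage handler. The target table name `rainbow_druid`, the granularity values, and the assumption that `ts` holds a string Hive can cast to a timestamp are all mine, not the article's. Depending on the Hive or HDP version, the statement may need to be `CREATE EXTERNAL TABLE` instead.

```sql
-- Hypothetical target table; Druid requires a timestamp column named __time.
CREATE TABLE rainbow_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.segment.granularity" = "MONTH",  -- assumed segment size
  "druid.query.granularity"   = "DAY"     -- assumed rollup granularity
)
AS
SELECT CAST(ts AS TIMESTAMP)     AS `__time`,   -- assumes ts is a castable string
       CAST(tempf AS STRING)     AS s_tempf,    -- string columns become Druid dimensions
       ipaddress,
       CAST(altitude AS STRING)  AS s_altitude,
       host,
       diskfree
FROM rainbow;
```

Once created, the table can be queried from Hive like any other table, and joined against regular Hive tables.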
For second and sub-second data, we need to consider either Druid or HBase. The nice thing is that these NoSQL options also have SQL interfaces. It comes down to how you are going to query the data and which one you prefer.
HBase + Phoenix is performant and has been used in production for years. HBase 2.x brings really impressive updates that make this a good option.
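For a feel of what the SQL interface over HBase looks like, the Apache Phoenix sketch below creates a table keyed by host and timestamp, writes one row, and reads back a one-second window. The `rainbow_ts` table name, the column types, and the sample values are invented for illustration only.

```sql
-- Apache Phoenix SQL. The composite primary key becomes the HBase row key,
-- so rows for one host are stored in timestamp order.
CREATE TABLE IF NOT EXISTS rainbow_ts (
  host      VARCHAR   NOT NULL,
  ts        TIMESTAMP NOT NULL,
  tempf     DOUBLE,
  altitude  DOUBLE,
  memory    DOUBLE,
  ipaddress VARCHAR,
  diskfree  VARCHAR,
  CONSTRAINT pk PRIMARY KEY (host, ts)
);

-- Phoenix uses UPSERT rather than INSERT; the values here are made up.
UPSERT INTO rainbow_ts (host, ts, tempf, memory)
VALUES ('rainbow', TO_TIMESTAMP('2018-11-01 12:00:00.250'), 71.3, 42.0);

-- Range scan over one second of data for a single host.
SELECT ts, tempf
FROM rainbow_ts
WHERE host = 'rainbow'
  AND ts >= TO_TIMESTAMP('2018-11-01 12:00:00.000')
  AND ts <  TO_TIMESTAMP('2018-11-01 12:00:01.000');
```

Leading the key with `host` keeps each device's readings together; a different key order would suit queries that scan all devices for a time range.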
For richer analytics, including some really cool visualizations with Apache Superset, it's hard not to recommend Druid. Apache Druid has improved a lot recently and is well integrated with Hive 3's rich querying.