IoT Edge Processing with Apache NiFi and MiniFi and Multiple Deep Learning Libraries Series

For: https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/68140

See Part 1: https://community.hortonworks.com/articles/215079/iot-edge-processing-with-deep-learning-on-hdf-32-a...

See Part 2: https://community.hortonworks.com/articles/215258/iot-edge-processing-with-deep-learning-on-hdf-32-a...

[Image: NiFi processing flow (87523-nifiprocessing.jpg)]

Hive - SQL - IoT Data Storage

In this section, we will focus on converting JSON to Apache Avro to Apache ORC and on the storage options in Apache Hive 3. I am using two styles of storage for one of the tables, rainbow: storing ORC files behind an external table, and using the Hive Streaming API to write into an ACID table.

NiFi - SQL on Streams - Calcite

SELECT *
FROM FLOWFILE
WHERE CAST(memory AS FLOAT) > 0 

SELECT *
FROM FLOWFILE
WHERE CAST(tempf AS FLOAT) > 65

I check the flows as they are ingested in real time and filter them based on conditions such as memory or temperature. This makes for powerful yet simple event processing, which is very handy when you want to filter out normal conditions where no anomaly has occurred.
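As a quick sketch (this combined filter is not part of the original flow, and the thresholds are made up), the same SQL-on-flowfile approach can route only the records that look anomalous:

-- Hypothetical combined anomaly filter; thresholds are illustrative only.
SELECT *
FROM FLOWFILE
WHERE CAST(tempf AS FLOAT) > 80
   OR CAST(memory AS FLOAT) > 90
   OR CAST(cputemp AS FLOAT) > 70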

IoT Data Storage Options

For time series data, we are blessed with many options in HDP 3.x. The simplest choice, and the one I am using first here, is a plain Apache Hive 3.x table. The tough decision is which engine to use. Hive has the best and most complete SQL support plus lots of interfaces, so it is my default choice for where and how to store my data. If the data arrives at more than a few thousand rows a second and carries a timestamp, then we have to think harder about the architecture. Apache Druid has a lot of amazing abilities for time series data like what is coming out of these IoT devices. Since we can join Hive and Druid data and put Hive tables on top of Druid, we really should consider using Druid as our storage handler.

https://cwiki.apache.org/confluence/display/Hive/Druid+Integration

https://cwiki.apache.org/confluence/display/Hive/StorageHandlers

https://hortonworks.com/blog/apache-hive-druid-part-1-3/

https://github.com/apache/hive/blob/master/druid-handler/src/java/org/apache/hadoop/hive/druid/Druid...

https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.0/using-druid/content/druid_anatomy_of_hive_t...

We could create a Hive table backed by Druid like this:


CREATE TABLE rainbow_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "MONTH", "druid.query.granularity" = "DAY")
AS
SELECT CAST(ts AS TIMESTAMP) AS `__time`,
       CAST(tempf AS STRING) AS s_tempf,
       ipaddress,
       CAST(altitude AS STRING) AS s_altitude,
       host,
       diskfree
FROM rainbow;
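Because druid.query.granularity is DAY, the `__time` values are already truncated to day boundaries, so a simple GROUP BY gives daily rollups. A hypothetical query against the new table (not from the original article):

-- Hypothetical daily rollup against the Druid-backed table; the time column
-- is already truncated to DAY by druid.query.granularity.
SELECT `__time` AS `day`,
       AVG(CAST(s_tempf AS DOUBLE)) AS avg_tempf,
       COUNT(*) AS readings
FROM rainbow_druid
GROUP BY `__time`;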

For second and sub-second data, we need to consider either Druid or HBase. The nice thing is that these NoSQL options also have SQL interfaces. It comes down to how you are going to query the data and which one you prefer.

HBase + Phoenix is performant and has been used in production for years. With HBase 2.x there are really impressive updates that make it a good option.
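As a sketch of what the Phoenix route could look like (the table name, column choices, and sample row below are hypothetical, not part of this flow):

-- Hypothetical Phoenix table for the same sensor readings, keyed by the
-- device-generated unique id.
CREATE TABLE IF NOT EXISTS rainbow_phoenix (
    uniqueid  VARCHAR NOT NULL PRIMARY KEY,
    ts        VARCHAR,
    host      VARCHAR,
    ipaddress VARCHAR,
    tempf     DOUBLE,
    memory    DOUBLE,
    diskfree  VARCHAR
);

-- Phoenix uses UPSERT rather than INSERT.
UPSERT INTO rainbow_phoenix (uniqueid, ts, host, ipaddress, tempf, memory, diskfree)
VALUES ('gps_uuid_20180830174705', '2018-08-30 17:47:03', 'rainbow', '172.20.10.8', 65.2, 26.5, '4643.2 MB');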

For richer analytics and some really cool dashboards with Apache Superset, it's hard not to recommend Druid. Apache Druid has improved a lot recently and is well integrated with Hive 3's rich querying.

Example of Our Geo Data

{"speed": "0.145", "diskfree": "4643.2 MB", "altitude": "6.2", "ts": "2018-08-30 17:47:03", "cputemp": 52.0, "latitude": "38.9789405", "track": "0.0", "memory": 26.5, "host": "rainbow", "uniqueid": "gps_uuid_20180830174705", "ipaddress": "172.20.10.8", "epd": "nan", "utc": "2018-08-30T17:47:05.000Z", "epx": "21.91", "epy": "31.536", "epv": "73.37", "ept": "0.005", "eps": "63.07", "longitude": "-74.824475167", "mode": "3", "time": "1535651225.0", "climb": "0.0", "epc": "nan"}


Hive 3 Tables

CREATE EXTERNAL TABLE IF NOT EXISTS rainbow (tempf DOUBLE, cputemp DOUBLE, pressure DOUBLE, host STRING, uniqueid STRING, ipaddress STRING, temp DOUBLE, diskfree STRING, altitude DOUBLE, ts STRING, tempf2 DOUBLE, memory DOUBLE) STORED AS ORC LOCATION '/rainbow'

CREATE TABLE rainbowacid (tempf DOUBLE, cputemp DOUBLE, pressure DOUBLE, host STRING, uniqueid STRING, ipaddress STRING, temp DOUBLE, diskfree STRING, altitude DOUBLE, ts STRING, tempf2 DOUBLE, memory DOUBLE) STORED AS ORC TBLPROPERTIES ('transactional'='true')

CREATE EXTERNAL TABLE IF NOT EXISTS gps (speed STRING, diskfree STRING, altitude STRING, ts STRING, cputemp DOUBLE, latitude STRING, track STRING, memory DOUBLE, host STRING, uniqueid STRING, ipaddress STRING, epd STRING, utc STRING, epx STRING, epy STRING, epv STRING, ept STRING, eps STRING, longitude STRING, mode STRING, `time` STRING, climb STRING, epc STRING) STORED AS ORC LOCATION '/gps'

CREATE TABLE IF NOT EXISTS gpsacid (speed STRING, diskfree STRING, altitude STRING, ts STRING, cputemp DOUBLE, latitude STRING, track STRING, memory DOUBLE, host STRING, uniqueid STRING, ipaddress STRING, epd STRING, utc STRING, epx STRING, epy STRING, epv STRING, ept STRING, eps STRING, longitude STRING, mode STRING, `time` STRING, climb STRING, epc STRING) STORED AS ORC TBLPROPERTIES ('transactional'='true')

CREATE EXTERNAL TABLE IF NOT EXISTS movidiussense (label5 STRING, runtime STRING, label1 STRING, diskfree STRING, top1 STRING, starttime STRING, label2 STRING, label3 STRING, top3pct STRING, host STRING, top5pct STRING, humidity DOUBLE, currenttime STRING, roll DOUBLE, uuid STRING, label4 STRING, tempf DOUBLE, y DOUBLE, top4pct STRING, cputemp2 DOUBLE, top5 STRING, top2pct STRING, ipaddress STRING, cputemp INT, pitch DOUBLE, x DOUBLE, z DOUBLE, yaw DOUBLE, pressure DOUBLE, top3 STRING, temp DOUBLE, memory DOUBLE, top4 STRING, imagefilename STRING, top1pct STRING, top2 STRING) STORED AS ORC LOCATION '/movidiussense'

CREATE EXTERNAL TABLE IF NOT EXISTS minitensorflow2 (image STRING, ts STRING, host STRING, score STRING, human_string STRING, node_id INT) STORED AS ORC LOCATION '/minifitensorflow2'
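In this flow, the ACID tables are loaded through the Hive Streaming API; as a hedged alternative (not part of the original article), data already landed in the external ORC table can be copied into the ACID table with a plain INSERT ... SELECT, for example as a one-time backfill:

-- Hypothetical backfill from the external ORC table into the ACID table;
-- the flow itself writes to rainbowacid via the Hive Streaming API.
INSERT INTO rainbowacid
SELECT tempf, cputemp, pressure, host, uniqueid, ipaddress,
       temp, diskfree, altitude, ts, tempf2, memory
FROM rainbow;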

Resources:
