IoT Edge Processing with Apache NiFi and MiNiFi and Multiple Deep Learning Libraries Series

For: https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/68140

87461-strataarchitecture2018.jpg

In preparation for my talk on utilizing edge devices for deep learning, IoT sensor reading, and big data processing, I have updated my environment to the latest and greatest tools available.

With the upgrade to HDF 3.2, I can now use Apache NiFi 1.7 and MiNiFi 0.5 for IoT data ingestion, simple event processing, conversion, data processing, data flow, and storage.

The architecture diagram above shows the basic flow we are utilizing.

IoT Step by Step

  1. Raspberry Pi with the latest patches, Python, GPS software, USB camera, sensor libraries, Java 8, MiNiFi 0.5, TensorFlow, and Apache MXNet installed.
  2. The MiNiFi flow pushes JSON and JPEGs over HTTP(S) / Site-to-Site to an Apache NiFi gateway server.
  3. Optional: NiFi can push to a central NiFi cloud cluster and/or a Kafka cluster, both running in HDF 3.2 environments.
  4. The Apache NiFi cluster pushes to Hive, HDFS, a Dockerized API running in HDP 3.0, and third-party APIs.
  5. NiFi and Kafka integrate with Schema Registry for our tabular data including rainbow and gps JSON data.
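Steps 1 and 2 imply a small Python script on the Raspberry Pi that emits one flat JSON record per reading for the MiNiFi flow to forward. Here is a minimal sketch of that record assembly, assuming the field names match the columns of the rainbow table. The function name `build_record`, the `readings` dict, the sample values, and the formats chosen for `uniqueid`, `ts`, and `diskfree` are all hypothetical; the actual sensor-reading calls are omitted.

```python
import json
import shutil
import socket
import time
import uuid

# Sensor fields the caller is expected to supply (names taken from the rainbow table).
SENSOR_FIELDS = ("tempf", "cputemp", "pressure", "temp", "altitude", "tempf2", "memory")


def build_record(readings):
    """Merge sensor readings with host metadata into one flat, JSON-ready dict."""
    missing = [name for name in SENSOR_FIELDS if name not in readings]
    if missing:
        raise ValueError("missing sensor fields: " + ", ".join(missing))

    host = socket.gethostname()
    try:
        ipaddress = socket.gethostbyname(host)
    except OSError:
        # Name resolution can fail on an offline device; fall back to loopback.
        ipaddress = "127.0.0.1"

    free_mb = shutil.disk_usage("/").free / (1024 * 1024)

    # Numeric sensor values are stored as floats to line up with DOUBLE columns.
    record = {name: float(readings[name]) for name in SENSOR_FIELDS}
    record.update({
        "host": host,
        "uniqueid": str(uuid.uuid4()),            # assumed id format
        "ipaddress": ipaddress,
        "diskfree": "{:.1f} MB".format(free_mb),  # assumed string format
        "ts": time.strftime("%Y-%m-%d %H:%M:%S"),  # assumed timestamp format
    })
    return record


if __name__ == "__main__":
    # Hypothetical readings; a real script would query the attached sensors.
    sample = {"tempf": 75.2, "cputemp": 48.0, "pressure": 1013.2, "temp": 24.0,
              "altitude": 12.5, "tempf2": 75.0, "memory": 21.4}
    print(json.dumps(build_record(sample)))
```

Keeping the record flat, with one key per column, means the downstream flow can convert it to a tabular format without any reshaping.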

SQL Tables in Hive

I stream my data into Apache ORC files stored on HDP 3.0 HDFS directories and build external tables on them.

CREATE EXTERNAL TABLE IF NOT EXISTS rainbow (tempf DOUBLE, cputemp DOUBLE, pressure DOUBLE, host STRING, uniqueid STRING, ipaddress STRING, temp DOUBLE, diskfree STRING, altitude DOUBLE, ts STRING, 
 tempf2 DOUBLE, memory DOUBLE) 
STORED AS ORC LOCATION '/rainbow';

CREATE EXTERNAL TABLE IF NOT EXISTS gps (speed STRING, diskfree STRING, altitude STRING, ts STRING, cputemp DOUBLE, latitude STRING, track STRING, memory DOUBLE, host STRING, uniqueid STRING, ipaddress STRING, epd STRING, utc STRING, epx STRING, epy STRING, epv STRING, ept STRING, eps STRING, longitude STRING, mode STRING, time STRING, climb STRING, epc STRING) 
STORED AS ORC LOCATION '/gps';
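Since the same tabular records are registered in Schema Registry, the rainbow column list above maps naturally onto an Avro record schema. The following is only a sketch: the author's actual registered schemas are not shown, the helper name `hive_columns_to_avro` is hypothetical, and making every field a nullable union with a null default is an assumption.

```python
import json

# Column names and Hive types copied from the rainbow DDL.
RAINBOW_COLUMNS = [
    ("tempf", "DOUBLE"), ("cputemp", "DOUBLE"), ("pressure", "DOUBLE"),
    ("host", "STRING"), ("uniqueid", "STRING"), ("ipaddress", "STRING"),
    ("temp", "DOUBLE"), ("diskfree", "STRING"), ("altitude", "DOUBLE"),
    ("ts", "STRING"), ("tempf2", "DOUBLE"), ("memory", "DOUBLE"),
]

# Only the two Hive types used in these tables are mapped.
HIVE_TO_AVRO = {"DOUBLE": "double", "STRING": "string"}


def hive_columns_to_avro(name, columns):
    """Build an Avro record schema (as a dict) from (column, hive_type) pairs."""
    fields = [
        # Assumption: each field is nullable, so "null" comes first with a null default.
        {"name": col, "type": ["null", HIVE_TO_AVRO[hive_type]], "default": None}
        for col, hive_type in columns
    ]
    return {"type": "record", "name": name, "fields": fields}


if __name__ == "__main__":
    print(json.dumps(hive_columns_to_avro("rainbow", RAINBOW_COLUMNS), indent=2))
```

The same helper would work for the gps table by passing its column list instead.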

For my processing needs I also have Hive 3 ACID tables for general table usage and updates.

CREATE TABLE rainbowacid (tempf DOUBLE, cputemp DOUBLE, pressure DOUBLE, host STRING, uniqueid STRING, ipaddress STRING, temp DOUBLE, diskfree STRING, altitude DOUBLE, ts STRING,
 tempf2 DOUBLE, memory DOUBLE)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

CREATE TABLE IF NOT EXISTS gpsacid (speed STRING, diskfree STRING, altitude STRING, ts STRING, cputemp DOUBLE, latitude STRING, track STRING, memory DOUBLE, host STRING, uniqueid STRING, ipaddress STRING, epd STRING, utc STRING, epx STRING, epy STRING, epv STRING, ept STRING, eps STRING, longitude STRING, mode STRING, time STRING, climb STRING, epc STRING)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

Then I load my initial data.

insert into rainbowacid
select * from rainbow;

insert into gpsacid 
select * from gps;

87419-rainbowupsroadtripnifi.png

Hive 3.x Updates

%jdbc(hive) CREATE TABLE Persons_default (
    ID Int NOT NULL,
    Name String NOT NULL,
    Age Int,
    Creator String DEFAULT CURRENT_USER(),
    CreateDate Date DEFAULT CURRENT_DATE()
)

One of the cool new features in Hive 3 is column defaults, as you can see above. They are helpful for standard values you might want filled in automatically, such as the current user or the current date. This gives us even more relational-style features in Hive.

Another very interesting feature is materialized views, which store the result of a query so that queries and subqueries built on it stay clean and fast. Here is a cool example:

CREATE MATERIALIZED VIEW mv1
AS
SELECT dest,origin,count(*)
FROM flights_hdfs 
GROUP BY dest,origin
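To make concrete what mv1 precomputes, here is a small Python illustration of the same aggregation: one count per (dest, origin) pair. This only mirrors the GROUP BY logic and is not how Hive builds or maintains the view. The function name `flights_by_route` and the sample rows are hypothetical, since the contents of flights_hdfs are not shown.

```python
from collections import Counter


def flights_by_route(rows):
    """Count rows per (dest, origin) pair, mirroring
    SELECT dest, origin, count(*) ... GROUP BY dest, origin."""
    return Counter((row["dest"], row["origin"]) for row in rows)


if __name__ == "__main__":
    # Hypothetical rows standing in for flights_hdfs.
    sample_rows = [
        {"dest": "JFK", "origin": "SFO"},
        {"dest": "JFK", "origin": "SFO"},
        {"dest": "ORD", "origin": "SFO"},
    ]
    for (dest, origin), n in sorted(flights_by_route(sample_rows).items()):
        print(dest, origin, n)
```

Once a result like this is stored, later queries that need per-route counts can read the small aggregated result instead of rescanning every flight row.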

