In preparation for my talk on using edge devices for deep learning, IoT sensor reading, and big data processing, I have updated my environment to the latest and greatest tools available.
With the upgrade of HDF to 3.2, I can now use Apache NiFi 1.7 and MiniFi 0.5 for IoT data ingestion, simple event processing, conversion, data flow, and storage.
The architecture diagram above shows the basic flow we are using.
IoT Step by Step
A Raspberry Pi with the latest patches and with Python, GPS software, a USB camera, sensor libraries, Java 8, MiniFi 0.5, TensorFlow, and Apache MXNet installed.
A MiniFi flow pushes JSON and JPEGs over HTTP(S) / Site-to-Site to an Apache NiFi gateway server.
Option: NiFi can push to a central NiFi cloud cluster and/or a Kafka cluster, both of which run on HDF 3.2 environments.
The Apache NiFi cluster pushes to Hive, HDFS, a Dockerized API running on HDP 3.0, and third-party APIs.
NiFi and Kafka integrate with Schema Registry for our tabular data, including the rainbow and gps JSON data.
SQL Tables in Hive
I stream my data into Apache ORC files stored on HDP 3.0 HDFS directories and build external tables on them.
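A minimal sketch of one of those external tables, assuming hypothetical columns for the rainbow sensor data and a placeholder HDFS path (the real schema will differ):

%jdbc(hive) CREATE EXTERNAL TABLE rainbow (
  sensor_id STRING,
  tempf DOUBLE,
  humidity DOUBLE,
  recorded_at STRING
)
STORED AS ORC
LOCATION '/iot/rainbow';

Loading the managed ACID tables from the external tables is then a simple insert-select: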
INSERT INTO rainbowacid
SELECT * FROM rainbow;

INSERT INTO gpsacid
SELECT * FROM gps;
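Here rainbowacid and gpsacid are the managed, transactional (ACID) tables. A minimal sketch of rainbowacid, assuming it mirrors the hypothetical external table above (in HDP 3.0, managed ORC tables are transactional by default, but the property is shown explicitly here):

%jdbc(hive) CREATE TABLE rainbowacid (
  sensor_id STRING,
  tempf DOUBLE,
  humidity DOUBLE,
  recorded_at STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');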
Hive 3.x Updates
%jdbc(hive) CREATE TABLE Persons_default (
  ID Int NOT NULL,
  Name String NOT NULL,
  Age Int,
  Creator String DEFAULT CURRENT_USER(),
  CreateDate Date DEFAULT CURRENT_DATE()
);
One of the cool new features in Hive is column defaults, which are helpful for standard values you often want, such as the current user or the current date. This gives us even more relational-style features in Hive.
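As a quick sketch of the behavior (the values here are made up for illustration), an insert that omits Creator and CreateDate lets Hive fill them in from the defaults:

%jdbc(hive) INSERT INTO Persons_default (ID, Name, Age)
VALUES (1, 'Tim', 42);

-- Creator and CreateDate get populated by CURRENT_USER() and CURRENT_DATE()
SELECT * FROM Persons_default;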
Another very interesting feature is materialized views, which precompute results so queries stay clean and fast. Here is a cool example:
CREATE MATERIALIZED VIEW mv1
AS
SELECT dest, origin, count(*) AS cnt
FROM flights_hdfs
GROUP BY dest, origin;
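With the materialized view in place, Hive's optimizer can rewrite matching queries to read from mv1 instead of rescanning flights_hdfs, provided materialized view rewriting is enabled (it is by default in Hive 3). A query shaped like the sketch below, where 'EWR' is just a made-up filter value, is a candidate for that rewrite:

%jdbc(hive) SELECT dest, count(*)
FROM flights_hdfs
WHERE origin = 'EWR'
GROUP BY dest;

When the underlying data changes, ALTER MATERIALIZED VIEW mv1 REBUILD; refreshes the view.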