TimothySpann · ‎12-29-2018

Implementing Streaming Machine Learning and Deep Learning In Production Part 1

After we have done our data exploration with Apache Zeppelin, Hortonworks Data Analytics Studio and other data science notebooks and tools, we will start building iterations of ever-improving models that need to be used in live environments. These models will need to run at scale and score millions of records in real-time streams.

These models can be built in various frameworks, versions and types, with many options for the data they require.

There are a number of things we need to think about when doing this.

Model Deployment Options

Apache Spark
Apache Storm (Hortonworks Streaming Analytics Manager - SAM)
Apache Kafka Streams
Apache NiFi
YARN 3.1
YARN Submarine
TensorFlow Serving on YARN
Cloudera Data Science Workbench

Requirements

Classification
REST API
Security
Automation
Data Lineage
Schema Versioning, REST API and Management
Data Provenance
Scripting
Integration with Kafka
Containerized Services
Support Docker Containers running on YARN
Support Dockerized Spark Jobs
Model Registry
Scalability
Data Variety
Data and Storage Format
Flexibility
Handling Media Types such as images, sound and video

Required Elements

Apache NiFi 1.8.0
Apache Kafka 2.0
Apache Kafka Streams 2.0
Apache Atlas 1.0.0
Apache Ranger 1.2.0
Apache Knox 1.0
Hortonworks Streams Messaging Manager 1.2.0
Hortonworks Schema Registry 0.5.2
NiFi Registry 0.2.0
Apache Hadoop 3.1
Apache YARN 3.1+
Apache HDFS or Amazon S3
Apache Druid 0.12.1
Apache HBase 2.0

Apache Spark - Apache NiFi

There are a number of options for running Machine Learning models in production via Apache NiFi. I have used these methods:

Apache NiFi to Apache Spark Integration via Kafka and Spark Streaming
Apache NiFi to Apache Spark Integration via Kafka and Spark Structured Streaming
Apache NiFi to Apache Spark Integration via Apache Livy

https://community.hortonworks.com/content/kbentry/174105/hdp-264-hdf-31-apache-spark-structured-stre...

https://community.hortonworks.com/articles/174105/hdp-264-hdf-31-apache-spark-structured-streaming-i...

https://community.hortonworks.com/content/kbentry/171787/hdf-31-executing-apache-spark-via-executesp...
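To make the Apache Livy option more concrete, here is a minimal sketch of the REST calls that sit behind an interactive Livy session: one request to open a session and one to submit a statement for Spark to run. This is an illustration under stated assumptions, not the flow from the linked articles. The endpoint `http://localhost:8998`, the session id, the helper names and the `SCORING_CODE` snippet are all hypothetical. Only `POST /sessions` and `POST /sessions/{id}/statements` are taken from Livy's REST API.

```python
import json
import urllib.request

# Assumption: hypothetical Livy endpoint (8998 is Livy's default port).
LIVY_URL = "http://localhost:8998"


def build_session_request(livy_url=LIVY_URL, kind="pyspark"):
    """Build (but do not send) a POST /sessions request that opens an interactive session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return urllib.request.Request(
        url=f"{livy_url}/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_statement_request(session_id, code, livy_url=LIVY_URL):
    """Build (but do not send) a POST /sessions/{id}/statements request carrying code to run."""
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        url=f"{livy_url}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send a prepared request and decode the JSON reply; needs a reachable Livy server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical snippet the session would execute on the cluster.
SCORING_CODE = "spark.range(5).selectExpr('id * 2 AS score').show()"

if __name__ == "__main__":
    # Only prints the prepared request; call send(...) against a real server to execute it.
    stmt = build_statement_request(0, SCORING_CODE)
    print(stmt.get_method(), stmt.full_url)
    print(stmt.data.decode("utf-8"))
```

Building the requests separately from sending them keeps the payloads easy to inspect. In a NiFi flow the equivalent calls would be issued by a processor rather than by hand, and a secured cluster would likely need additional authentication headers that this sketch leaves out.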

Hadoop - YARN 3.1 - No Docker - No Spark

We can deploy Deep Learning Models and run classification (as well as training) on YARN natively.

https://community.hortonworks.com/content/kbentry/222242/running-apache-mxnet-deep-learning-on-yarn-...

https://community.hortonworks.com/articles/224268/running-tensorflow-on-yarn-31-with-or-without-gpu....

Apache Kafka Streams

Kafka Streams has full integration with platform services, including Schema Registry, Ranger and Ambari.

Apache NiFi Native Java Processors for Classification

We can use a custom processor in Java that runs as a native part of the dataflow.

Apache NiFi Integration with a Model Server Native to a Framework