1973
Posts
1224
Kudos Received
124
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 89 | 04-03-2024 06:39 AM
 | 455 | 01-12-2024 08:19 AM
 | 288 | 12-07-2023 01:49 PM
 | 640 | 08-02-2023 07:30 AM
 | 1083 | 03-29-2023 01:22 PM
02-14-2017
04:39 PM
If you have any questions on testing Spark, let me know. Holden's testing approach is a very good way to do this. Also, going line by line in Zeppelin is a great way to debug Spark code.
02-14-2017
03:13 PM
That Livy is only for Zeppelin; it's not safe to use it directly. In HDP 2.6, there will be a Livy available for general usage.
02-14-2017
02:59 PM
The processor takes a property to run against. You just need to pass something in the sentence parameter; you can concatenate a few fields there. The source is open, so it would be easy to change it to ingest a flowfile and process that instead of an input attribute. It's a matter of changing 2-3 lines and rebuilding.
02-13-2017
05:57 PM
1 Kudo
For IO

The throughput or latency you can expect varies greatly depending on how the system is configured. Given that there are pluggable approaches to most of the major NiFi subsystems, performance depends on the implementation. For something concrete and broadly applicable, consider the out-of-the-box default implementations: they are all persistent with guaranteed delivery and use local disk. Being conservative, assume roughly a 50 MB per second read/write rate on modest disks or RAID volumes within a typical server. For a large class of dataflows, NiFi should then be able to efficiently reach 100 MB per second or more of throughput, because linear growth is expected for each physical partition and content repository added to NiFi. This will bottleneck at some point on the FlowFile repository and provenance repository. We plan to provide a benchmarking and performance test template in the build, which will let users easily test their system, identify where bottlenecks are, and see at which point they might become a factor. The template should also make it easy for system administrators to make changes and verify the impact.

For CPU

The Flow Controller acts as the engine dictating when a particular processor is given a thread to execute. Processors are written to return the thread as soon as they finish a task. The Flow Controller can be given a configuration value indicating the threads available for the various thread pools it maintains. The ideal number of threads depends on the host system's resources (number of cores), whether the system is running other services, and the nature of the processing in the flow. For typical IO-heavy flows, it is reasonable to make many dozens of threads available.

For RAM

NiFi lives within the JVM and is thus limited to the memory space the JVM affords it.
JVM garbage collection becomes a very important factor, both in restricting the total practical heap size and in how well the application runs over time. NiFi jobs can be I/O intensive when reading the same content regularly, so configure a large enough disk to optimize performance.

See:
https://community.hortonworks.com/questions/22685/capacity-planning-for-nifi-cluster.html
https://community.hortonworks.com/questions/4098/nifi-sizing.html
https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#configuration-best-practices
https://community.hortonworks.com/content/kbentry/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html
https://community.hortonworks.com/content/kbentry/9785/nifihdf-dataflow-optimization-part-2-of-2.html
http://apache-nifi.1125220.n5.nabble.com/Nifi-Benchmark-Performance-tests-td1099.html
http://docs.hortonworks.com/HDPDocuments/HDF2/HDF-2.1.1/bk_dataflow-overview/content/performance-expectations-and-characteristics-of-nifi.html
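The linear-scaling claim above can be sanity-checked with simple arithmetic. This is a rough sketch, not a NiFi API; the helper function and the two-volume example are my own illustration, using the conservative 50 MB/s per-volume figure from the text:

```python
def estimate_throughput_mb_s(content_volumes, per_volume_mb_s=50.0):
    # Rough aggregate read/write estimate, assuming the conservative
    # 50 MB/s per disk/RAID volume figure and linear growth for each
    # physical partition added to the content repository.
    return content_volumes * per_volume_mb_s

# Two modest volumes backing the content repository:
print(estimate_throughput_mb_s(2))  # 100.0
```

Real numbers will vary with the repository implementations and the flow itself, which is why the benchmarking template mentioned above matters more than any fixed estimate.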
02-13-2017
12:03 PM
1 Kudo
Falcon will manage Oozie, and a Web UI instead of XML should be available soon if you don't find one out in the wild that you like. A lot of companies run Oozie with lots of different jobs, and it works well. If you are doing Sqoop, Pig, and Hive, it's the way to go. With NiFi I run Sqoop, Pig, Spark, Python, TensorFlow, and MXNet jobs and connect them. I run them on cron timers and reactively when something happens (files appear, directories change, a Kafka message arrives, an MQTT message arrives, ...).

https://community.hortonworks.com/articles/64844/running-apache-pig-scripts-from-apache-nifi-and-st.html
https://community.hortonworks.com/articles/73828/submitting-spark-jobs-from-apache-nifi-using-livy.html
https://community.hortonworks.com/content/kbentry/63228/monitoring-your-containers-with-sysdig-from-hdf-20.html
https://community.hortonworks.com/articles/81222/adding-stanford-corenlp-to-big-data-pipelines-apac.html
https://community.hortonworks.com/articles/59349/hdf-20-flow-for-ingesting-real-time-tweets-from-st.html
https://community.hortonworks.com/articles/61180/streaming-ingest-of-google-sheets-into-a-connected.html
https://community.hortonworks.com/articles/72420/ingesting-remote-sensor-feeds-into-apache-phoenix.html
02-13-2017
03:33 AM
3 Kudos
Overview

I have been running a similar program on Raspberry Pi devices with TensorFlow. Now that MXNet has entered Apache incubation, it has become incredibly interesting to me. With the backing of Apache and Amazon, this library cannot be ignored. So I tried it on the same Raspberry Pi 3B that I was using for TensorFlow. For this example, we grab images from the standard Raspberry Pi Camera and run live image analysis on them with MXNet, using the Inception pre-built model from the MXNet Model Zoo. This is nearly the same as the TensorFlow example. What I noticed is slightly faster execution and a smoother process. For accuracy, I have not run enough tests to weigh the two libraries against each other, but that is something I will look at doing for a large number of images. Training with both my camera and images I am interested in would be very helpful. Some use cases I am thinking of: security camera, water leak detection, evil cat sensing, engine vibration, and a self-driving model car.

Hardware: Raspberry Pi v3 B with Pi Camera

Setup Your Device For Running MXNet

sudo apt-get -y install git cmake build-essential g++-4.8 c++-4.8 liblapack* libblas* libopencv*
git clone https://github.com/dmlc/mxnet.git --recursive
cd mxnet
make
cd python
sudo python setup.py install
curl -L 'http://data.mxnet.io/models/imagenet/inception-bn.tar.gz' -o 'inception-bn.tar.gz'
tar -xvzf inception-bn.tar.gz
mv Inception_BN-0126.params Inception_BN-0000.params
The primary code is Python, adapted from MXNet, OpenCV, and PiCamera examples. The call

topn = inception_predict.predict_from_local_file(filename, N=5)

runs inception_predict from the MXNet example; the inception_predict code is referenced in the links below.

Main Python Code

#!/usr/bin/python
# 2017: capture pictures from the Pi camera and analyze them with MXNet
import subprocess
import random, string
import json
import paho.mqtt.client as mqtt
import picamera
from time import gmtime, strftime
import inception_predict

packet_size = 3000

def randomword(length):
    return ''.join(random.choice(string.lowercase) for i in range(length))

# Create camera interface
camera = picamera.PiCamera()

while True:
    # Create a unique image name
    uniqueid = 'mxnet_uuid_{0}_{1}'.format(randomword(3), strftime("%Y%m%d%H%M%S", gmtime()))
    # Capture a jpg image from the Pi camera
    filename = '/home/pi/cap.jpg'
    camera.capture(filename)
    # Run Inception prediction on the image
    topn = inception_predict.predict_from_local_file(filename, N=5)
    # Read the CPU temperature
    p = subprocess.Popen(['/opt/vc/bin/vcgencmd', 'measure_temp'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    out = out.replace('\n', '').replace('temp=', '')
    # Connect to the MQTT broker
    client = mqtt.Client()
    client.username_pw_set("username", "password")
    client.connect("mqttcloudprovider", 14162, 60)
    # Top 5 MXNet predictions
    top1 = str(topn[0][1])
    top1pct = str(round(topn[0][0], 3) * 100)
    top2 = str(topn[1][1])
    top2pct = str(round(topn[1][0], 3) * 100)
    top3 = str(topn[2][1])
    top3pct = str(round(topn[2][0], 3) * 100)
    top4 = str(topn[3][1])
    top4pct = str(round(topn[3][0], 3) * 100)
    top5 = str(topn[4][1])
    top5pct = str(round(topn[4][0], 3) * 100)
    # Build a JSON record and publish it over MQTT
    row = [{'uuid': uniqueid, 'top1pct': top1pct, 'top1': top1, 'top2pct': top2pct, 'top2': top2, 'top3pct': top3pct, 'top3': top3, 'top4pct': top4pct, 'top4': top4, 'top5pct': top5pct, 'top5': top5, 'cputemp': out}]
    json_string = json.dumps(row)
    client.publish("mxnet", payload=json_string, qos=1, retain=False)
    client.disconnect()
We grab an image from the camera, run it through MXNet, convert the results to JSON, and then send the message to a cloud-hosted MQTT broker. I also grab the CPU temperature to show that we can add more sensors.
Example JSON Sent via MQTT
[{"top1pct": "54.5", "top5": "n04590129 window shade", "top4": "n03452741 grand piano, grand", "top3": "n03018349 china cabinet, china closet", "top2": "n03201208 dining table, board", "top1": "n04099969 rocking chair, rocker", "top2pct": "9.1", "top3pct": "8.0", "uuid": "mxnet_uuid_oqy_20170211203727", "top4pct": "2.8", "top5pct": "2.2", "cputemp": "75.2'C"}]

The schema is pretty consistent, as above, so we can create a Hive or Phoenix table and insert into it.
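On the receiving side (outside NiFi), the payload is just a JSON array holding one record, so any consumer can unpack it in a few lines. A minimal sketch against a trimmed copy of the sample message above:

```python
import json

# Trimmed version of the sample payload above (a JSON array with one record)
payload = '[{"top1pct": "54.5", "top1": "n04099969 rocking chair, rocker", "uuid": "mxnet_uuid_oqy_20170211203727", "cputemp": "75.2\'C"}]'

record = json.loads(payload)[0]            # the array carries a single record
label = record["top1"].split(" ", 1)[1]    # drop the leading ImageNet synset id
pct = float(record["top1pct"])             # percentages arrive as strings
print(label, pct)  # rocking chair, rocker 54.5
```

The same unpacking logic would apply whether the record lands in a Hive/Phoenix loader or a quick debugging script.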
HDF / NiFi Flow

Consume MQTT: this processor receives messages from a cloud-based MQTT broker sent by a few Raspberry Pis I have set up.

Extract Fields from MXNet output (EvaluateJSONPath)

Build a Message (UpdateAttribute):
Category 1 ${top1} at ${top1pct}%
Category 2 ${top2} at ${top2pct}%
Category 3 ${top3} at ${top3pct}%
Category 4 ${top4} at ${top4pct}%
Category 5 ${top5} at ${top5pct}%
UUID ${uuid}
CPU Temp ${cputemp}

Send Msg to Slack Channel (PutSlack): the channel is mxnet.

Store Files (PutFile)
We take the JSON, convert it to a text message, and send it to a Slack channel. That's all it takes to ingest data from an edge device running a camera, run deep learning on a tiny device, and then send the data asynchronously to a cloud-hosted broker that can distribute it to cloud and on-premise Apache NiFi servers. We could also use Site-to-Site, HTTP, or TCP/IP. MQTT is very lightweight, works over the Internet, has an easy Python library, and works well with Apache NiFi.
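For testing outside NiFi, the same message template can be reproduced in plain Python. The field names match the UpdateAttribute step above; the sample values come from the JSON shown earlier (only three fields, for brevity):

```python
# Field names mirror the NiFi expression-language template above;
# values are taken from the sample payload for illustration.
fields = {"top1": "n04099969 rocking chair, rocker", "top1pct": "54.5",
          "uuid": "mxnet_uuid_oqy_20170211203727", "cputemp": "75.2'C"}

message = ("Category 1 {top1} at {top1pct}%\n"
           "UUID {uuid}\n"
           "CPU Temp {cputemp}").format(**fields)
print(message)
```

This makes it easy to eyeball what PutSlack will post before wiring up the live flow.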
Reference: This sample program was critical and gave me most of the code needed:
http://mxnet.io/tutorials/embedded/wine_detector.html
http://data.mxnet.io/models/imagenet/
https://community.hortonworks.com/content/repo/77987/rpi-picamera-mqtt-nifi.html
https://github.com/tspannhw/mxnet_rpi/blob/master/analyze.py
https://community.hortonworks.com/content/kbentry/80339/iot-capturing-photos-and-analyzing-the-image-with.html

CloudMQTT has proven to be awesome: instant setup and a free instance for testing. This is great for getting data from my remote Raspberry Pis to the cloud and back into HDF 2.1 servers behind firewalls.
http://cloudmqtt.com
http://www.jsonpath.com/

Github Repo
https://github.com/tspannhw/mxnet_rpi
https://community.hortonworks.com/repos/83001/python-mxnet-raspberry-pi-example.html

Pushing to Slack Channel
https://nifi-se.slack.com/messages/mxnet/details/

Apache MXNet Incubation
https://wiki.apache.org/incubator/MXNetProposal

Awesome MXNet
https://github.com/dmlc/mxnet/tree/master/example

Install MXNet on Raspbian
http://mxnet.io/get_started/raspbian_setup.html

Example Program for MXNet on Raspberry Pi 3
http://mxnet.io/tutorials/embedded/wine_detector.html

MQTT
https://github.com/tspannhw/rpi-picamera-mqtt-nifi/blob/master/upload.py

Real-Image with Pretrained Model
http://mxnet.io/tutorials/r/classifyRealImageWithPretrainedModel.html

MXNet GTC Tutorial
https://github.com/dmlc/mxnet-gtc-tutorial

MXNet for Facial Identification
https://github.com/tornadomeet/mxnet-face
http://vis-www.cs.umass.edu/fddb/results.html
http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html

MXNet Models for ImageNet 1K Inception BN
https://github.com/dmlc/mxnet-model-gallery/blob/master/imagenet-1k-inception-bn.md

MXNet Example Image Classification
https://github.com/dmlc/mxnet/tree/master/example/image-classification

To inspect the captured image:
sudo apt-get install imagemagick
identify -verbose /home/pi/cap.jpg
02-10-2017
08:03 PM
Options:
1. Coordinate the jobs inside Spark.
2. Coordinate the jobs with Apache NiFi (I have done Sqoop, Hive, HBase, Pig, Spark, Python, and deep learning jobs with it).
3. Manage Oozie with Falcon: http://hortonworks.com/apache/falcon/
4. HUE is part of HDP: http://gethue.com/scheduling/
5. Luigi; I used it a few times and it seemed okay: https://blog.kupstaitis-dunkler.com/2016/07/19/how-to-create-a-data-pipeline-using-luigi/

See: https://www.linkedin.com/pulse/nifi-vs-falcon-oozie-birender-saini
02-10-2017
03:34 PM
Why simulate? A Raspberry Pi or two can send thousands of MQTT messages a second. If you do need to simulate, you could do it with Gatling or JMeter.
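That said, if you want synthetic load without hardware, a small generator plus the same paho-mqtt client used in the Raspberry Pi code is enough. A sketch (the broker host, port, and message shape are placeholders; the publish section is commented out so the generator runs standalone):

```python
import json
import random

def make_messages(n, topic="mxnet"):
    # Generate n fake (topic, payload) pairs resembling the Pi's JSON output.
    msgs = []
    for i in range(n):
        payload = json.dumps([{"uuid": "sim_{0}".format(i),
                               "top1pct": str(round(random.uniform(0, 100), 1)),
                               "cputemp": "50.0'C"}])
        msgs.append((topic, payload))
    return msgs

# Publishing a few thousand of these per second with paho-mqtt would
# approximate one busy Raspberry Pi (broker name is a placeholder):
# import paho.mqtt.client as mqtt
# client = mqtt.Client()
# client.connect("your-broker-host", 1883, 60)
# for topic, payload in make_messages(5000):
#     client.publish(topic, payload, qos=0)
print(len(make_messages(3)))  # 3
```

Gatling or JMeter remain the better choice when you need coordinated ramp-up profiles and latency reports rather than raw message volume.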
02-10-2017
01:57 PM
https://airflow.incubator.apache.org/tutorial.html

Hortonworks does not support Airflow as of yet; it's in pretty early incubation. Perhaps @Chris Nauroth can shed some light. You might want to try out HDF (Apache NiFi) for job running. https://wiki.apache.org/incubator/AirflowProposal

Anything that works with Apache Hadoop will work with Hortonworks, as HDP is pure 100% open-source Apache Hadoop. http://nerds.airbnb.com/airflow/ This is Airbnb's project for the most part, so check out their info. See: https://airflow.incubator.apache.org/code.html (macros.random might assist you). What's your use case?
02-09-2017
09:12 PM
See Hadoop here: https://github.com/tensorflow/ecosystem