1973 Posts, 1225 Kudos Received, 124 Solutions
10-04-2018
01:33 PM
2 Kudos
Properties File Lookup Augmentation of Data Flow in Apache NiFi 1.7.x

A really cool technologist contacted me on LinkedIn and asked an interesting question: "Tim, how do I read values from a properties file and use them in my flow? I want to update/inject an attribute with this value."

If you don't want to use the Variable Registry but still want to inject a value from a properties file, how do you do it? You could run a REST server and read it, or resort to some file-reading hack. But we have a great service that does this very easily.

In my UpdateAttribute (or in your regular attributes already), I have an attribute named keytofind. This contains a lookup key, such as an integer or a string. We will find that key in the properties file and return its value in an attribute of your choosing. We have a Controller Service to handle this for you. It reads from your specified properties file. Make sure Apache NiFi has permission to that path and can read the file.

PropertiesFileLookupService: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-lookup-services-nar/1.7.1/org.apache.nifi.lookup.PropertiesFileLookupService/index.html

We look up the key specified in "keytofind". It returns a value that you specify as an extra attribute; mine is "updatedvalue".

This is my properties file:

-rwxrwxrwx 1 tspann staff 67 Oct 4 09:15 lookup.properties
stuff1=value1
stuff2=value2
stuff3=value other
tim=spann
nifi=cool

In this example, we are using the LookupAttribute processor. You can also use the LookupRecord processor, depending on your needs.
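To make the behavior concrete, here is a minimal Python sketch of what the lookup amounts to: read lookup.properties, resolve the key held in the keytofind attribute, and emit it as updatedvalue. This is only an illustration of the controller service's behavior, not NiFi code; the file path and attribute names simply mirror the example above.

# Minimal sketch (not NiFi code): emulate PropertiesFileLookupService + LookupAttribute.
# Reads lookup.properties, resolves the key stored in "keytofind",
# and returns it as a new attribute "updatedvalue".

def load_properties(path):
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def augment(attributes, props):
    key = attributes.get("keytofind")
    if key in props:
        attributes["updatedvalue"] = props[key]
    return attributes


if __name__ == "__main__":
    properties = load_properties("lookup.properties")
    flowfile_attributes = {"keytofind": "tim"}
    print(augment(flowfile_attributes, properties))
    # {'keytofind': 'tim', 'updatedvalue': 'spann'}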
Resources:

http://discover.attunity.com/apache-nifi-for-dummies-en-report-go-c-lp8558.html
https://community.hortonworks.com/articles/140231/data-flow-enrichment-with-nifi-part-2-lookupattrib.html
https://community.hortonworks.com/articles/189213/etl-with-lookups-with-apache-hbase-and-apache-nifi.html

The Flow: lookup-from-properties-values.xml
10-02-2018
05:42 PM
Not a kerberized cluster. Maybe: https://stackoverflow.com/questions/40595332/how-to-connect-to-a-kerberos-secured-apache-phoenix-data-source-with-wildfly
09-28-2018
09:30 PM
This is an extension of this article: https://community.hortonworks.com/articles/163776/parsing-any-document-with-apache-nifi-15-with-apac.html
09-28-2018
06:36 PM
2 Kudos
Related: ApacheCon 2018 in Montreal

With my talk in a dual-language city, I thought perhaps I should do my talk in French. My college French is very rusty and my accent is very New Jersey; the two don't mix well. So let's have Apache NiFi do it for me. After publicly debating this on Twitter, I decided to see if I could implement a solution.

Secret: most of the heavy lifting is done by Python, which calls the Google Translate API under the covers, automagically.

My presentation is here: https://www.slideshare.net/bunkertor/apache-deep-learning-101-apachecon-montreal-2018-v031

Flow to Extract French

Apache Tika extracts the text from PDF or PPTX and converts it to text or HTML (I chose text). Run the translate Python script with a sentence extracted from the PDF or PPTX. Let's send that translated French line to its own Slack channel, and the English to another. And there it is.

runtranslate.sh

python3.6 -W ignore /Volumes/TSPANN/2018/talks/IOT/translate.py "$1" 2>/dev/null

translate.py

from textblob import TextBlob
import sys
text = ""
# Join all command-line arguments into one text string
for x in sys.argv[1:]:
    text += str(x)
#text = sys.stdin.read()
#print(text)
blob = TextBlob(text)
#for sentence in blob.sentences:
# print(sentence.sentiment.polarity)
# 0.060
# -0.341
print(blob.translate(to="fr"))

NiFi Flow: make-it-french.xml
09-24-2018
06:55 PM
2 Kudos
Using GluonCV 0.3 with Apache MXNet 1.3

source code: https://github.com/tspannhw/nifi-gluoncv-yolo3

*Captured and Processed Image Available for Viewing in Stream in Apache NiFi 1.7.x

use case: I need to easily monitor the contents of my security vault. It is a fixed number of known things. What we need in the real world is a nice camera or cameras (maybe four to eight, depending on the angles of the room), a device like an NVidia Jetson TX2, the MiniFi 0.5 Java Agent, JDK 8, Apache MXNet, GluonCV, lots of Python libraries, a network connection and a simple workflow. Outside of my vault, I will need a server or cluster to do the more advanced processing, though I could run it all on the local box. If the number of items, or certain items I am watching, are no longer on screen, then we should send an immediate alert. That could be SMS, email, Slack, an alerting system or other means. We have most of that implemented below. If anyone wants to do the complete use case, I can assist.

demo implementation: I wanted to use the new YOLO v3 model, which is part of the new 0.3 stream, so I installed a 0.3 build. This may be final by the time you read this. You can try a regular pip3.6 install -U gluoncv and see what you get.

pip3.6 install -U gluoncv==0.3.0b20180924

YOLO v3 is a great pretrained model to use for object detection. See: https://gluon-cv.mxnet.io/build/examples_detection/demo_yolo.html

The GluonCV Model Zoo is very rich and incredibly easy to use. We just grab the model "yolo3_darknet53_voc" with an automatic one-time download and we are ready to go. They provide easy-to-customize code to start with. I write my processed image and JSON results out for ingest by Apache NiFi. You will notice this is similar to what we did for the Open Computer Vision talks: https://community.hortonworks.com/articles/198939/using-apache-mxnet-gluoncv-with-apache-nifi-for-de.html. This is updated and even easier: I dropped MQTT and just output image files and some JSON to read. GluonCV makes working with Computer Vision extremely clean and easy.

why Apache NiFi For Deep Learning Workflows

Let me count the top five ways:

#1 Provenance - lets me see everything, everywhere, all the time, with the data and the metadata.
#2 Configurable Queues - queues are everywhere and they are extremely configurable on size and priority. There's always backpressure and safety between every step. Sinks, sources and steps can be offline as things happen in the real-world internet. Offline, online, wherever, I can recover and have full visibility into my flows as they spread between devices, servers, networks, clouds and nation-states.
#3 Security - secure at every level, from SSL to data encryption, with integration with leading-edge tools including Apache Knox, Apache Ranger and Apache Atlas. See: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.1.1/bk_security/content/ch_enabling-knox-for-nifi.html
#4 UI - a simple UI to develop, monitor and manage incredibly complex flows, including IoT, deep learning, logs and every data source you can throw at it.
#5 Agents - MiniFi gives me two different agents for my devices or systems to stream data headless.

running gluoncv yolo3 model

I wrap my Python script in a shell script to throw away warnings and junk:

cd /Volumes/TSPANN/2018/talks/ApacheDeepLearning101/nifi-gluoncv-yolo3
python3.6 -W ignore /Volumes/TSPANN/2018/talks/ApacheDeepLearning101/nifi-gluoncv-yolo3/yolonifi.py 2>/dev/null

List of Possible Objects We Can Detect

["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow",
"diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train",
"tvmonitor"]
"tvmonitor"] I am going to train this with my own data for the upcoming INTERNET OF BEER, for the vault use case we would need your vault content pictures. See: https://gluon-cv.mxnet.io/build/examples_datasets/detection_custom.html#sphx-glr-build-examples-datasets-detection-custom-py Example Output in JSON {"imgname": "images/gluoncv_image_20180924190411_b90c6ba4-bbc7-4bbf-9f8f-ee5a6a859602.jpg", "imgnamep": "images/gluoncv_image_p_20180924190411_b90c6ba4-bbc7-4bbf-9f8f-ee5a6a859602.jpg", "class1": "tvmonitor", "pct1": "49.070724999999996", "host": "HW13125.local", "shape": "(1, 3, 512, 896)", "end": "1537815855.105193", "te": "4.199203014373779", "battery": 100, "systemtime": "09/24/2018 15:04:15", "cpu": 33.7, "diskusage": "49939.2 MB", "memory": 60.1, "id": "20180924190411_b90c6ba4-bbc7-4bbf-9f8f-ee5a6a859602"} Example Processed Image Output It found one generic person, we could train against a known set of humans that are allowed in an area or are known users. nifi flows Gateway Server (We could skip this, but aggregating multiple camera agents is useful) Send the Flow to the Cloud Cloud Server Site-to-Site After we infer the schema of the data once, we don't need it again. We could derive the schema manually or from another tool, but this is easy. Once you are done, then you can delete the InferAvroSchema processor from your flow. I left mine in for your uses if you wish to start from this flow that is attached at the end of the article. flow steps Route When No Error to Merge Record Then Convert Those Aggregated Apache Avro Records into One Apache ORC file. Then store it in an HDFS directory. Once complete their will be a DDL added to metadata that you can send to a PutHiveQL or manually create the table in Beeline or Zeppelin or Hortonworks Data Analytics Studio (https://hortonworks.com/products/dataplane/data-analytics-studio/). schema: gluoncvyolo { "type" : "record", "name" : "gluoncvyolo", "fields" : [ { "name" : "imgname", "type" : "string", "doc" : "Type inferred from '\"images/gluoncv_image_20180924211055_8f3b9dac-5645-49aa-94e7-ee5176c3f55c.jpg\"'" }, { "name" : "imgnamep", "type" : "string", "doc" : "Type inferred from '\"images/gluoncv_image_p_20180924211055_8f3b9dac-5645-49aa-94e7-ee5176c3f55c.jpg\"'" }, { "name" : "class1", "type" : "string", "doc" : "Type inferred from '\"tvmonitor\"'" }, { "name" : "pct1", "type" : "string", "doc" : "Type inferred from '\"95.71207000000001\"'" }, { "name" : "host", "type" : "string", "doc" : "Type inferred from '\"HW13125.local\"'" }, { "name" : "shape", "type" : "string", "doc" : "Type inferred from '\"(1, 3, 512, 896)\"'" }, { "name" : "end", "type" : "string", "doc" : "Type inferred from '\"1537823458.559896\"'" }, { "name" : "te", "type" : "string", "doc" : "Type inferred from '\"3.580893039703369\"'" }, { "name" : "battery", "type" : "int", "doc" : "Type inferred from '100'" }, { "name" : "systemtime", "type" : "string", "doc" : "Type inferred from '\"09/24/2018 17:10:58\"'" }, { "name" : "cpu", "type" : "double", "doc" : "Type inferred from '12.0'" }, { "name" : "diskusage", "type" : "string", "doc" : "Type inferred from '\"48082.7 MB\"'" }, { "name" : "memory", "type" : "double", "doc" : "Type inferred from '70.6'" }, { "name" : "id", "type" : "string", "doc" : "Type inferred from '\"20180924211055_8f3b9dac-5645-49aa-94e7-ee5176c3f55c\"'" } ] } Tabular data has fields with types and properties. Let's specify those for automated analysis, conversion and live stream SQL. 
hive table schema: gluoncvyolo

CREATE EXTERNAL TABLE IF NOT EXISTS gluoncvyolo (imgname STRING, imgnamep STRING, class1 STRING, pct1 STRING, host STRING, shape STRING, `end` STRING, te STRING, battery INT, systemtime STRING, cpu DOUBLE, diskusage STRING, memory DOUBLE, id STRING) STORED AS ORC;
Apache NiFi generates tables for me in Apache Hive 3.x as Apache ORC files for fast performance.

hive acid table schema: gluoncvyoloacid

CREATE TABLE gluoncvyoloacid
(imgname STRING, imgnamep STRING, class1 STRING, pct1 STRING, host STRING, shape STRING, `end` STRING, te STRING, battery INT, systemtime STRING, cpu DOUBLE, diskusage STRING, memory DOUBLE, id STRING)
STORED AS ORC TBLPROPERTIES ('transactional'='true');

I can just as easily insert or update data into Hive 3.x ACID 2 tables. We have data, now query it. Easy, no-install analytics with tables, Leafletjs, AngularJS, graphs, maps and charts.

nifi flow registry

To manage version control, I am using the NiFi Registry, which is great. In the newest version, 0.2, there is the ability to back it up with GitHub! It's easy. Everything you need to know is in the docs and Bryan Bende's excellent post on the subject:

https://nifi.apache.org/docs/nifi-registry-docs/index.html
https://bryanbende.com/development/2018/06/20/apache-nifi-registry-0-2-0

There were a few gotchas for me.
Use your own new GitHub project with permissions and then clone it locally: git clone https://github.com/tspannhw/nifi-registry-github.git
Make sure the GitHub directory has permissions and is empty (no readme or junk).
Make sure you put in the full directory path.
Update your config like below:

<flowPersistenceProvider>
<class>org.apache.nifi.registry.provider.flow.git.GitFlowPersistenceProvider</class>
<property name="Flow Storage Directory">/Users/tspann/Documents/nifi-registry-0.2.0/conf/nifi-registry-github</property>
<property name="Remote To Push">origin</property>
<property name="Remote Access User">tspannhw</property>
<property name="Remote Access Password">generatethis</property>
</flowPersistenceProvider>

This is my GitHub directory to hold versions: https://github.com/tspannhw/nifi-registry-github

resources:
https://github.com/tspannhw/UsingGluonCV
https://gluon.mxnet.io/chapter01_crashcourse/ndarray.html
https://gluon-cv.mxnet.io/build/examples_detection/demo_yolo.html#sphx-glr-build-examples-detection-demo-yolo-py
https://gluon-cv.mxnet.io/model_zoo/index.html#object-detection
https://community.hortonworks.com/articles/215271/iot-edge-processing-with-deep-learning-on-hdf-32-a-2.html
https://community.hortonworks.com/articles/198912/ingesting-apache-mxnet-gluon-deep-learning-results.html

zeppelin notebook: apache-mxnet-gluoncv-yolov3-copy.json
nifi flow: gluoncv-server.xml
09-21-2018
03:37 PM
3 Kudos
Running Apache MXNet Deep Learning on YARN 3.1 - HDP 3.0
With Hadoop 3.1 / HDP 3.0, we can easily run distributed classification, training and other deep learning jobs. I am using Apache MXNet with Python. You can also do TensorFlow or PyTorch.
If you need GPU resources, you can specify them as such:
yarn.io/gpu=2
My cluster does not have an NVidia GPU unfortunately.
See:
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.0/data-operating-system/content/dosg_recommendations_for_running_docker_containers_on_yarn.html
Running App on YARN
[root@princeton0 ApacheDeepLearning101]# ./yarn.sh
18/09/21 15:31:22 INFO distributedshell.Client: Initializing Client
18/09/21 15:31:22 INFO distributedshell.Client: Running Client
18/09/21 15:31:22 INFO client.RMProxy: Connecting to ResourceManager at princeton0.field.hortonworks.com/172.26.208.140:8050
18/09/21 15:31:23 INFO client.AHSProxy: Connecting to Application History server at princeton0.field.hortonworks.com/172.26.208.140:10200
18/09/21 15:31:23 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=1
18/09/21 15:31:23 INFO distributedshell.Client: Got Cluster node info from ASM
18/09/21 15:31:23 INFO distributedshell.Client: Got node report from ASM for, nodeId=princeton0.field.hortonworks.com:45454, nodeAddress=princeton0.field.hortonworks.com:8042, nodeRackName=/default-rack, nodeNumContainers=4
18/09/21 15:31:23 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.4, queueMaxCapacity=1.0, queueApplicationCount=8, queueChildQueueCount=0
18/09/21 15:31:23 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS
18/09/21 15:31:23 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=ADMINISTER_QUEUE
18/09/21 15:31:23 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=SUBMIT_APPLICATIONS
18/09/21 15:31:23 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=ADMINISTER_QUEUE
18/09/21 15:31:23 INFO distributedshell.Client: Max mem capability of resources in this cluster 15360
18/09/21 15:31:23 INFO distributedshell.Client: Max virtual cores capability of resources in this cluster 12
18/09/21 15:31:23 WARN distributedshell.Client: AM Memory not specified, use 100 mb as AM memory
18/09/21 15:31:23 WARN distributedshell.Client: AM vcore not specified, use 1 mb as AM vcores
18/09/21 15:31:23 WARN distributedshell.Client: AM Resource capability=<memory:100, vCores:1>
18/09/21 15:31:23 INFO distributedshell.Client: Copy App Master jar from local filesystem and add to local environment
18/09/21 15:31:24 INFO distributedshell.Client: Set the environment for the application master
18/09/21 15:31:24 INFO distributedshell.Client: Setting up app master command
18/09/21 15:31:24 INFO distributedshell.Client: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx100m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_type GUARANTEED --container_memory 512 --container_vcores 1 --num_containers 1 --priority 0 1><LOG_DIR>/AppMaster.stdout 2><LOG_DIR>/AppMaster.stderr
18/09/21 15:31:24 INFO distributedshell.Client: Submitting application to ASM
18/09/21 15:31:24 INFO impl.YarnClientImpl: Submitted application application_1536697796040_0022
18/09/21 15:31:25 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=AM container is launched, waiting for AM container to Register with RM, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://princeton0.field.hortonworks.com:8088/proxy/application_1536697796040_0022/, appUser=root
18/09/21 15:31:26 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=AM container is launched, waiting for AM container to Register with RM, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://princeton0.field.hortonworks.com:8088/proxy/application_1536697796040_0022/, appUser=root
18/09/21 15:31:27 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=AM container is launched, waiting for AM container to Register with RM, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://princeton0.field.hortonworks.com:8088/proxy/application_1536697796040_0022/, appUser=root
18/09/21 15:31:28 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=, appMasterHost=princeton0/172.26.208.140, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://princeton0.field.hortonworks.com:8088/proxy/application_1536697796040_0022/, appUser=root
18/09/21 15:31:29 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=, appMasterHost=princeton0/172.26.208.140, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://princeton0.field.hortonworks.com:8088/proxy/application_1536697796040_0022/, appUser=root
18/09/21 15:31:30 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=, appMasterHost=princeton0/172.26.208.140, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://princeton0.field.hortonworks.com:8088/proxy/application_1536697796040_0022/, appUser=root
18/09/21 15:31:31 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=, appMasterHost=princeton0/172.26.208.140, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://princeton0.field.hortonworks.com:8088/proxy/application_1536697796040_0022/, appUser=root
18/09/21 15:31:32 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=, appMasterHost=princeton0/172.26.208.140, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://princeton0.field.hortonworks.com:8088/proxy/application_1536697796040_0022/, appUser=root
18/09/21 15:31:33 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=, appMasterHost=princeton0/172.26.208.140, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=FINISHED, distributedFinalState=SUCCEEDED, appTrackingUrl=http://princeton0.field.hortonworks.com:8088/proxy/application_1536697796040_0022/, appUser=root
18/09/21 15:31:33 INFO distributedshell.Client: Application has completed successfully. Breaking monitoring loop
18/09/21 15:31:33 INFO distributedshell.Client: Application completed successfully
Results:
https://github.com/tspannhw/ApacheDeepLearning101/blob/master/run.log
Script:
https://github.com/tspannhw/ApacheDeepLearning101/blob/master/yarn.sh

yarn jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -shell_command python3.6 -shell_args "/opt/demo/ApacheDeepLearning101/analyzex.py /opt/demo/images/201813161108103.jpg" -container_resources memory-mb=512,vcores=1

For pre-HDP 3.0, see my older script using the DMLC YARN runner. We don't need that anymore. No Spark either.
https://github.com/tspannhw/nifi-mxnet-yarn
Python MXNet Script:
https://github.com/tspannhw/ApacheDeepLearning101/blob/master/analyzehdfs.py
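As a rough illustration of what analyzex.py / analyzehdfs.py do per image, here is a minimal MXNet Gluon sketch that classifies one image and collects the top-5 predictions. The model choice (a pretrained ImageNet ResNet from the Gluon model zoo) and the file path are assumptions for the example, not the exact ones in my scripts.

# Minimal sketch (assumed model and file names) of per-image classification
# along the lines of analyzex.py: load a pretrained ImageNet model with
# MXNet Gluon, classify one image and keep the top-5 predictions.
import mxnet as mx
from mxnet.gluon.model_zoo import vision
from mxnet.gluon.data.vision import transforms

# One-time download of a pretrained ImageNet model.
net = vision.resnet50_v2(pretrained=True)

# Standard ImageNet preprocessing.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = mx.image.imread('/opt/demo/images/201813161108103.jpg')
img = mx.image.imresize(img, 224, 224)
batch = transform(img).expand_dims(axis=0)

probs = mx.nd.softmax(net(batch))[0]
top5 = mx.nd.topk(probs, k=5)

# Mapping a class index to a label such as "n03063599 coffee mug"
# requires the ImageNet synset file, which is omitted here.
for rank, i in enumerate(top5.asnumpy().astype('int32'), start=1):
    print("top%d: class %d  %.2f%%" % (rank, i, probs[int(i)].asscalar() * 100.0))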
Since we are distributed, let's write the results to HDFS. We can install the Python HDFS library, which works on Python 2.7 and 3.x. So let's pip install it.
pip install hdfs
In our code:
from json import dumps
from hdfs import InsecureClient

# WebHDFS client pointed at the NameNode
client = InsecureClient('http://princeton0.field.hortonworks.com:50070', user='root')

# uniqueid and row are produced by the classification step in analyzehdfs.py
client.write('/mxnetyarn/' + uniqueid + '.json', dumps(row))
We write our row as JSON to HDFS. When the job completes in YARN, we get a new JSON file written to HDFS.

hdfs dfs -ls /mxnetyarn
Found 2 items
-rw-r--r-- 3 root hdfs 424 2018-09-21 17:50 /mxnetyarn/mxnet_uuid_img_20180921175007.json
-rw-r--r-- 3 root hdfs 424 2018-09-21 17:55 /mxnetyarn/mxnet_uuid_img_20180921175552.json
hdfs dfs -cat /mxnetyarn/mxnet_uuid_img_20180921175552.json
{"uuid": "mxnet_uuid_img_20180921175552", "top1pct": "49.799999594688416", "top1": "n03063599 coffee mug", "top2pct": "21.50000035762787", "top2": "n07930864 cup", "top3pct": "12.399999797344208", "top3": "n07920052 espresso", "top4pct": "7.500000298023224", "top4": "n07584110 consomme", "top5pct": "5.200000107288361", "top5": "n04263257 soup bowl", "imagefilename": "/opt/demo/images/201813161108103.jpg", "runtime": "0"}
HDP Assemblies
https://github.com/hortonworks/hdp-assemblies/
https://github.com/hortonworks/hdp-assemblies/blob/master/tensorflow/markdown/Dockerfile.md
https://github.com/hortonworks/hdp-assemblies/blob/master/tensorflow/markdown/TensorflowOnYarnTutorial.md
https://github.com/hortonworks/hdp-assemblies/blob/master/tensorflow/markdown/RunTensorflowJobUsingHelperScript.md
*** SUBMARINE ***
Coming soon: Submarine is a really cool new way to run deep learning jobs on YARN.
https://github.com/leftnoteasy/hadoop-1/tree/submarine/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine
See this awesome presentation from Strata NYC 2018 by Wangda Tan (Hortonworks): https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/68289
See the quick start for setting Docker and GPU options:
https://github.com/leftnoteasy/hadoop-1/blob/submarine/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/QuickStart.md
Resources:
https://community.hortonworks.com/articles/60480/using-images-stored-in-hdfs-for-web-pages.html
09-21-2018
03:01 PM
1 Kudo
No need for Kafka. Follow my articles:
https://community.hortonworks.com/articles/149910/handling-hl7-records-part-1-hl7-ingest.html
https://community.hortonworks.com/articles/149891/handling-hl7-records-and-storing-in-apache-hive-fo.html
You can read it from files. Which simulator is it?
09-11-2018
01:27 PM
If you look, ParseSQL is an ExtractText processor. We use ExtractText to get the SQL statement created by GenerateTableFetch. We add a new attribute, sql, with the value ^(.*).
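As a rough illustration of what that ExtractText property does (plain Python, not NiFi, and the sample statement is hypothetical): the regex ^(.*) captures the flowfile content into capture group 1, which becomes the new sql attribute.

import re

# Illustration only: the ExtractText pattern ^(.*) captures the content into
# group 1, which ExtractText exposes as the "sql" attribute.
flowfile_content = "SELECT * FROM mytable WHERE id > 0 LIMIT 10000"

match = re.match(r"^(.*)", flowfile_content)
attributes = {"sql": match.group(1)}
print(attributes)
# {'sql': 'SELECT * FROM mytable WHERE id > 0 LIMIT 10000'}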
09-10-2018
04:23 PM
3 Kudos
IoT Edge Processing with Apache NiFi and MiniFi and Multiple Deep Learning Libraries Series

For: https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/68140
See Part 1: https://community.hortonworks.com/articles/215079/iot-edge-processing-with-deep-learning-on-hdf-32-a.html
See Part 2: https://community.hortonworks.com/articles/215258/iot-edge-processing-with-deep-learning-on-hdf-32-a-1.html
See Part 3: https://community.hortonworks.com/articles/215271/iot-edge-processing-with-deep-learning-on-hdf-32-a-2.html

You will notice a bit of a travel theme in this article; it's because some of the images and work were done while on various holidays in August and September.

Deep Learning

We are running TensorFlow 1.10, Apache MXNet 1.3, NCSDK 2.05 and the Neural Compute Application Zoo (NC App Zoo).

Device Type 1: Plain Raspberry Pi (Found some old Kodak slides...)

The main things to do are upgrade to Python 3.6, upgrade the Raspberry Pi to Stretch, upgrade libraries and do a few reboots. Install OpenCV (or upgrade it) and install Apache MXNet. You want to make sure you are on the latest version of Stretch and everything is cleaned up. Example:

sudo apt-get upgrade
sudo apt-get install build-essential tk-dev libncurses5-dev libncursesw5-dev libreadline6-dev libdb5.3-dev libgdbm-dev libsqlite3-dev libssl-dev libbz2-dev libexpat1-dev liblzma-dev zlib1g-dev
sudo apt autoremove
pip3.6 install --upgrade pip
pip3.6 install mxnet
git clone https://github.com/apache/incubator-mxnet.git --recursive

Device Type 2: Raspberry Pi Enhanced with Movidius Neural Compute Stick

I have updated the code to work with the new Movidius NCSDK 2.05. See: https://github.com/tspannhw/StrataNYC2018/blob/master/all2.py

I also updated some variable formatting and added some additional values. Evolve that schema! So you can see some additional data:

{"uuid": "mxnet_uuid_json_20180911021437.json", "label3": "n04081281 restaurant, eating house, eating place, eatery", "label1": "n03179701 desk", "roll": 4.0, "y": 0.0, "value5": "3.5%", "ipaddress": "192.168.1.156", "top5": "n03637318 lampshade, lamp shade", "label5": "n02788148 bannister, banister, balustrade, balusters, handrail", "host": "sensehatmovidius", "cputemp": 53, "top3pct": "6.5%", "diskfree": "5289.1 MB", "pressure": 1018.6, "cafferuntime": "111.685844ms", "label4": "n04009552 projector", "top4": "n03742115 medicine chest, medicine cabinet", "humidity": 42.5, "cputemp2": 52.62, "value2": "6.1%", "value3": "6.0%", "top2pct": "6.9%", "top1": "n02788148 bannister, banister, balustrade, balusters, handrail", "top4pct": "6.4%", "currenttime": "2018-09-11 02:14:44", "label2": "n03924679 photocopier", "top1pct": "7.3%", "top3": "n04286575 spotlight, spot", "starttime": "2018-09-11 02:14:33", "top5pct": "3.9%", "memory": 35.2, "value4": "5.0%", "top2": "n03250847 drumstick", "runtime": "11", "z": 1.0, "pitch": 360.0, "imagefilename": "/opt/demo/images/2018-09-10_2214.jpg", "tempf": 75.25, "temp": 35.14, "yaw": 86.0, "value1": "8.5%", "x": 0.0}
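The sensor fields in the JSON above (humidity, pressure, temp, roll, pitch, yaw, memory, diskfree) come from the Raspberry Pi Sense HAT and from host metrics. A minimal sketch of how such a row can be gathered, assuming the standard sense_hat and psutil Python libraries (my all2.py may collect them differently):

# Minimal sketch (assuming the sense_hat and psutil libraries) of how the
# sensor fields in the JSON above are gathered on the Raspberry Pi before
# being merged with the deep learning results.
from datetime import datetime

import psutil
from sense_hat import SenseHat

sense = SenseHat()
orientation = sense.get_orientation()

row = {
    'host': 'sensehatmovidius',
    'currenttime': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'temp': round(sense.get_temperature(), 2),
    'tempf': round(sense.get_temperature() * 1.8 + 32, 2),
    'humidity': round(sense.get_humidity(), 1),
    'pressure': round(sense.get_pressure(), 1),
    'pitch': round(orientation['pitch'], 1),
    'roll': round(orientation['roll'], 1),
    'yaw': round(orientation['yaw'], 1),
    'memory': psutil.virtual_memory().percent,
    'diskfree': '%.1f MB' % (psutil.disk_usage('/').free / 1048576.0),
}

print(row)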
Apache NiFi and MiniFi: Process, Proxy, Access, Filter and Transform Data Anywhere, Anytime, Any Platform

Apache NiFi and MiniFi work in Moab, Utah.

Resources:

https://github.com/tspannhw/StrataNYC2018
https://www.geomesa.org/documentation/current/tutorials/geomesa-quickstart-nifi.html
https://github.com/cinci/rpi-sense-hat-java
https://movidius.github.io/ncsdk/install.html
https://movidius.github.io/ncsdk/tf_modelzoo.html
https://github.com/movidius/ncappzoo/
https://github.com/movidius/ncappzoo/blob/ncsdk2/tensorflow/facenet/README.md
https://github.com/movidius/ncappzoo/blob/ncsdk2/tensorflow/inception_v4/README.md
https://medium.com/tensorflow/tensorflow-1-9-officially-supports-the-raspberry-pi-b91669b0aa0
https://github.com/lhelontra/tensorflow-on-arm/releases/download/v1.10.0/tensorflow-1.10.0-cp35-none-linux_armv7l.whl
https://github.com/movidius/ncappzoo/blob/ncsdk2/apps/image-classifier/README.md
08-31-2018
05:31 PM
4 Kudos
IoT Edge Processing with Apache NiFi and MiniFi and Multiple Deep Learning Libraries Series

For: https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/68140
See Part 1: https://community.hortonworks.com/articles/215079/iot-edge-processing-with-deep-learning-on-hdf-32-a.html
See Part 2: https://community.hortonworks.com/articles/215258/iot-edge-processing-with-deep-learning-on-hdf-32-a-1.html

Hive - SQL - IoT Data Storage

In this section, we will focus on converting JSON to Avro to Apache ORC and on storage options in Apache Hive 3. I am doing two styles of storage for one of the tables, rainbow: I am storing ORC files with an external table as well as using the Streaming API to store into an ACID table.

NiFi - SQL - On Streams - Calcite

SELECT *
FROM FLOWFILE
WHERE CAST(memory AS FLOAT) > 0

SELECT *
FROM FLOWFILE
WHERE CAST(tempf AS FLOAT) > 65

I check the flows as they are ingested in real time and filter based on conditions such as memory or temperature. This makes for powerful and easy simple event processing. It is very handy when you want to filter out standard conditions where no anomaly has occurred.

IoT Data Storage Options

For time series data, we are blessed with many options in HDP 3.x. The simplest choice, which I am doing first here, is a plain Apache Hive 3.x table. This is where we have some tough decisions about which engine to use. Hive has the best, most complete SQL and lots of interfaces, so it is my default choice for where and how to store my data. If it were more than a few thousand rows a second with a timestamp, then we would have to think about the architecture. Apache Druid has a lot of amazing abilities with time series data like what's coming out of these IoT devices. Since we can join Hive and Druid data and put Hive tables on top of Druid, we really should consider using Druid for our storage handler.

https://cwiki.apache.org/confluence/display/Hive/Druid+Integration
https://cwiki.apache.org/confluence/display/Hive/StorageHandlers
https://hortonworks.com/blog/apache-hive-druid-part-1-3/
https://github.com/apache/hive/blob/master/druid-handler/src/java/org/apache/hadoop/hive/druid/DruidStorageHandlerUtils.java
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.0/using-druid/content/druid_anatomy_of_hive_to_druid.html

We could create a Hive table backed by Druid thusly:

CREATE TABLE rainbow_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.segment.granularity" = "MONTH",
  "druid.query.granularity" = "DAY")
AS SELECT ts AS `__time`, cast(tempf as string) s_tempf, ipaddress, cast(altitude as string) s_altitude, host, diskfree FROM RAINBOW;

For second and sub-second data, we need to consider either Druid or HBase. The nice thing is that these NoSQL options also have SQL interfaces to use. It comes down to how you are going to query the data and which one you like. HBase + Phoenix is performant and has been used in production forever, and with HBase 2.x there are really impressive updates that make it a good option. For richer analytics, and some really cool analytics with Apache Superset, it's hard not to recommend Druid. Apache Druid has really been improved recently and is well integrated with Hive 3's rich querying.

Example of Our Geo Data

{"speed": "0.145", "diskfree": "4643.2 MB", "altitude": "6.2", "ts": "2018-08-30 17:47:03", "cputemp": 52.0, "latitude": "38.9789405", "track": "0.0", "memory": 26.5, "host": "rainbow", "uniqueid": "gps_uuid_20180830174705", "ipaddress": "172.20.10.8", "epd": "nan", "utc": "2018-08-30T17:47:05.000Z", "epx": "21.91", "epy": "31.536", "epv": "73.37", "ept": "0.005", "eps": "63.07", "longitude": "-74.824475167", "mode": "3", "time": "1535651225.0", "climb": "0.0", "epc": "nan"}

Hive 3 Tables

CREATE EXTERNAL TABLE IF NOT EXISTS rainbow (tempf DOUBLE, cputemp DOUBLE, pressure DOUBLE, host STRING, uniqueid STRING, ipaddress STRING, temp DOUBLE, diskfree STRING, altitude DOUBLE, ts STRING,
tempf2 DOUBLE, memory DOUBLE) STORED AS ORC LOCATION '/rainbow';

CREATE TABLE rainbowacid (tempf DOUBLE, cputemp DOUBLE, pressure DOUBLE, host STRING, uniqueid STRING, ipaddress STRING, temp DOUBLE, diskfree STRING, altitude DOUBLE, ts STRING,
tempf2 DOUBLE, memory DOUBLE) STORED AS ORC
TBLPROPERTIES ('transactional'='true');

CREATE EXTERNAL TABLE IF NOT EXISTS gps (speed STRING, diskfree STRING, altitude STRING, ts STRING, cputemp DOUBLE, latitude STRING, track STRING, memory DOUBLE, host STRING, uniqueid STRING, ipaddress STRING, epd STRING, utc STRING, epx STRING, epy STRING, epv STRING, ept STRING, eps STRING, longitude STRING, mode STRING, `time` STRING, climb STRING, epc STRING) STORED AS ORC LOCATION '/gps';

CREATE TABLE IF NOT EXISTS gpsacid (speed STRING, diskfree STRING, altitude STRING, ts STRING, cputemp DOUBLE, latitude STRING, track STRING, memory DOUBLE, host STRING, uniqueid STRING, ipaddress STRING, epd STRING, utc STRING, epx STRING, epy STRING, epv STRING, ept STRING, eps STRING, longitude STRING, mode STRING, `time` STRING, climb STRING, epc STRING) STORED AS ORC TBLPROPERTIES ('transactional'='true');

CREATE EXTERNAL TABLE IF NOT EXISTS movidiussense (label5 STRING, runtime STRING, label1 STRING, diskfree STRING, top1 STRING, starttime STRING, label2 STRING, label3 STRING, top3pct STRING, host STRING, top5pct STRING, humidity DOUBLE, currenttime STRING, roll DOUBLE, uuid STRING, label4 STRING, tempf DOUBLE, y DOUBLE, top4pct STRING, cputemp2 DOUBLE, top5 STRING, top2pct STRING, ipaddress STRING, cputemp INT, pitch DOUBLE, x DOUBLE, z DOUBLE, yaw DOUBLE, pressure DOUBLE, top3 STRING, temp DOUBLE, memory DOUBLE, top4 STRING, imagefilename STRING, top1pct STRING, top2 STRING) STORED AS ORC LOCATION '/movidiussense';

CREATE EXTERNAL TABLE IF NOT EXISTS minitensorflow2 (image STRING, ts STRING, host STRING, score STRING, human_string STRING, node_id INT) STORED AS ORC LOCATION '/minifitensorflow2';

Resources:

https://github.com/tspannhw/StrataNYC2018
https://www.geomesa.org/documentation/current/tutorials/geomesa-quickstart-nifi.html
https://github.com/cinci/rpi-sense-hat-java