05-23-2017
04:53 PM
Spacy and NLTK are for other purposes on the Tinker Board for NLP / sentiment analysis. TensorFlow is used to analyze images and figure out what each image is. The device has a camera (or can acquire images elsewhere) and can process them on the edge before you send them to your data lake. The Tinker Board, with its 2 GB of RAM and decent processor, can easily run this. A 32 GB microSD card is cheap and stores 10x what I need. These all run very fast and complete in seconds. That does not kill this box. And it's not a Raspberry Pi.
05-22-2017
11:23 PM
2 Kudos
Ingesting JMS Data into Hive
A company has a lot of data transmitting around the enterprise asynchronously with Apache ActiveMQ. They want to tap into it, convert the JSON messages coming from web servers, and store them in Hadoop. I am storing the data in Apache Phoenix / HBase via SQL. And since it's so easy, I am also storing the data as ORC files in HDFS for Apache Hive access.
Apache NiFi 1.2 generates the DDL for a Hive table for us in the hive.ddl attribute:
CREATE EXTERNAL TABLE IF NOT EXISTS meetup
(id INT, first_name STRING, last_name STRING, email STRING, ip_address STRING, company STRING, macaddress STRING, cell_phone STRING) STORED AS ORC
LOCATION '/meetup'
insert into meetup
(id , first_name , last_name , email , ip_address , company , macaddress , cell_phone)
values
(?,?,?,?,?,?,?,?)
HDF Flow
ConsumeJMS
Path 1: Store in Hadoop as ORC with a Hive Table
InferAvroSchema: get a schema from the JSON data
ConvertJSONtoAVRO: build an AVRO file from the JSON data
MergeContent: build a larger chunk of AVRO data
ConvertAvroToORC: build ORC files
PutHDFS: land in your Hadoop data lake
Path 2: Upsert into Phoenix (or any SQL database)
EvaluateJSONPath: extract fields from the JSON file
UpdateAttribute: set SQL fields
ReplaceText: create the SQL statement with ? placeholders
PutSQL: send to Phoenix through a JDBC Connection Pool
Path 3: Store Raw JSON in Hadoop
PutHDFS: Store JSON data on ingest
Path 4: Call Original REST API to Obtain Data and Send to JMS
GetHTTP: call a REST API to retrieve JSON arrays
SplitJSON: split the JSON file into individual records
PutJMS <or> PublishJMS: two ways to push messages to JMS. One uses a JMS controller and the other uses a JMS client without a controller. I should benchmark this.
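For reference, each split record presumably looks something like this (hypothetical values; the field names match the meetup table above):
{"id": 1, "first_name": "Jane", "last_name": "Doe", "email": "jdoe@example.com", "ip_address": "10.0.0.1", "company": "Example Corp", "macaddress": "00:1B:44:11:3A:B7", "cell_phone": "555-867-5309"}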
Error Message from Failed Load
If I get errors on JMS send, I send the UUID of the file to Slack for ChatOps.
Zeppelin Display of SQL Data
To check the tables I use Apache Zeppelin to query Phoenix and Hive tables.
Formatting in UpdateAttribute for SQL Arguments
To set the ? properties for the JDBC prepared statement, arguments are numbered starting from 1. The type is the JDBC type (12 is String/VARCHAR) and the value is the value of the FlowFile field. Yet another message queue ingested with no fuss.
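For example, the attributes behind the first two ? arguments of the insert above might look like this (a sketch using PutSQL's sql.args.N.type / sql.args.N.value naming; the ${id} and ${first_name} attributes are assumed to come from the earlier EvaluateJSONPath step):
sql.args.1.type = 4 (JDBC type 4 is INTEGER, for id)
sql.args.1.value = ${id}
sql.args.2.type = 12 (JDBC type 12 is VARCHAR, for first_name)
sql.args.2.value = ${first_name}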
05-22-2017
08:16 PM
4 Kudos
Backup Files from Hadoop
ListHDFS: set parameters; pick a high-level directory and work down. Configuration: /etc/hadoop/conf/core-site.xml
FetchHDFS: ${path}/${filename}
PutFile: store in a local backup
Backup Hive Tables
SelectHiveQL: output format AVRO, with SQL: select * from beaconstatus
ConvertAVROtoORC: generic for all the tables
UpdateAttribute: tablename = ${hive.ddl:substringAfter('CREATE EXTERNAL TABLE IF NOT EXISTS '):substringBefore(' (')} (a worked example of this expression follows the listing below)
PutFile: use replace directories and create missing directories, with directory: /Volumes/Transcend/HadoopBackups/hive/${tablename}
For Phoenix tables, I use the same ConvertAvroToORC, UpdateAttribute and PutFile boxes and just add ExecuteSQL to ingest the Phoenix data. For every new table, I add one box and link it to ConvertAvroToORC. Done! This is enough of a backup that if I need to rebuild and refill my development cluster, I can do so easily. I also have it scheduled once a day to rewrite everything. This is Not For Production or Extremely Large Data! It works great for a development cluster or personal dev cluster. You can easily back up files by ingesting them with GetFile, and other things can be backed up by calling ExecuteStreamCommand.
Local File Storage of Backed up Data
drwxr-xr-x 3 tspann staff 102 May 20 23:00 any_data_trials2
drwxr-xr-x 3 tspann staff 102 May 20 22:59 any_data_meetup
drwxr-xr-x 3 tspann staff 102 May 20 22:59 any_data_ibeacon
drwxr-xr-x 3 tspann staff 102 May 20 22:57 any_data_gpsweather
drwxr-xr-x 3 tspann staff 102 May 20 10:53 any_data_beaconstatus
drwxr-xr-x 3 tspann staff 102 May 20 10:52 any_data_beacongateway
drwxr-xr-x 3 tspann staff 102 May 19 17:36 any_data_atweetshive2
drwxr-xr-x 3 tspann staff 102 May 19 17:31 any_data_atweetshive
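Here is the worked example of that tablename expression: given a hypothetical hive.ddl value of
CREATE EXTERNAL TABLE IF NOT EXISTS beaconstatus (uuid STRING, status STRING) STORED AS ORC
substringAfter('CREATE EXTERNAL TABLE IF NOT EXISTS ') leaves "beaconstatus (uuid STRING, status STRING) STORED AS ORC", and substringBefore(' (') then trims that down to just beaconstatus.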
Other Tools to Extract Data
Show tables gets your list, and then you can grab all the DDL for the Hive tables. ddl.sql:
show create table atweetshive;
show create table atweetshive2;
show create table beacongateway;
show create table beaconstatus;
show create table dronedata;
show create table gps;
show create table gpsweather;
show create table ibeacon;
show create table meetup;
show create table trials2;
Hive Script to Export Table DDL
beeline -u jdbc:hive2://myhiveserverthrift:10000/default --color=false --showHeader=false --verbose=false --silent=true --outputformat=csv -f ddl.sql
Backup Zeppelin Notebooks in Bulk
tar -cvf notebooks.tar /usr/hdp/current/zeppelin-server/notebook/
gzip -9 notebooks.tar
scp userid@pservername:/opt/demo/notebooks.tar.gz .
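To restore on a rebuilt server, the reverse direction is roughly (a sketch; tar strips the leading / on create, so extracting from / puts the notebooks back in place):
scp notebooks.tar.gz userid@pservername:/opt/demo/
ssh userid@pservername
gunzip /opt/demo/notebooks.tar.gz
tar -xvf /opt/demo/notebooks.tar -C /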
05-21-2017
02:59 PM
6 Kudos
Raspberry Pi Killer?
Nope, but this device has twice the RAM and a bit more performance. It's mostly compatible with the Pi, but not fully. It is very new and has little ecosystem, but it can get the job done.
Device Setup
It is easy to install Python and all the libraries required for IoT and some deep learning. I found that most instructions written for the Raspberry Pi worked on this device. It has more RAM, which helps with some of these activities. I downloaded a MicroSD image of TinkerOS_Debian V1.8 (Beta version) and burned it with Etcher. It's a Debian variant close enough to Raspbian for most IoT developers and users to be comfortable. An Android OS is now available for download as well, and that may be worth trying. I am wondering if Google will add this device to the AndroidThings supported devices? Perhaps. One quirk, make sure you remember this: the TinkerOS default username is "linaro" and the password is "linaro". Connect to the device via ssh linaro@SOMEIP.
Python Setup
sudo apt-get update
sudo apt-get install cmake gcc g++ libxml2 libxml2-* leveldb*
sudo apt-get install python-dev python3-dev
sudo apt-get install python-setuptools
sudo apt-get install python3-setuptools
pip install twython
pip install numpy
pip install wheel
pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose
sudo pip install -U nltk
python
import nltk
nltk.download()
quit()
pip install -U spacy
python -m spacy.en.download all
sudo python -m textblob.download_corpora
# TensorFlow
wget https://github.com/samjabrahams/tensorflow-on-raspberry-pi/releases/download/v1.0.1/tensorflow-1.0.1-cp27-none-linux_armv7l.whl
sudo pip install tensorflow-1.0.1-cp27-none-linux_armv7l.whl
# For Python 3.4
wget https://github.com/samjabrahams/tensorflow-on-raspberry-pi/releases/download/v1.0.1/tensorflow-1.0.1-cp34-cp34m-linux_armv7l.whl
sudo pip3 install tensorflow-1.0.1-cp34-cp34m-linux_armv7l.whl
# For Python 2.7
sudo pip uninstall mock
sudo pip install mock
# For Python 3.4
sudo pip3 uninstall mock
sudo pip3 install mock
sudo apt-get install git
git clone https://github.com/tensorflow/tensorflow.git
# PAHO for MQTT
pip install paho-mqtt
# Flask for Web Apps
pip install flask
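Once everything is installed, a quick import check confirms the stack is usable (a minimal sketch; the version attributes are standard for these libraries):
python
import tensorflow as tf
import nltk
import spacy
print('tensorflow', tf.__version__) # expect 1.0.1 from the wheel above
print('nltk', nltk.__version__)
quit()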
Python 2.7 and 3.4 both work fine on this device, and I was also able to install the major NLP libraries, including SpaCy and NLTK. TensorFlow installed using the Raspberry Pi build and ran without incident. I believe it's a bit faster than the RPI version; I will have to run some tests on that.
Run The Regular TensorFlow Inception V3 Demo
python -W ignore /tensorflow/models/tutorials/image/imagenet/classify_image.py --image_file /opt/demo/tensorflow/TimSpann.jpg
I hacked that version to add code that sends the results to MQTT, so I could process them with most IoT hubs and Apache NiFi with ease. JSON is a very simple format to work with.
Custom Python to Call NiFi
# .... imports
import paho.mqtt.client as paho
import os
import json
# .... later in the code
top_k = predictions.argsort()[-FLAGS.num_top_predictions:][::-1]
for node_id in top_k:
    human_string = node_lookup.id_to_string(node_id)
    score = predictions[node_id]
    print('==> %s (score = %.5f)' % (human_string, score))
    row = [ { 'human_string': str(human_string), 'score': str(score)} ]
    json_string = json.dumps(row)
    client = paho.Client()
    client.connect("192.168.1.151", 1883, 60)
    client.publish("tinker1", payload=json_string, qos=0, retain=True)
NiFi Ingest
Ingesting MQTT is easy, and again that's our choice from the Tinker Board. I have formatted the TensorFlow data as JSON, and we quickly ingest it and drop it to a file. We could do anything with this flow file, including storing it in Hadoop, Hive, Phoenix or HBase, sending it to Kafka, or transforming it. So now we have yet another platform that can be used for IoT and basic Deep Learning and NLP, all enabled by a small, fast Linux device that runs Python. Enjoy your SBC! I am hoping that they add hats, a hard drive and some other ASUS accessories. Making your own mini Debian laptop would be cool. The next device I am looking at is NVIDIA's powerful GPU SBCs. There are a couple of options, from 192 GPU cores up to 256, with smoking high-end specs.
Example Data
[{"score": "0.034884", "human_string": "neck brace"}]
Downloads
ASUS SBC Download
TinkerBoard FAQ
Scripts
Modified TensorFlow example /models/tutorials/image/imagenet/classify_image.py
05-12-2017
06:30 PM
Minio should probably be kept in the same region, as those region settings are Amazon S3 specific. Minio runs on one machine and is not part of the AWS infrastructure.
05-09-2017
01:45 PM
5 Kudos
See Part 1: https://community.hortonworks.com/articles/101679/iot-ingesting-gps-data-from-raspberry-pi-zero-wire.html
Augment Your Streaming Ingested Data
So with Hortonworks Data Flow: Flow Processing, I can augment any stream of data in motion. My GPS data stream from my RPIWZ is returning latitude and longitude. That is valuable data that can be input to a lot of different APIs. One that seemed relevant to me was the current weather conditions where the device is. You could also do some predictive analytics to figure out where you might be and check the current weather conditions, or a weather forecast, depending on how long it will take you to get there.
A GPS gives us latitude and longitude that can be used as input to many kinds of APIs. One use case that works for a moving device is checking the current weather conditions where you are. Weather Underground has a REST API call that will convert Lat/Long to City/State (there are other APIs that will do this as well; please post them). Once I have City and State, I use them to get the current weather conditions where this device is.
Apache NiFi Flow
Augment Live Data
EvaluateJsonPath: create two variables (latitude and longitude) by extracting JSON fields from the flow file sent via MQTT. $.latitude and $.longitude
InvokeHTTP: http://api.wunderground.com/api/TheOneKey/geolookup/q/${latitude},${longitude}.json
EvaluateJsonPath: create two variables (city and state) by extracting JSON fields from the flow file returned by the REST call. $.location.city and $.location.state
InvokeHTTP: http://api.wunderground.com/api/OneKeyToRule/conditions/q/${state}/${city:urlEncode()}.json (City has spaces and perhaps other HTTP GET unfriendly data, hence urlEncode.)
EvaluateJsonPath: extract all the weather fields I am interested in, such as $.current_observation.dewpoint_string.
UpdateAttribute: a few field names Avro schemas do not like; ${'Last-Modified'} I convert to lastModified.
AttributesToJSON: convert just the fields I want into a new flow file. Set the destination property to flowfile-content. Attributes list:
temp_c,temp_f,wind_mph,wind_degrees,temperature_string,weather,feelslike_string,state,longitude,windchill_string,wind_string,relative_humidity,pressure_mb,observation_time,city,windchill_f,windchill_c,latitude,precip_today_string,lastModified,dewpoint_string,heat_index_string,visibility_mi
InferAvroSchema: let NiFi decide what this AVRO record should look like. I will add a part three using the Schema Registry and Apache NiFi 1.2 that will use a schema lookup.
ConvertJSONtoAvro: and make it SNAPPY!
MergeContent: as AVRO.
ConvertAvroToORC: ORC is the optimal choice for Hive LLAP and Spark SQL to query the data.
PutHDFS: point NiFi to your Hadoop configuration file /etc/hadoop/conf/core-site.xml. I like to set replace for conflict resolution. Set your directory, and make sure you created the directory and NiFi has permission to write there.
The quickest way is to log into a machine with an HDFS client:
su hdfs
hdfs dfs -mkdir -p /rpwz/gpsweather
hdfs dfs -chmod -R 777 /rpwz
Enhancement Ideas:
Limit the # of calls to Weather Underground; try other weather services like NOAA and Weather Source. You may want to try other nations' weather APIs as well.
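One way to enforce that limit inside the flow itself (my suggestion, not part of the original flow) is to put a ControlRate processor in front of the geolookup InvokeHTTP, with settings along the lines of:
Rate Control Criteria = flowfile count
Maximum Rate = 1
Time Duration = 15 min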
Create an External Hive Table and Display the Data (Zeppelin Can Do That)
Zeppelin lets me create the table with the DDL produced by the stream.
Then I can query it easily.
%jdbc(hive)
CREATE EXTERNAL TABLE IF NOT EXISTS gpsweather (observation_time STRING, dewpoint_string STRING, city STRING, windchill_f STRING, windchill_c STRING, latitude STRING, precip_today_string STRING, temp_c STRING, temp_f STRING, windchill_string STRING, wind_mph STRING, wind_degrees STRING, temperature_string STRING, weather STRING, feelslike_string STRING, wind_string STRING, heat_index_string STRING, state STRING, lastModified STRING, relative_humidity STRING, pressure_mb STRING, visibility_mi STRING, longitude STRING)
STORED AS ORC LOCATION '/rpwz/gpsweather'
%jdbc(hive)
select * from gpsweather
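For a quick chart, a narrower query works well (my pick of columns; all of them come from the DDL above):
%jdbc(hive)
select city, state, temp_f, relative_humidity, weather from gpsweather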
Reference:
http://www.wunderground.com/
https://openweathermap.org/forecast5
https://developers.google.com/maps/
Note:
Weather Underground has a free API tier that allows up to a certain # of calls. You will easily blast past that if you are tracking live objects, even if you decide to only enrich when points change or at 15 minute intervals.
For testing, you may want to store Weather Underground results in JSON files in a directory and use GetFile to stand in for those calls. The same goes for data from your devices. Download the files from Data Provenance and you can use them as stand-ins for live data for integration testing, especially when you may be on a plane or somewhere you cannot reach your device.
05-08-2017
07:23 PM
8 Kudos
Where in the world is Tim Spann, The NiFi Guy? I recommend the BU-353-S4 USB GPS; it works well with Raspberry Pis and is very affordable. Connecting this to an RPIWZ, I can run it on a small battery and bring it everywhere for location tracking. Put it on your delivery truck, chassis, train, plane, drone, robot and more. I'll track myself so I can own the data.
What do these GPS Fields Mean?
EPS = Error estimate in meters/second
EPX = Estimated longitude error in meters
EPV = Estimated vertical error in meters
EPT = Estimated timestamp error
Speed = Speed!!!
Climb = Climb (positive) or sink (negative) rate in meters per second of upwards or downwards movement
Track = Course over ground in degrees from True North
Mode = NMEA mode; values are 0 = NA, 1 = No Fix, 2 = 2D fix, 3 = 3D fix
My question is: if you already have an estimate of the error, do something about it!!!
Python Walk Through
First install the utilities you need for GPS and Python. We also install NTP to get time as accurate as possible.
sudo apt-get install gpsd gpsd-clients python-gps ntp
For testing, to make sure everything works, try two of these GPS utilities. Make sure you have the USB plugged in; you will need an RPI Zero adapter to convert from micro USB to full size. I then connect a small USB hub for the GPS unit as well as, sometimes, a mouse and keyboard. Get one of these, you will need it. You will also need an adapter from mini to full size HDMI. You only really need the mouse, keyboard and monitor while you are first setting up WiFi. Once that's set up, just SSH into your device and forget it.
cgps
gpsmon
gpxlogger dumps XML data in GPX format.
gpspipe -l -o test.json -p -w -n 10 (without -o, output goes to STDOUT)
These will work from the command line and give you a readout. It will take a few seconds, or maybe a minute the first time, to calibrate. If you don't get any numbers, stick your GPS on a window or put it outside. If you have to manually run the GPS daemon: gpsd -n -D 2 /dev/ttyUSB0
I found some code to read the GPS sensor over USB with Python. From there I modified the code for a slower refresh and no UI, as I want to just send this data to Apache NiFi over MQTT using the Eclipse Paho MQTT client. One enhancement I have considered is an offline mode to save all the data in a buffer and then mass send it on reconnect. You could also search for other WiFi signals and try to use open and free ones. You probably want to then add SQL, encryption and some other controls. Or you could install and use the Apache MiNiFi Java or C++ agent on the Zero.
#! /usr/bin/python
# Based on
# Written by Dan Mandle http://dan.mandle.me September 2012
# License: GPL 2.0
import os
from gps import *
from time import *
import time
import threading
import json
import paho.mqtt.client as paho
gpsd = None
class GpsPoller(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        global gpsd # bring it in scope
        gpsd = gps(mode=WATCH_ENABLE) # starting the stream of info
        self.current_value = None
        self.running = True # setting the thread running to true

    def run(self):
        global gpsd
        while gpsp.running:
            gpsd.next() # this will continue to loop and grab EACH set of gpsd info to clear the buffer

if __name__ == '__main__':
    gpsp = GpsPoller() # create the thread
    try:
        gpsp.start() # start it up
        while True:
            if gpsd.fix.latitude > 0:
                row = [ { 'latitude': str(gpsd.fix.latitude),
                          'longitude': str(gpsd.fix.longitude),
                          'utc': str(gpsd.utc),
                          'time': str(gpsd.fix.time),
                          'altitude': str(gpsd.fix.altitude),
                          'eps': str(gpsd.fix.eps),
                          'epx': str(gpsd.fix.epx),
                          'epv': str(gpsd.fix.epv),
                          'ept': str(gpsd.fix.ept),
                          'speed': str(gpsd.fix.speed),
                          'climb': str(gpsd.fix.climb),
                          'track': str(gpsd.fix.track),
                          'mode': str(gpsd.fix.mode)} ]
                json_string = json.dumps(row)
                client = paho.Client()
                client.username_pw_set("jrfcwrim","UhBGemEoqf0D")
                client.connect("m13.cloudmqtt.com", 14162, 60)
                client.publish("rpiwzgps", payload=json_string, qos=0, retain=True)
            time.sleep(60)
    except (KeyboardInterrupt, SystemExit): # when you press ctrl+c
        gpsp.running = False
        gpsp.join() # wait for the thread to finish what it's doing
Example JSON Data {"track": "0.0", "speed": "0.0", "utc": "2017-05-01T23:49:46.000Z", "epx": "8.938", "epv": "29.794", "altitude": "40.742", "eps": "23.66", "longitude": "-74.529216408", "mode": "3", "time": "2017-05-01T23:49:46.000Z", "latitude": "40.268141521", "climb": "0.0", "ept": "0.005"} Ingest Any Data Any Where Any Time Any Dimension in Time and Space
1. ConsumeMQTT 2. InferAvroSchema 3. ConvertJSONtoAvro 4. MergeContent 5. ConvertAvroToORC 6. PutHDFS
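A plausible ConsumeMQTT configuration for this feed (a sketch; the property names are ConsumeMQTT's, the broker and topic come from the Python publisher above, and the client ID is made up):
Broker URI = tcp://m13.cloudmqtt.com:14162
Client ID = rpiwz-gps-ingest
Topic Filter = rpiwzgps
Max Queue Size = 1000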
Visualize Whirled Peas
We turn this raw data into Hive tables and then visualize into pretty tables and charts with Apache Zeppelin. You could also report with any ODBC and JDBC reporting tool like Tableau or PowerBI.
Store It Where?
su hdfs
hdfs dfs -mkdir -p /rpwz/gps
hdfs dfs -chmod -R 777 /rpwz/gps
If you store it you can query it!
Why Do I Love Apache NiFi? Let me count the ways... Instead of hand-rolling some Hive DDL, NiFi will automagically generate all the DDL I need based on an inferred AVRO schema (soon using a Schema Registry lookup!!!). So you can easily drop in a file, convert it to ORC, save it to HDFS, generate an external Hive table and query it in seconds. All with no coding. It is very easy to wire this up to send a message to a front end via WebSockets, JMS, AJAX, ... So we can drop a file in S3 or HDFS, convert it to ORC for mega fast LLAP queries, and tell a front end what the table is so it can query it.
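For this GPS feed, the generated DDL would come out roughly like this (a sketch built from the JSON fields above; the types NiFi infers may differ):
CREATE EXTERNAL TABLE IF NOT EXISTS gps (track STRING, speed STRING, utc STRING, epx STRING, epv STRING, altitude STRING, eps STRING, longitude STRING, mode STRING, time STRING, latitude STRING, climb STRING, ept STRING)
STORED AS ORC LOCATION '/rpwz/gps'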
References:
http://www.danmandle.com/blog/getting-gpsd-to-work-with-python/
http://www.d.umn.edu/~and02868/projects.html
http://kingtidesailing.blogspot.com/2016/02/connect-globalsat-bu-353-usb-gps-to.html
http://blog.petrilopia.net/linux/raspberry-pi-gps-working/
https://community.hortonworks.com/articles/72420/ingesting-remote-sensor-feeds-into-apache-phoenix.html
https://www.systutorials.com/docs/linux/man/1-gpspipe/
https://www.systutorials.com/docs/linux/man/1-gps/
http://usglobalsat.com/p-688-bu-353-s4.aspx
https://www.raspberrypi.org/products/pi-zero-w/
Source Code:
https://community.hortonworks.com/repos/101680/rpi-zero-wireless-nifi-mqtt-gps.html?shortDescriptionMaxLength=140
Quick Tip:
Utilize Apache NiFi's scheduler to limit the # of calls you make to third-party services; a single NiFi instance can easily overwhelm most free service tiers. I made 122 calls to Weather Underground in a few seconds. So set those timers! For weather, once every 15 or 30 minutes, or even 1 hour, is good. Note: GPS information can also be read from drones, cars, phones and lots of custom sensors in IIoT devices.
05-08-2017
05:40 PM
The complete code is in the Zeppelin Spark Python notebook referenced here: https://github.com/zaratsian/PySpark/blob/master/text_analytics_datadriven_topics.json
05-05-2017
05:00 PM
Traceback (most recent call last):
File "testhive.py", line 1, in <module>
from pyhive import hive
File "/usr/local/lib/python2.7/site-packages/pyhive/hive.py", line 10, in <module>
from TCLIService import TCLIService
File "/usr/local/lib/python2.7/site-packages/TCLIService/TCLIService.py", line 9, in <module>
from thrift.Thrift import TType, TMessageType, TFrozenDict, TException, TApplicationException
ImportError: No module named thrift.Thrift
Any ideas?