05-23-2017
04:53 PM
Spacy and NLTK are for other purposes on the Tinker Board for NLP / sentiment analysis. TensorFlow is used to analyze images and figure out what each image is. The device has a camera (or can acquire images elsewhere) and can process them on the edge before you send them to your data lake. The Tinker Board, with its 2 GB of RAM and decent processor, can easily run this. A 32 GB microSD card is cheap and stores 10x what I need. These all run very fast and complete in seconds. That does not kill this box. And it's not a Raspberry Pi.
05-22-2017
11:23 PM
2 Kudos
Ingesting JMS Data into Hive
A company has a lot of data transmitting around the enterprise asynchronously with Apache ActiveMQ. They want to tap into it, convert the JSON messages coming from web servers, and store them in Hadoop. I am storing the data in Apache Phoenix / HBase via SQL. And since it's so easy, I am also storing the data as ORC files in HDFS for Apache Hive access.
Apache NiFi 1.2 generates the DDL for a Hive table for us in the hive.ddl attribute:
CREATE EXTERNAL TABLE IF NOT EXISTS meetup
(id INT, first_name STRING, last_name STRING, email STRING, ip_address STRING, company STRING, macaddress STRING, cell_phone STRING) STORED AS ORC
LOCATION '/meetup'
insert into meetup
(id , first_name , last_name , email , ip_address , company , macaddress , cell_phone)
values
(?,?,?,?,?,?,?,?)
HDF Flow
ConsumeJMS
Path 1: Store in Hadoop as ORC with a Hive Table
InferAvroSchema: get a schema from the JSON data
ConvertJSONtoAVRO: build an AVRO file from the JSON data
MergeContent: build a larger chunk of AVRO data
ConvertAvroToORC: build ORC files
PutHDFS: land in your Hadoop data lake
Path 2: Upsert into Phoenix (or any SQL database)
EvaluateJSONPath: extract fields from the JSON file
UpdateAttribute: set SQL fields
ReplaceText: create the SQL statement with ? placeholders
PutSQL: send to Phoenix through a JDBC Connection Pool
Path 3: Store Raw JSON in Hadoop
PutHDFS: Store JSON data on ingest
Path 4: Call Original REST API to Obtain Data and Send to JMS
GetHTTP: call a REST API to retrieve JSON arrays
SplitJSON: split the JSON file into individual records
PutJMS <or> PublishJMS: two ways to push messages to JMS. One uses a JMS controller and the other uses a JMS client without a controller. I should benchmark this.
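For reference, each split record presumably looks something like this (hypothetical values; the field names match the meetup table above):
{"id": 1, "first_name": "Jane", "last_name": "Doe", "email": "jdoe@example.com", "ip_address": "10.0.0.1", "company": "Example Corp", "macaddress": "00:1B:44:11:3A:B7", "cell_phone": "555-867-5309"}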
Error Message from Failed Load
If I get errors on JMS send, I send the UUID of the file to Slack for ChatOps.
Zeppelin Display of SQL Data
To check the tables I use Apache Zeppelin to query Phoenix and Hive tables.
Formatting in UpdateAttribute for SQL Arguments
To set the ? properties for the JDBC prepared statement, arguments are numbered starting from 1. The type is the JDBC type (12 is String/VARCHAR) and the value is the value of the FlowFile field. Yet another message queue ingested with no fuss.
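For example, the attributes behind the first two ? arguments of the insert above might look like this (a sketch using PutSQL's sql.args.N.type / sql.args.N.value naming; the ${id} and ${first_name} attributes are assumed to come from the earlier EvaluateJSONPath step):
sql.args.1.type = 4 (JDBC type 4 is INTEGER, for id)
sql.args.1.value = ${id}
sql.args.2.type = 12 (JDBC type 12 is VARCHAR, for first_name)
sql.args.2.value = ${first_name}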
05-22-2017
08:16 PM
4 Kudos
Backup Files from Hadoop
ListHDFS: set parameters; pick a high-level directory and work down. Configuration: /etc/hadoop/conf/core-site.xml
FetchHDFS: ${path}/${filename}
PutFile: store in a local backup
Backup Hive Tables
SelectHiveQL: output format AVRO, with SQL: select * from beaconstatus
ConvertAVROtoORC: generic for all the tables
UpdateAttribute: tablename = ${hive.ddl:substringAfter('CREATE EXTERNAL TABLE IF NOT EXISTS '):substringBefore(' (')} (a worked example of this expression follows the listing below)
PutFile: use replace directories and create missing directories, with directory: /Volumes/Transcend/HadoopBackups/hive/${tablename}
For Phoenix tables, I use the same ConvertAvroToORC, UpdateAttribute and PutFile boxes and just add ExecuteSQL to ingest the Phoenix data. For every new table, I add one box and link it to ConvertAvroToORC. Done! This is enough of a backup that if I need to rebuild and refill my development cluster, I can do so easily. I also have it scheduled once a day to rewrite everything. This is Not For Production or Extremely Large Data! It works great for a development cluster or personal dev cluster. You can easily back up files by ingesting them with GetFile, and other things can be backed up by calling ExecuteStreamCommand.
Local File Storage of Backed up Data
drwxr-xr-x 3 tspann staff 102 May 20 23:00 any_data_trials2
drwxr-xr-x 3 tspann staff 102 May 20 22:59 any_data_meetup
drwxr-xr-x 3 tspann staff 102 May 20 22:59 any_data_ibeacon
drwxr-xr-x 3 tspann staff 102 May 20 22:57 any_data_gpsweather
drwxr-xr-x 3 tspann staff 102 May 20 10:53 any_data_beaconstatus
drwxr-xr-x 3 tspann staff 102 May 20 10:52 any_data_beacongateway
drwxr-xr-x 3 tspann staff 102 May 19 17:36 any_data_atweetshive2
drwxr-xr-x 3 tspann staff 102 May 19 17:31 any_data_atweetshive
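Here is the worked example of that tablename expression: given a hypothetical hive.ddl value of
CREATE EXTERNAL TABLE IF NOT EXISTS beaconstatus (uuid STRING, status STRING) STORED AS ORC
substringAfter('CREATE EXTERNAL TABLE IF NOT EXISTS ') leaves "beaconstatus (uuid STRING, status STRING) STORED AS ORC", and substringBefore(' (') then trims that down to just beaconstatus.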
Other Tools to Extract Data
Show tables gets your list, and then you can grab all the DDL for the Hive tables. ddl.sql:
show create table atweetshive;
show create table atweetshive2;
show create table beacongateway;
show create table beaconstatus;
show create table dronedata;
show create table gps;
show create table gpsweather;
show create table ibeacon;
show create table meetup;
show create table trials2;
Hive Script to Export Table DDL
beeline -u jdbc:hive2://myhiveserverthrift:10000/default --color=false --showHeader=false --verbose=false --silent=true --outputformat=csv -f ddl.sql
Backup Zeppelin Notebooks in Bulk
tar -cvf notebooks.tar /usr/hdp/current/zeppelin-server/notebook/
gzip -9 notebooks.tar
scp userid@pservername:/opt/demo/notebooks.tar.gz .
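To restore on a rebuilt server, the reverse direction is roughly (a sketch; tar strips the leading / on create, so extracting from / puts the notebooks back in place):
scp notebooks.tar.gz userid@pservername:/opt/demo/
ssh userid@pservername
gunzip /opt/demo/notebooks.tar.gz
tar -xvf /opt/demo/notebooks.tar -C /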
05-21-2017
02:59 PM
6 Kudos
Raspberry Pi Killer?
Nope, but this device has twice the RAM and a bit more performance. It's mostly compatible with the Pi, but not fully. It is very new and has little ecosystem, but it can get the job done.
Device Setup
It is easy to install Python and all the libraries required for IoT and some deep learning. I found that most instructions written for the Raspberry Pi worked on this device. It has more RAM, which helps with some of these activities. I downloaded a MicroSD image of TinkerOS_Debian V1.8 (Beta version) and burned it with Etcher. It's a Debian variant close enough to Raspbian for most IoT developers and users to be comfortable. An Android OS is now available for download as well, and that may be worth trying. I am wondering if Google will add this device to the AndroidThings supported devices? Perhaps. One quirk, make sure you remember this: the TinkerOS default username is "linaro" and the password is "linaro". Connect to the device via ssh linaro@SOMEIP.
Python Setup
sudo apt-get update
sudo apt-get install cmake gcc g++ libxml2 libxml2-* leveldb*
sudo apt-get install python-dev python3-dev
sudo apt-get install python-setuptools
sudo apt-get install python3-setuptools
pip install twython
pip install numpy
pip install wheel
pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose
sudo pip install -U nltk
python
import nltk
nltk.download()
quit()
pip install -U spacy
python -m spacy.en.download all
sudo python -m textblob.download_corpora
# TensorFlow
wget https://github.com/samjabrahams/tensorflow-on-raspberry-pi/releases/download/v1.0.1/tensorflow-1.0.1-cp27-none-linux_armv7l.whl
sudo pip install tensorflow-1.0.1-cp27-none-linux_armv7l.whl
# For Python 3.4
wget https://github.com/samjabrahams/tensorflow-on-raspberry-pi/releases/download/v1.0.1/tensorflow-1.0.1-cp34-cp34m-linux_armv7l.whl
sudo pip3 install tensorflow-1.0.1-cp34-cp34m-linux_armv7l.whl
# For Python 2.7
sudo pip uninstall mock
sudo pip install mock
# For Python 3.4
sudo pip3 uninstall mock
sudo pip3 install mock
sudo apt-get install git
git clone https://github.com/tensorflow/tensorflow.git
# PAHO for MQTT
pip install paho-mqtt
# Flask for Web Apps
pip install flask
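Once everything is installed, a quick import check confirms the stack is usable (a minimal sketch; the version attributes are standard for these libraries):
python
import tensorflow as tf
import nltk
import spacy
print('tensorflow', tf.__version__) # expect 1.0.1 from the wheel above
print('nltk', nltk.__version__)
quit()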
Python 2.7 and 3.4 both work fine on this device, and I was also able to install the major NLP libraries, including SpaCy and NLTK. TensorFlow installed using the Raspberry Pi build and ran without incident. I believe it's a bit faster than the RPI version; I will have to run some tests on that.
Run The Regular TensorFlow Inception V3 Demo
python -W ignore /tensorflow/models/tutorials/image/imagenet/classify_image.py --image_file /opt/demo/tensorflow/TimSpann.jpg
I hacked that version to add code that sends the results to MQTT, so I could process them with most IoT hubs and Apache NiFi with ease. JSON is a very simple format to work with.
Custom Python to Call NiFi
# .... imports
import paho.mqtt.client as paho
import os
import json
# .... later in the code
top_k = predictions.argsort()[-FLAGS.num_top_predictions:][::-1]
for node_id in top_k:
    human_string = node_lookup.id_to_string(node_id)
    score = predictions[node_id]
    print('==> %s (score = %.5f)' % (human_string, score))
    row = [ { 'human_string': str(human_string), 'score': str(score)} ]
    json_string = json.dumps(row)
    client = paho.Client()
    client.connect("192.168.1.151", 1883, 60)
    client.publish("tinker1", payload=json_string, qos=0, retain=True)
NiFi Ingest
Ingesting MQTT is easy, and again that's our choice from the Tinker Board. I have formatted the TensorFlow data as JSON, and we quickly ingest it and drop it to a file. We could do anything with this flow file, including storing it in Hadoop, Hive, Phoenix or HBase, sending it to Kafka, or transforming it. So now we have yet another platform that can be used for IoT and basic Deep Learning and NLP, all enabled by a small, fast Linux device that runs Python. Enjoy your SBC! I am hoping that they add hats, a hard drive and some other ASUS accessories. Making your own mini Debian laptop would be cool. The next device I am looking at is NVIDIA's powerful GPU SBCs. There are a couple of options, from 192 GPU cores up to 256, with smoking high-end specs.
Example Data
[{"score": "0.034884", "human_string": "neck brace"}]
Downloads
ASUS SBC Download
TinkerBoard FAQ
Scripts
Modified TensorFlow example /models/tutorials/image/imagenet/classify_image.py
05-12-2017
06:30 PM
Minio should probably be kept in the same region, as those region settings are Amazon S3 specific. Minio runs on one machine and is not part of the AWS infrastructure.
05-09-2017
01:45 PM
5 Kudos
See Part 1: https://community.hortonworks.com/articles/101679/iot-ingesting-gps-data-from-raspberry-pi-zero-wire.html
Augment Your Streaming Ingested Data
So with Hortonworks Data Flow: Flow Processing, I can augment any stream of data in motion. My GPS data stream from my RPIWZ is returning latitude and longitude. That is valuable data that can be input to a lot of different APIs. One that seemed relevant to me was the current weather conditions where the device is. You could also do some predictive analytics to figure out where you might be and check the current weather conditions, or a weather forecast, depending on how long it will take you to get there.
A GPS gives us latitude and longitude that can be used as input to many kinds of APIs. One use case that works for a moving device is checking the current weather conditions where you are. Weather Underground has a REST API call that will convert Lat/Long to City/State (there are other APIs that will do this as well; please post them). Once I have City and State, I use them to get the current weather conditions where this device is.
Apache NiFi Flow
Augment Live Data
EvaluateJsonPath: create two variables (latitude and longitude) by extracting JSON fields from the flow file sent via MQTT. $.latitude and $.longitude
InvokeHTTP: http://api.wunderground.com/api/TheOneKey/geolookup/q/${latitude},${longitude}.json
EvaluateJsonPath: create two variables (city and state) by extracting JSON fields from the flow file returned by the REST call. $.location.city and $.location.state
InvokeHTTP: http://api.wunderground.com/api/OneKeyToRule/conditions/q/${state}/${city:urlEncode()}.json (City has spaces and perhaps other HTTP GET unfriendly data, hence urlEncode.)
EvaluateJsonPath: extract all the weather fields I am interested in, such as $.current_observation.dewpoint_string.
UpdateAttribute: a few field names Avro schemas do not like; ${'Last-Modified'} I convert to lastModified.
AttributesToJSON: convert just the fields I want into a new flow file. Set the destination property to flowfile-content. Attributes list:
temp_c,temp_f,wind_mph,wind_degrees,temperature_string,weather,feelslike_string,state,longitude,windchill_string,wind_string,relative_humidity,pressure_mb,observation_time,city,windchill_f,windchill_c,latitude,precip_today_string,lastModified,dewpoint_string,heat_index_string,visibility_mi
InferAvroSchema: let NiFi decide what this AVRO record should look like. I will add a part three using the Schema Registry and Apache NiFi 1.2 that will use a schema lookup.
ConvertJSONtoAvro: and make it SNAPPY!
MergeContent: as AVRO.
ConvertAvroToORC: ORC is the optimal choice for Hive LLAP and Spark SQL to query the data.
PutHDFS: point NiFi to your Hadoop configuration file /etc/hadoop/conf/core-site.xml. I like to set replace for conflict resolution. Set your directory, and make sure you created the directory and NiFi has permission to write there.
The quickest way is to log into a machine with an HDFS client:
su hdfs
hdfs dfs -mkdir -p /rpwz/gpsweather
hdfs dfs -chmod -R 777 /rpwz
Enhancement Ideas:
Limit the # of calls to Weather Underground; try other weather services like NOAA and Weather Source. You may want to try other nations' weather APIs as well.
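One way to enforce that limit inside the flow itself (my suggestion, not part of the original flow) is to put a ControlRate processor in front of the geolookup InvokeHTTP, with settings along the lines of:
Rate Control Criteria = flowfile count
Maximum Rate = 1
Time Duration = 15 min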
Create an External Hive Table and Display the Data (Zeppelin Can Do That)
Zeppelin lets me create the table with the DDL produced by the stream.
Then I can query it easily.
%jdbc(hive)
CREATE EXTERNAL TABLE IF NOT EXISTS gpsweather (observation_time STRING, dewpoint_string STRING, city STRING, windchill_f STRING, windchill_c STRING, latitude STRING, precip_today_string STRING, temp_c STRING, temp_f STRING, windchill_string STRING, wind_mph STRING, wind_degrees STRING, temperature_string STRING, weather STRING, feelslike_string STRING, wind_string STRING, heat_index_string STRING, state STRING, lastModified STRING, relative_humidity STRING, pressure_mb STRING, visibility_mi STRING, longitude STRING)
STORED AS ORC LOCATION '/rpwz/gpsweather'
%jdbc(hive)
select * from gpsweather
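For a quick chart, a narrower query works well (my pick of columns; all of them come from the DDL above):
%jdbc(hive)
select city, state, temp_f, relative_humidity, weather from gpsweather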
Reference:
http://www.wunderground.com/
https://openweathermap.org/forecast5
https://developers.google.com/maps/
Note:
Weather Underground has a free API tier that allows up to a certain # of calls. You will easily blast past that if you are tracking live objects, even if you decide to only enrich when points change or at 15 minute intervals.
For testing, you may want to store Weather Underground results in JSON files in a directory and use GetFile to stand in for those calls. The same goes for data from your devices. Download the files from Data Provenance and you can use them as stand-ins for live data for integration testing, especially when you may be on a plane or somewhere you cannot reach your device.
05-08-2017
07:23 PM
8 Kudos
Where in the world is Tim Spann, The NiFi Guy? I recommend the BU-353-S4 USB GPS; it works well with Raspberry Pis and is very affordable. Connecting this to an RPIWZ, I can run it on a small battery and bring it everywhere for location tracking. Put it on your delivery truck, chassis, train, plane, drone, robot and more. I'll track myself so I can own the data.
What do these GPS Fields Mean?
EPS = Error estimate in meters/second
EPX = Estimated longitude error in meters
EPV = Estimated vertical error in meters
EPT = Estimated timestamp error
Speed = Speed!!!
Climb = Climb (positive) or sink (negative) rate in meters per second of upwards or downwards movement
Track = Course over ground in degrees from True North
Mode = NMEA mode; values are 0 = NA, 1 = No Fix, 2 = 2D fix, 3 = 3D fix
My question is: if you already have an estimate of the error, do something about it!!!
Python Walk Through
First install the utilities you need for GPS and Python. We also install NTP to get time as accurate as possible.
sudo apt-get install gpsd gpsd-clients python-gps ntp
For testing, to make sure everything works, try two of these GPS utilities. Make sure you have the USB plugged in; you will need an RPI Zero adapter to convert from micro USB to full size. I then connect a small USB hub for the GPS unit as well as, sometimes, a mouse and keyboard. Get one of these, you will need it. You will also need an adapter from mini to full size HDMI. You only really need the mouse, keyboard and monitor while you are first setting up WiFi. Once that's set up, just SSH into your device and forget it.
cgps
gpsmon
gpxlogger dumps XML data in GPX format.
gpspipe -l -o test.json -p -w -n 10 (without -o, output goes to STDOUT)
These will work from the command line and give you a readout. It will take a few seconds, or maybe a minute the first time, to calibrate. If you don't get any numbers, stick your GPS on a window or put it outside. If you have to manually run the GPS daemon: gpsd -n -D 2 /dev/ttyUSB0
I found some code to read the GPS sensor over USB with Python. From there I modified the code for a slower refresh and no UI, as I want to just send this data to Apache NiFi over MQTT using the Eclipse Paho MQTT client. One enhancement I have considered is an offline mode to save all the data in a buffer and then mass send it on reconnect. You could also search for other WiFi signals and try to use open and free ones. You probably want to then add SQL, encryption and some other controls. Or you could install and use the Apache MiNiFi Java or C++ agent on the Zero.
#! /usr/bin/python
# Based on
# Written by Dan Mandle http://dan.mandle.me September 2012
# License: GPL 2.0
import os
from gps import *
from time import *
import time
import threading
import json
import paho.mqtt.client as paho
gpsd = None
class GpsPoller(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        global gpsd # bring it in scope
        gpsd = gps(mode=WATCH_ENABLE) # starting the stream of info
        self.current_value = None
        self.running = True # setting the thread running to true

    def run(self):
        global gpsd
        while gpsp.running:
            gpsd.next() # this will continue to loop and grab EACH set of gpsd info to clear the buffer

if __name__ == '__main__':
    gpsp = GpsPoller() # create the thread
    try:
        gpsp.start() # start it up
        while True:
            if gpsd.fix.latitude > 0:
                row = [ { 'latitude': str(gpsd.fix.latitude),
                          'longitude': str(gpsd.fix.longitude),
                          'utc': str(gpsd.utc),
                          'time': str(gpsd.fix.time),
                          'altitude': str(gpsd.fix.altitude),
                          'eps': str(gpsd.fix.eps),
                          'epx': str(gpsd.fix.epx),
                          'epv': str(gpsd.fix.epv),
                          'ept': str(gpsd.fix.ept),
                          'speed': str(gpsd.fix.speed),
                          'climb': str(gpsd.fix.climb),
                          'track': str(gpsd.fix.track),
                          'mode': str(gpsd.fix.mode)} ]
                json_string = json.dumps(row)
                client = paho.Client()
                client.username_pw_set("jrfcwrim","UhBGemEoqf0D")
                client.connect("m13.cloudmqtt.com", 14162, 60)
                client.publish("rpiwzgps", payload=json_string, qos=0, retain=True)
            time.sleep(60)
    except (KeyboardInterrupt, SystemExit): # when you press ctrl+c
        gpsp.running = False
        gpsp.join() # wait for the thread to finish what it's doing
Example JSON Data {"track": "0.0", "speed": "0.0", "utc": "2017-05-01T23:49:46.000Z", "epx": "8.938", "epv": "29.794", "altitude": "40.742", "eps": "23.66", "longitude": "-74.529216408", "mode": "3", "time": "2017-05-01T23:49:46.000Z", "latitude": "40.268141521", "climb": "0.0", "ept": "0.005"} Ingest Any Data Any Where Any Time Any Dimension in Time and Space
1. ConsumeMQTT 2. InferAvroSchema 3. ConvertJSONtoAvro 4. MergeContent 5. ConvertAvroToORC 6. PutHDFS
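A plausible ConsumeMQTT configuration for this feed (a sketch; the property names are ConsumeMQTT's, the broker and topic come from the Python publisher above, and the client ID is made up):
Broker URI = tcp://m13.cloudmqtt.com:14162
Client ID = rpiwz-gps-ingest
Topic Filter = rpiwzgps
Max Queue Size = 1000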
Visualize Whirled Peas
We turn this raw data into Hive tables and then visualize into pretty tables and charts with Apache Zeppelin. You could also report with any ODBC and JDBC reporting tool like Tableau or PowerBI.
Store It Where?
su hdfs
hdfs dfs -mkdir -p /rpwz/gps
hdfs dfs -chmod -R 777 /rpwz/gps
If you store it you can query it!
Why Do I Love Apache NiFi? Let me count the ways... Instead of hand-rolling some Hive DDL, NiFi will automagically generate all the DDL I need based on an inferred AVRO schema (soon using a Schema Registry lookup!!!). So you can easily drop in a file, convert it to ORC, save it to HDFS, generate an external Hive table and query it in seconds. All with no coding. It is very easy to wire this up to send a message to a front end via WebSockets, JMS, AJAX, ... So we can drop a file in S3 or HDFS, convert it to ORC for mega fast LLAP queries, and tell a front end what the table is so it can query it.
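For this GPS feed, the generated DDL would come out roughly like this (a sketch built from the JSON fields above; the types NiFi infers may differ):
CREATE EXTERNAL TABLE IF NOT EXISTS gps (track STRING, speed STRING, utc STRING, epx STRING, epv STRING, altitude STRING, eps STRING, longitude STRING, mode STRING, time STRING, latitude STRING, climb STRING, ept STRING)
STORED AS ORC LOCATION '/rpwz/gps'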
References:
http://www.danmandle.com/blog/getting-gpsd-to-work-with-python/
http://www.d.umn.edu/~and02868/projects.html
http://kingtidesailing.blogspot.com/2016/02/connect-globalsat-bu-353-usb-gps-to.html
http://blog.petrilopia.net/linux/raspberry-pi-gps-working/
https://community.hortonworks.com/articles/72420/ingesting-remote-sensor-feeds-into-apache-phoenix.html
https://www.systutorials.com/docs/linux/man/1-gpspipe/
https://www.systutorials.com/docs/linux/man/1-gps/
http://usglobalsat.com/p-688-bu-353-s4.aspx
https://www.raspberrypi.org/products/pi-zero-w/
Source Code:
https://community.hortonworks.com/repos/101680/rpi-zero-wireless-nifi-mqtt-gps.html?shortDescriptionMaxLength=140
Quick Tip:
Utilize Apache NiFi's scheduler to limit the # of calls you make to third-party services; a single NiFi instance can easily overwhelm most free service tiers. I made 122 calls to Weather Underground in a few seconds. So set those timers! For weather, once every 15 or 30 minutes, or even 1 hour, is good. Note: GPS information can also be read from drones, cars, phones and lots of custom sensors in IIoT devices.
05-08-2017
05:40 PM
The complete code is in the Zeppelin Spark Python notebook referenced here: https://github.com/zaratsian/PySpark/blob/master/text_analytics_datadriven_topics.json
05-05-2017
05:00 PM
Traceback (most recent call last):
File "testhive.py", line 1, in <module>
from pyhive import hive
File "/usr/local/lib/python2.7/site-packages/pyhive/hive.py", line 10, in <module>
from TCLIService import TCLIService
File "/usr/local/lib/python2.7/site-packages/TCLIService/TCLIService.py", line 9, in <module>
from thrift.Thrift import TType, TMessageType, TFrozenDict, TException, TApplicationException
ImportError: No module named thrift.Thrift
Any ideas?