10-22-2019
03:23 AM
Edge Data Processing with Jetson Nano Part 3 - AI Integration
Source: https://github.com/tspannhw/minifi-jetson-nano
Part 1: https://www.datainmotion.dev/2019/07/edge-data-processing-with-jetson-nano.html
Part 2: https://www.datainmotion.dev/2019/07/edge-processing-with-jetson-nano-part-2.html
Cloudera Edge Management
Top Level NiFi Flow Receiving MiNiFi Agent Messages
Overview of our Apache NiFi Flow For Processing
We format a new flow file as JSON to send to the CDSW Job Environment
We Run Apache MXNet 1.3.1 (Java) SSD Against the Web Camera Image
Extract The Values From the FlowFile to Send to the Spark Job
Our JSON Results From the Logs
Log data has arrived; consistent JSON rows are grabbed as they are written to the file
We can see the results of the Spark Job in Cloudera Data Science Workbench (CDSW)
We can also see the messages that we sent to Slack
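The flow above calls the MXNet 1.3.1 Java SSD API from inside NiFi. For experimenting with that step outside the flow, here is a rough Python stand-in using GluonCV's pretrained SSD; the model name, image path and confidence threshold are illustrative assumptions, not the values the flow actually uses.

# Rough Python stand-in for the SSD step (the flow itself uses the MXNet Java API).
import json
from gluoncv import model_zoo, data

# Pretrained SSD model; the flow's exact model/weights are not specified here.
net = model_zoo.get_model("ssd_512_resnet50_v1_voc", pretrained=True)

# Load and preprocess a webcam capture (path is an assumption).
x, img = data.transforms.presets.ssd.load_test("webcam_capture.jpg", short=512)

class_ids, scores, boxes = net(x)

# Emit a small JSON record, similar in spirit to what the flow sends onward.
detections = []
for i in range(class_ids.shape[1]):
    score = float(scores[0, i].asscalar())
    if score < 0.5:  # illustrative confidence threshold
        continue
    cls = int(class_ids[0, i].asscalar())
    detections.append({"label": net.classes[cls], "score": round(score, 3)})

print(json.dumps({"detections": detections}))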
10-09-2019
08:45 AM
1 Kudo
Also, Cloudera provides tools for Hive and other data replication as part of CDP.
10-09-2019
08:44 AM
1 Kudo
Cloudera Streams Replication Manager with MirrorMaker 2 solves this easily. Apache NiFi could also do this in a dual-ingest fashion, but SRM is a no-brainer: faster, automatic, Active-Active replication with full monitoring. https://blog.cloudera.com/announcing-the-general-availability-of-cloudera-streams-management/
10-04-2019
03:12 PM
Source Code: https://github.com/tspannhw/flume-to-nifi

Consume / Publish Kafka and Store to Files, HDFS, Hive 3.1, Kudu
Consume Kafka Flow
Merge Records and Store as AVRO or ORC
Consume Kafka, Update Records via Machine Learning Models in CDSW and Store to Kudu

Source: Apache Kafka Topics
You enter a few parameters and start ingesting data, with or without schemas. Apache Flume had no schema support, and Flume did not support transactions.

Sink: Files
Storing to files in file systems, object stores, SFTP or elsewhere could not be easier. Choose S3, local file system, SFTP, HDFS or wherever.

Sink: Apache Kudu / Apache Impala
Storing to Kudu/Impala (or Parquet, for that matter) could not be easier with Apache NiFi.

Sink: HDFS for Apache ORC Files
When the flow completes, ConvertAvroToORC and PutHDFS build the Hive DDL for you! You can build the tables automagically with Apache NiFi if you wish.

CREATE EXTERNAL TABLE IF NOT EXISTS iotsensors
(sensor_id BIGINT, sensor_ts BIGINT, is_healthy STRING, response STRING,
 sensor_0 BIGINT, sensor_1 BIGINT, sensor_2 BIGINT, sensor_3 BIGINT,
 sensor_4 BIGINT, sensor_5 BIGINT, sensor_6 BIGINT, sensor_7 BIGINT,
 sensor_8 BIGINT, sensor_9 BIGINT, sensor_10 BIGINT, sensor_11 BIGINT)
STORED AS ORC
LOCATION '/tmp/iotsensors'

Sink: Kafka
Publishing to Kafka is just as easy! Push records with schema references or raw data, AVRO or JSON, whatever makes sense for your enterprise. Write data easily with no coding and no changes or redeploys for schema or schema-version changes. Pick a topic and stream data while converting types. Clean UI and REST API to manage, monitor, configure and notify on Kafka.

Other Reasons to Use Apache NiFi Over Apache Flume
DevOps with REST API, CLI, Python API (a minimal example follows below): https://community.cloudera.com/t5/Community-Articles/More-DevOps-for-HDF-Apache-NiFi-Registry-and-Friends/ta-p/248668
Schemas! We not only work with semi-structured, structured and unstructured data; we are schema and schema-version aware for CSV, JSON, AVRO, XML, grokked text files and more. https://community.cloudera.com/t5/Community-Articles/Big-Data-DevOps-Apache-NiFi-HWX-Schema-Registry-Schema/ta-p/247963

Flume Replacement Use Cases Implemented in Apache NiFi
Sink/Source: JMS
https://community.cloudera.com/t5/Community-Articles/Publishing-and-Consuming-JMS-Messages-from-Tibco-Enterprise/ta-p/248157
Source: Files/PDF/PowerPoint/Excel/Word, Sink: Files
https://community.cloudera.com/t5/Community-Articles/Parsing-Any-Document-with-Apache-NiFi-1-5-with-Apache-Tika/ta-p/247672
https://community.cloudera.com/t5/Community-Articles/Converting-PowerPoint-Presentations-into-French-from-English/ta-p/248974
https://community.cloudera.com/t5/Community-Articles/Creating-HTML-from-PDF-Excel-and-Word-Documents-using-Apache/ta-p/247968
Source: Files/CSV, Sink: HDFS/Hive/Apache ORC
https://community.cloudera.com/t5/Community-Articles/Converting-CSV-Files-to-Apache-Hive-Tables-with-Apache-ORC/ta-p/248258
Source: REST/Files/Simulator, Sink: HBase, Files, HDFS. ETL with lookups.
https://community.cloudera.com/t5/Community-Articles/ETL-With-Lookups-with-Apache-HBase-and-Apache-NiFi/ta-p/248243

Flume Replacement - Lightweight Open Source Agents
If you need to replace local log-to-Kafka agents, or anything-to-Kafka, or anything-to-anything with routing, transformation and manipulation, you can use Edge Flow Manager-deployed MiNiFi Agents, available in Java and C++ versions.
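To illustrate the "DevOps with REST API, CLI, Python API" point above, here is a minimal sketch that polls a NiFi instance's overall flow status over its REST API; the host, port and lack of authentication are assumptions for a local, unsecured NiFi.

# Minimal sketch: poll NiFi's REST API for overall flow status.
# Assumes an unsecured NiFi at localhost:8080; adjust host/port/auth for your cluster.
import requests

NIFI_API = "http://localhost:8080/nifi-api"

resp = requests.get(NIFI_API + "/flow/status", timeout=10)
resp.raise_for_status()

status = resp.json()["controllerStatus"]
print("Active threads:", status["activeThreadCount"])
print("Queued:", status["queued"])            # e.g. "12 / 3.4 MB"
print("Running components:", status["runningCount"])
print("Stopped components:", status["stoppedCount"])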
References
https://www.progress.com/tutorials/jdbc/ingest-salesforce-data-incrementally-into-hive-using-apache-nifi
https://community.cloudera.com/t5/Community-Articles/RDBMS-to-Hive-using-NiFi-small-medium-tables/ta-p/244677
https://community.cloudera.com/t5/Community-Articles/My-Year-in-Review-2018/ta-p/249363
https://community.cloudera.com/t5/Community-Articles/My-Year-in-Review-2017/ta-p/247541
04-25-2019
12:47 PM
https://bryanbende.com/development/2015/05/17/collecting-logs-with-apache-nifi
http://www.idata.co.il/2017/11/ingesting-log-data-with-nifi-minifi-and-hive/
https://community.hortonworks.com/articles/218844/log-forwarding-ingestion-patterns-using-minifi-and.html
https://community.hortonworks.com/articles/67756/ingesting-log-data-using-minifi-nifi.html
https://www.keedio.org/real-time-data-ingestion-processing-and-storage-of-syslog-using-apache-nifi-flink-kafka-and-cassandra/
04-25-2019
12:44 PM
1 Kudo
The MiNiFi Agent, in either its C++ or Java version, has many advantages over Filebeat, including letting you program it easily in a GUI using Edge Flow Manager. It also does more than grab logs: it can do analysis, conversion, filtering and more advanced processing, and it can read syslog, databases, MQTT, JMS and a hundred other things. We don't use any ELK components; NiFi replaces all of that. NiFi is in production at hundreds of massive companies for enterprise-critical applications. NiFi can push to Elasticsearch, or you can store all of your data in HDFS and use Solr on top for search.
04-01-2019
02:15 PM
https://www.datainmotion.dev/2019/03/advanced-xml-processing-with-apache.html
03-09-2019
09:01 PM
Using Raspberry Pi 3B+ with Apache NiFi MiNiFi, Google Coral Accelerator and Pimoroni Inky pHAT

First we need to unbox our new goodies. The Inky pHAT is an awesome E-Ink display with low power usage whose image stays displayed after shutdown! Next I added a new Google Coral Edge TPU ML Accelerator USB coprocessor to a new Raspberry Pi 3B+. This was easy to integrate and get up and running. Let's unbox this beautiful device (but be careful: when it runs it can get really hot, and there is a warning in the instructions), so I run it on top of an aluminum case with a big fan on it.

Pimoroni Inky pHAT
It is pretty easy to set up, and it provides a robust Python library for writing to the E-Ink display. You can see an example screen here: https://github.com/pimoroni/inky
Pimoroni Inky pHAT ePaper eInk Display in Red
https://shop.pimoroni.com/products/inky-phat
https://github.com/pimoroni/inky
https://pillow.readthedocs.io/en/stable/reference/ImageDraw.html
https://learn.pimoroni.com/tutorial/sandyj/getting-started-with-inky-phat

Install Some Python Libraries and Debian Packages for the Inky pHAT and Coral
pip3 install font_fredoka_one
pip3 install geocoder
pip3 install fswebcam
sudo apt-get install fe
pip3 install psutil
pip3 install font_hanken_grotesk
pip3 install font_intuitive
wget http://storage.googleapis.com/cloud-iot-edge-pretrained-models/edgetpu_api.tar.gz

These libraries are for the Inky; it needs fonts to write. The last TAR is for the Edge TPU and is a fast install, documented well by Google.

Download Apache NiFi - MiNiFi Java Agent
https://nifi.apache.org/minifi/download.html

Next up, the most important piece: you will need JDK 8 installed on your device if you are using the Java agent. You can also use the MiNiFi C++ agent, but that may require building it for your OS/platform. That one has some interesting Python-running abilities.
Google Coral Documentation - Google Edge TPU
Google Edge TPU ML accelerator coprocessor
USB 3.0 Type-C socket
Supports Debian Linux on host CPU
ASIC designed by Google that provides high-performance ML inferencing for TensorFlow Lite models
https://coral.withgoogle.com/tutorials/edgetpu-retrain-classification-ondevice/
https://coral.withgoogle.com/tutorials/edgetpu-api/
http://storage.googleapis.com/cloud-iot-edge-pretrained-models/edgetpu_api_reference.zip
https://coral.withgoogle.com/web-compiler/
https://coral.withgoogle.com/tutorials/edgetpu-models-intro/
https://coral.withgoogle.com/tutorials/accelerator/
https://coral.withgoogle.com/models/
https://coral.withgoogle.com/tutorials/accelerator-datasheet/

Using Pretrained TensorFlow Lite Model: Inception V4 (ImageNet)
Recognizes 1,000 types of objects
Dataset: ImageNet
Input size: 299x299

Let's run a flow:
Results (Once an hour we update our E-Ink display with Date, IP, Run Time, Label 1 - a minimal Python sketch of this update follows below)

Source Code
https://github.com/tspannhw/nifi-minifi-coral

References
https://medium.freecodecamp.org/building-an-iiot-system-using-apache-nifi-mqtt-and-raspberry-pi-ce1d6ed565bc
https://community.hortonworks.com/articles/85984/using-minifi-to-read-data-from-a-sense-hat-on-a-ra.html
https://community.hortonworks.com/articles/107379/minifi-for-image-capture-and-ingestion-from-raspbe.html
https://community.hortonworks.com/articles/32605/running-nifi-on-raspberry-pi-best-practices.html
https://www.tensorflow.org/lite/convert/cmdline_examples
https://www.tensorflow.org/lite/guide/get_started
https://pillow.readthedocs.io/en/stable/reference/ImageDraw.html
https://coral.withgoogle.com/tutorials/edgetpu-faq/
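The hourly E-Ink update mentioned above (date, IP, run time, top label) can be sketched with the Pimoroni inky library and Pillow; the fonts, layout and data sources below are illustrative assumptions rather than the exact code the flow runs.

# Minimal sketch: write date, IP, run time and a top label to the Inky pHAT (red).
import socket
import time
from datetime import datetime

from PIL import Image, ImageDraw, ImageFont
from font_fredoka_one import FredokaOne
from inky import InkyPHAT

start_time = time.time()
display = InkyPHAT("red")

img = Image.new("P", (display.WIDTH, display.HEIGHT))
draw = ImageDraw.Draw(img)
font = ImageFont.truetype(FredokaOne, 16)

ip_address = socket.gethostbyname(socket.gethostname())  # assumption: resolves to the Pi's LAN IP
run_time = round(time.time() - start_time, 2)
top_label = "coffeepot"                                  # placeholder for the model's label_1

lines = [
    datetime.now().strftime("%m/%d/%Y %H:%M"),
    "IP: " + ip_address,
    "Run: " + str(run_time) + "s",
    "Label: " + top_label,
]
for i, line in enumerate(lines):
    draw.text((5, 5 + i * 25), line, display.RED, font)

display.set_image(img)
display.show()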
02-13-2019
05:22 AM
5 Kudos
Use Case
IoT devices with sensors and cameras.

Overview
In this, the third article of the CDSW series, we build on using CDSW to classify images with a Python Apache MXNet model. In this use case we are receiving edge data from devices running MiNiFi agents that are collecting sensor data and images and also running edge analytics with TensorFlow. An Apache NiFi server collects this data, with full data lineage, using HTTP calls from the device(s). We then filter, transform, merge and route the sensor data, image data, deep learning analytics data and metadata to different data stores. As part of the flow we upload our images to a cloud-hosted FTP server (this could be S3 or any media store anywhere) and call a CDSW model from Apache NiFi via REST, getting the model results back as JSON. We are also storing our sensor data in Parquet files in HDFS. We then trigger a PySpark job from CDSW via API from Apache NiFi and check its status. We store the status result data in Parquet as well, for PySpark SQL analysis. As additional steps, we can join the image and sensor data via image name and do additional queries, reports and dashboards. We can also route this data to Apache Kafka for downstream analysis in Kafka Streams, Storm, Spark Streaming or SAM.

Part 1: https://community.hortonworks.com/content/kbentry/239858/integrating-machine-learning-models-into-your-big.html
Part 2: https://community.hortonworks.com/content/kbentry/239961/using-cloudera-data-science-workbench-with-apache.html

Summary
MiNiFi Java agents read sensor values and feed them to Apache NiFi via HTTPS with full data provenance and lineage. Apache NiFi acts as the master orchestrator, conducting, filtering, transforming, converting, querying, aggregating, routing and cleansing the streams. As part of the flow we call Cloudera Data Science Workbench via REST API to classify ingested images with an Apache MXNet Python GluonCV YOLO model. We also call a Spark job to process ingested Parquet files stored in HDFS, loaded from the related sensor data and metadata. The PySpark jobs are triggered from Apache NiFi via REST API calls to Cloudera Data Science Workbench's Jobs API (a minimal Python sketch of this call follows below). For this particular integration I am using a self-built Apache NiFi 1.9, Apache NiFi - MiNiFi Java Agent 0.5.0, Cloudera Data Science Workbench 1.5 for HDP, HDFS, Apache Spark 2, Python 3, PySpark and Parquet.

Overall Apache NiFi Flow

Workflow walk-thru
For images, we transmit them to an FTP server, run them through an Inception classifier (the TensorFlow NiFi processor) and extract those results plus metadata for future use. For sensor data, we merge it, convert it to Parquet and store the files. We also store it to HBase and send alerts to a Slack channel. When that is complete we trigger an Apache Spark PySpark SQL job via CDSW. This job can email us a report and has nice dashboards so you can see your job runs. We also clean up, filter, flatten and merge the JSON status into Parquet files for future analysis with PySpark SQL. We must set Content-Type to application/json, send an empty message body, use no chunked encoding, and you can turn on Always Output Response. We need to clean up and remove some fields from the returned status; Jolt works magic on JSON. Setting up FTP is easy. Here is what some of the sensor data looks like while in motion. We set up a job in CDSW very easily from an existing Python file. After we have run the job a few times we get a nice graph of run duration for our job history. You can see details of the run, including the session and the results.
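The Jobs API call mentioned in the summary can also be exercised from plain Python for testing; below is a minimal sketch against the documented start endpoint, where the host, project path, job id and API key are placeholders, and passing the CDSW API key as the basic-auth username (blank password) is an assumption about the v1 API's authentication.

# Minimal sketch: start a CDSW job over the v1 Jobs API, as NiFi does via InvokeHTTP.
import requests

CDSW_HOST = "http://cdsw.example.com"                           # placeholder
PROJECT = "tspann/future-of-data-meetup-princeton-12-feb-2019"
JOB_ID = 4
API_KEY = "your-cdsw-api-key"                                   # placeholder

url = CDSW_HOST + "/api/v1/projects/" + PROJECT + "/jobs/" + str(JOB_ID) + "/start"

# Empty JSON body, Content-Type application/json, as noted in the walk-thru above.
resp = requests.post(url, json={}, auth=(API_KEY, ""), timeout=30)
resp.raise_for_status()

print(resp.status_code)
print(resp.json())  # inspect the returned run/status JSON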
When the job is running you can see it in process, along with all the completed runs. We can query our data with PySpark DataFrames for simple output, display the schema, and use Pandas for a nicer table display of the data.

Load Data Manually
We can have Apache NiFi push to HDFS directly for us. To load data manually in Cloudera DSW after uploading the files to a directory in CDSW:

# To load data created by NiFi
!hdfs dfs -mkdir /tmp/status
!hdfs dfs -put status/*.parquet /tmp/status
!hdfs dfs -ls /tmp/status

!hdfs dfs -mkdir /tmp/sensors
!hdfs dfs -put sensors/*.parquet /tmp/sensors
!hdfs dfs -ls /tmp/sensors
Source Code
https://github.com/tspannhw/nifi-cdsw-edge

Jolt To Cleanup CDSW Status JSON
[{
  "operation": "shift",
  "spec": {
    "*.*": "&(0,1)_&(0,2)",
    "*.*.*": "&(0,1)_&(0,2)_&(0,3)",
    "*.*.*.*": "&(0,1)_&(0,2)_&(0,3)_&(0,4)",
    "*": "&"
  }
},
{
  "operation": "remove",
  "spec": {
    "environment": "",
    "environment*": "",
    "latest_k8s": "",
    "report_attachments": ""
  }
}]

We remove the arrays, remove some unwanted fields and flatten the data for easy querying. We then convert to Apache Avro and store as Apache Parquet files for querying with PySpark.

URL to Start a Cloudera Data Science Workbench Job
http://cdsw/api/v1/projects/tspann/future-of-data-meetup-princeton-12-feb-2019/jobs/4/start
as per: http://cdsw.example.com/api/v1/projects/<$USERNAME>/<$PROJECT_NAME>/jobs/<$JOB_ID>/start

What Does the IoT Data Look Like?
{
"uuid" : "20190213043439_e58bee05-142b-4b7e-a28b-fec0305ab125",
"BH1745_clear" : "0.0",
"te" : "601.1575453281403",
"host" : "piups",
"BH1745_blue" : "0.0",
"imgname" : "/opt/demo/images/bog_image_20190213043439_e58bee05-142b-4b7e-a28b-fec0305ab125.jpg",
"lsm303d_accelerometer" : "+00.08g : -01.01g : +00.09g",
"cputemp" : 44,
"systemtime" : "02/12/2019 23:34:39",
"memory" : 45.7,
"bme680_tempc" : "23.97",
"imgnamep" : "/opt/demo/images/bog_image_p_20190213043439_e58bee05-142b-4b7e-a28b-fec0305ab125.jpg",
"bme680_pressure" : "1000.91",
"BH1745_red" : "0.0",
"bme680_tempf" : "75.15",
"diskusage" : "9622.5",
"ltr559_lux" : "000.00",
"bme680_humidity" : "24.678",
"lsm303d_magnetometer" : "+00.03 : +00.42 : -00.11",
"BH1745_green" : "0.0",
"ipaddress" : "192.168.1.166",
"starttime" : "02/12/2019 23:24:38",
"ltr559_prox" : "0000",
"VL53L1X_distance_in_mm" : 553,
"end" : "1550032479.3900714"
}

What Does the TensorFlow Image Analytics Data Look Like?
{"probability_4":"2.00%","file.group":"root",
"s2s.address":"192.168.1.166:60966",
"probability_5":"1.90%","file.lastModifiedTime":"2019-02-12T18:02:21-0500",
"probability_2":"3.14%","probability_3":"2.35%","probability_1":"3.40%",
"file.permissions":"rw-r--r--","uuid":"0596aa5f-325b-4bd2-ae80-6c7561c8c056",
"absolute.path":"/opt/demo/images/","path":"/","label_5":"fountain",
"label_4":"lampshade","filename":"bog_image_20190212230221_00c846a7-b8d2-4192-b8eb-f6f13268483c.jpg",
"label_3":"breastplate","s2s.host":"192.168.1.166","file.creationTime":"2019-02-12T18:02:21-0500",
"file.lastAccessTime":"2019-02-12T18:02:21-0500",
"file.owner":"root",
"label_2":"spotlight",
"label_1":"coffeepot",
"RouteOnAttribute.Route":"isImage"} Transformed Job Status Data {
"id" : 4,
"name" : "Pyspark SQL Job",
"script" : "pysparksqljob.py",
"cpu" : 2,
"memory" : 4,
"nvidia_gpu" : 0,
"engine_image_id" : 7,
"kernel" : "python3",
"englishSchedule" : "",
"timezone" : "America/New_York",
"total_runs" : 108,
"total_failures" : 0,
"paused" : false,
"type" : "manual",
"creator_id" : 19,
"creator_username" : "tspann",
"creator_name" : "Timothy Spann",
"creator_email" : "tspann@EmailIsland.Space",
"creator_url" : "http://cdsw-hdp-3/api/v1/users/tspann",
"creator_html_url" : "http://cdsw-hdp-3/tspann",
"project_id" : 30,
"project_slug" : "tspann/future-of-data-meetup-princeton-12-feb-2019",
"project_name" : "Future of Data Meetup Princeton 12 Feb 2019",
"project_owner_id" : 19,
"project_owner_username" : "tspann",
"project_owner_email" : "tspann@email.tu",
"project_owner_name" : "Timothy Spann",
"project_owner_url" : "http://cdsw-hdp-3/api/v1/users/tspann",
"project_owner_html_url" : "http://cdsw-hdp/tspann",
"project_url" : "http://cdsw-hdp-3/api/v1/projects/tspann/future-of-data-meetup-princeton-12-feb-2019",
"project_html_url" : "http://cdsw-hdp-3/tspann/future-of-data-meetup-princeton-12-feb-2019",
"latest_id" : "jq47droa9zv9ou0j",
"latest_batch" : true,
"latest_job_id" : 4,
"latest_status" : "scheduling",
"latest_oomKilled" : false,
"latest_created_at" : "2019-02-13T13:04:28.961Z",
"latest_scheduling_at" : "2019-02-13T13:04:28.961Z",
"latest_url" : "http://server/api/v1/projects/tspann/future-of-data-meetup-princeton-12-feb-2019/dashboards/jq47droa9zv9ou0j",
"latest_html_url" : "http://server/tspann/future-of-data-meetup-princeton-12-feb-2019/engines/jq47droa9zv9ou0j",
"latest_shared_view_visibility" : "private",
"report_include_logs" : true,
"report_send_from_creator" : false,
"timeout" : 30,
"timeout_kill" : false,
"created_at" : "2019-02-13T04:46:26.597Z",
"updated_at" : "2019-02-13T04:46:26.597Z",
"shared_view_visibility" : "private",
"url" : "http://serverapi/v1/projects/tspann/future-of-data-meetup-princeton-12-feb-2019/jobs/4",
"html_url" : "http://server/tspann/future-of-data-meetup-princeton-12-feb-2019/jobs/4",
"engine_id" : "jq47droa9zv9ou0j"
}

PySpark Sensor Spark SQL for Data Analysis
from __future__ import print_function
import pandas as pd
import sys, re
from operator import add
from pyspark.sql import SparkSession
pd.options.display.html.table_schema = True
spark = SparkSession.builder.appName("Sensors").getOrCreate()
# Access the parquet
sensor = spark.read.parquet("/tmp/sensors/*.parquet")
data = sensor.toPandas()
pd.DataFrame(data)
spark.stop()

PySpark Status Spark SQL for Data Analysis
from __future__ import print_function
import pandas as pd
import sys, re
from operator import add
from pyspark.sql import SparkSession
pd.options.display.html.table_schema = True
spark = SparkSession\
.builder\
.appName("Status")\
.getOrCreate()
# Access the parquet
status = spark.read.parquet("/tmp/status/*.parquet")
# show content
status.show()
# query a few of the job-status fields
status.select(status['name'], status['latest_status'], status['total_runs'], status['total_failures']).show()
status.printSchema()
status.count()
data = status.toPandas()
pd.DataFrame(data)
spark.stop()
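The job above is named "Pyspark SQL Job" but the snippet sticks to the DataFrame API; as a small complement, here is a sketch of querying the same flattened status Parquet through Spark SQL, using field names from the jobstatus schema below (the particular aggregation is illustrative).

# Illustrative Spark SQL query over the flattened job-status Parquet files;
# column names come from the jobstatus schema documented below.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StatusSQL").getOrCreate()

status = spark.read.parquet("/tmp/status/*.parquet")
status.createOrReplaceTempView("jobstatus")

spark.sql("""
    SELECT name,
           latest_status,
           MAX(total_runs)     AS total_runs,
           MAX(total_failures) AS total_failures
    FROM jobstatus
    GROUP BY name, latest_status
    ORDER BY total_runs DESC
""").show()

spark.stop()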
Status Schema (jobstatus)
{
"type":"record",
"name":"jobstatus",
"fields":[
{
"name":"id",
"type":["int","null"]
},
{
"name":"name",
"type":["string","null"]
},
{
"name":"script",
"type":["string","null"]
},
{
"name":"cpu",
"type":["int","null"]
},
{
"name":"memory",
"type":["int","null"]
},
{
"name":"nvidia_gpu",
"type":["int","null"]
},
{
"name":"engine_image_id",
"type":["int","null"]
},
{
"name":"kernel",
"type":["string","null"]
},
{
"name":"englishSchedule",
"type":["string","null"]
},
{
"name":"timezone",
"type":["string","null"]
},
{
"name":"total_runs",
"type":["int","null"]
},
{
"name":"total_failures",
"type":["int","null"],
"doc":"Type inferred from '0'"
},
{
"name":"paused",
"type":["boolean","null"],
"doc":"Type inferred from 'false'"
},
{
"name":"type",
"type":["string","null"],
"doc":"Type inferred from '\"manual\"'"
},
{
"name":"creator_id",
"type":["int","null"],
"doc":"Type inferred from '19'"
},
{
"name":"creator_username",
"type":["string","null"]
},
{
"name":"creator_name",
"type":["string","null"]
},
{
"name":"creator_email",
"type":["string","null"]
},
{
"name":"creator_url",
"type":["string","null"]
},
{
"name":"creator_html_url",
"type":["string","null"]
},
{
"name":"project_id",
"type":["int","null"]
},
{
"name":"project_slug",
"type":["string","null"]
},
{
"name":"project_name",
"type":["string","null"]
},
{
"name":"project_owner_id",
"type":["int","null"]
},
{
"name":"project_owner_username",
"type":["string","null"]
},
{
"name":"project_owner_email",
"type":["string","null"]
},
{
"name":"project_owner_name",
"type":["string","null"]
},
{
"name":"project_owner_url",
"type":["string","null"]
},
{
"name":"project_owner_html_url",
"type":["string","null"]
},
{
"name":"project_url",
"type":["string","null"]
},
{
"name":"project_html_url",
"type":["string","null"]
},
{
"name":"latest_id",
"type":["string","null"]
},
{
"name":"latest_batch",
"type":["boolean","null"]
},
{
"name":"latest_job_id",
"type":["int","null"]
},
{
"name":"latest_status",
"type":["string","null"]
},
{
"name":"latest_oomKilled",
"type":["boolean","null"]
},
{
"name":"latest_created_at",
"type":["string","null"]
},
{
"name":"latest_scheduling_at",
"type":["string","null"]
},
{
"name":"latest_url",
"type":["string","null"]
},
{
"name":"latest_html_url",
"type":["string","null"]
},
{
"name":"latest_shared_view_visibility",
"type":["string","null"]
},
{
"name":"report_include_logs",
"type":["boolean","null"]
},
{
"name":"report_send_from_creator",
"type":["boolean","null"]
},
{
"name":"timeout",
"type":["int","null"]
},
{
"name":"timeout_kill",
"type":["boolean","null"]
},
{
"name":"created_at",
"type":["string","null"]
},
{
"name":"updated_at",
"type":["string","null"]
},
{
"name":"shared_view_visibility",
"type":["string","null"]
},
{
"name":"url",
"type":["string","null"]
},
{
"name":"html_url",
"type":["string","null"]
},
{
"name":"engine_id",
"type":["string","null"]
}
]
}

Documentation
https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_rest_apis.html#cdsw_api

References
https://community.hortonworks.com/articles/229522/iot-series-sensors-utilizing-breakout-garden-hat-p.html
https://community.hortonworks.com/articles/232136/iot-series-sensors-utilizing-breakout-garden-hat-p-1.html

Join me in March at DataWorks Summit in Barcelona, or in Princeton monthly.
02-10-2019
10:15 PM
This is an update to a previous article on accessing Philadelphia open crime data and storing it in Apache Phoenix on HBase.

Part 1: https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
For the NoSQL summit: https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
Update this: http://princeton0.field.hortonworks.com:16010/master-status
https://community.hortonworks.com/content/kbentry/54947/reading-opendata-json-and-storing-into-phoenix-tab.html

Philly Crime Data
City of Philly App Token: 76MVJDcTksxeS1uYPf8D0XdUF
Secret Token: WZnlB_YJ5r9rjj_alWVdc_yqnxRpnIk5BHgb

Crime
https://data.phila.gov/resource/sspu-uyfa.json?$$app_token=76MVJDcTksxeS1uYPf8D0XdUF&$limit=5000
https://www.opendataphilly.org/dataset/crime-incidents/resource/f6e5998d-9d33-4a45-8397-3a6bb8607d10
https://www.opendataphilly.org/dataset/crime-incidents
https://data.phila.gov/resource/sspu-uyfa.json
https://data.phila.gov/resource/sspu-uyfa.json?$$app_token=76MVJDcTksxeS1uYPf8D0XdUF&$limit=5000&dispatch_date=2019-02-18

Example record:
{"dc_dist":"18", "dc_key":"200918067518", "dispatch_date":"2009-10-02", "dispatch_date_time":"2009-10-02T14:24:00.000", "dispatch_time":"14:24:00", "hour":"14", "location_block":"S 38TH ST / MARKETUT ST", "psa":"3", "text_general_code":"Other Assaults", "ucr_general":"800"}

Hive table:
CREATE EXTERNAL TABLE crime (dc_dist STRING, dc_key STRING, dispatch_date STRING, dispatch_date_time STRING, hour STRING, location_block STRING, psa STRING, text_general_code STRING, ucr_general STRING)
CLUSTERED BY (psa) INTO 4 BUCKETS
ROW FORMAT DELIMITED
STORED AS ORC
LOCATION '/crime/hive'
TBLPROPERTIES('transactional'='true');

Fields: dc_dist, dc_key, dispatch_date, dispatch_date_time, hour, location_block, psa, text_general_code, ucr_general

Today:
https://data.phila.gov/resource/sspu-uyfa.json?$$app_token=76MVJDcTksxeS1uYPf8D0XdUF&$limit=5000&dispatch_date=2016-09-12
&dispatch_date=${now():format('yyyy-MM-dd')}

311 service:
https://data.phila.gov/resource/4t9v-rppq.json?$$app_token=76MVJDcTksxeS1uYPf8D0XdUF&$limit=5000
http://arcgis.dvrpc.org/arcgis/rest/services/Transportation/PedestrianCounts/FeatureServer/1
http://arcgis.dvrpc.org/arcgis/rest/services/Transportation/TrafficCounts/FeatureServer/0

See: https://community.hortonworks.com/articles/52856/stream-data-into-hive-like-a-king-using-nifi.html
Perhaps adapt this: https://github.com/socrata/soda-java
https://www.opendataphilly.org/

Create the crime table and HDFS directories:
hdfs dfs -mkdir -p /crime/fail
hdfs dfs -mkdir -p /crime

Phoenix
https://phoenix.apache.org/faq.html
/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181:/hbase-unsecure

CREATE TABLE phillycrime (dc_dist varchar, dc_key varchar not null primary key, dispatch_date varchar, dispatch_date_time varchar, dispatch_time varchar, hour varchar, location_block varchar, psa varchar, text_general_code varchar, ucr_general varchar);

{"dc_dist":"18","dc_key":"200918067518","dispatch_date":"2009-10-02","dispatch_date_time":"2009-10-02T14:24:00.000","dispatch_time":"14:24:00","hour":"14","location_block":"S 38TH ST / MARKETUT ST","psa":"3","text_general_code":"Other Assaults","ucr_general":"800"}

upsert into phillycrime values ('18', '200918067518', '2009-10-02', '2009-10-02T14:24:00.000', '14:24:00', '14', 'S 38TH ST / MARKETUT ST', '3', 'Other Assaults', '800');
!tables
!describe phillycrime

JDBC connection details:
org.apache.phoenix.jdbc.PhoenixDriver
jdbc:phoenix:localhost:2181:/hbase-unsecure
/usr/hdp/2.4.0.0-169/phoenix/phoenix-client.jar
/usr/hdp/2.4.0.0-169/hbase/lib/hbase-client.jar
/etc/hbase/conf/hbase-site.xml plus the Hadoop ones
https://community.hortonworks.com/articles/19016/connect-to-phoenix-hbase-using-dbvisualizer.html

0: jdbc:phoenix:localhost:2181:/hbase-unsecur> !describe phillycrime
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME  | COLUMN_NAME        | DATA_TYPE |
|           |             | PHILLYCRIME | DC_DIST            | 12        |
|           |             | PHILLYCRIME | DC_KEY             | 12        |
|           |             | PHILLYCRIME | DISPATCH_DATE      | 12        |
|           |             | PHILLYCRIME | DISPATCH_DATE_TIME | 12        |
|           |             | PHILLYCRIME | HOUR               | 12        |
|           |             | PHILLYCRIME | LOCATION_BLOCK     | 12        |
|           |             | PHILLYCRIME | PSA                | 12        |
|           |             | PHILLYCRIME | TEXT_GENERAL_CODE  | 12        |
|           |             | PHILLYCRIME | UCR_GENERAL        | 12        |

0: jdbc:phoenix:localhost:2181:/hbase-unsecur> upsert into phillycrime values ('18', '200918067518', '2009-10-02', '2009-10-02T14:24:00.000', '14', 'S 38TH ST / MARKETUT ST', '3', 'Other Assaults', '800');
1 row affected (0.115 seconds)

0: jdbc:phoenix:localhost:2181:/hbase-unsecur> select * from phillycrime;
| DC_DIST | DC_KEY       | DISPATCH_DATE | DISPATCH_DATE_TIME      |    |
| 18      | 200918067518 | 2009-10-02    | 2009-10-02T14:24:00.000 | 14 |
1 row selected (0.197 seconds)

ReplaceText replacement value (builds the Phoenix upsert from each JSON record):
upsert into phillycrime values ('${'dc_dist'}','${'dc_key'}','${'dispatch_date'}','${'dispatch_date_time'}','${'dispatch_time'}','${'hour'}','${'location_block'}','${'psa'}','${'text_general_code'}','${'ucr_general'}')

0: jdbc:phoenix:localhost:2181:/hbase-unsecur> select distinct(text_general_code) from phillycrime;

NiFi processor properties:
<property>
  <name>URL</name>
  <value>https://data.phila.gov/resource/sspu-uyfa.json?$$$app_token=76MVJDcTksxeS1uYPf8D0XdUF&$$limit=5000&dispatch_date=${now():format('yyyy-MM-dd')}</value>
</property>
<property>
  <name>Filename</name>
  <value>phillycrime${now():toNumber()}.json</value>
</property>
<property>
  <name>Regular Expression</name>
  <value>(?s:^.*$)</value>
</property>
<property>
  <name>Replacement Value</name>
  <value>upsert into phillycrime values ('${'dc_dist'}','${'dc_key'}','${'dispatch_date'}','${'dispatch_date_time'}','${'dispatch_time'}','${'hour'}','${'location_block'}','${'psa'}','${'text_general_code'}','${'ucr_general'}')</value>
</property>

DBCPConnectionPool controller service:
<comment/>
<class>org.apache.nifi.dbcp.DBCPConnectionPool</class>
<enabled>true</enabled>
<property>
  <name>Database Connection URL</name>
  <value>jdbc:phoenix:localhost:2181:/hbase-unsecure</value>
</property>
<property>
  <name>Database Driver Class Name</name>
  <value>org.apache.phoenix.jdbc.PhoenixDriver</value>
</property>
<property>
  <name>Database Driver Jar Url</name>
  <value>file:///usr/hdp/2.4.0.0-169/phoenix/phoenix-client.jar</value>
</property>

https://data.phila.gov/resource/sspu-uyfa.json
https://www.opendataphilly.org/dataset/crime-incidents

Run Spring Boot …
https://community.hortonworks.com/articles/72420/ingesting-remote-sensor-feeds-into-apache-phoenix.html
https://github.com/tspannhw/phoenix

org.apache.phoenix.jdbc.PhoenixDriver
jdbc:phoenix:princeton0.field.hortonworks.com:/hbase-unsecure
phoenixuser
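The GetHTTP/ReplaceText pattern above builds one Phoenix upsert per crime record. The same mapping can be sketched in plain Python for testing, using the Socrata endpoint and app token shown above; here the statements are only printed rather than executed against Phoenix.

# Sketch: pull today's crime records from the Socrata API and print the
# Phoenix upserts the NiFi ReplaceText template above would generate.
from datetime import date
import requests

URL = "https://data.phila.gov/resource/sspu-uyfa.json"
PARAMS = {
    "$$app_token": "76MVJDcTksxeS1uYPf8D0XdUF",
    "$limit": 5000,
    "dispatch_date": date.today().isoformat(),
}
FIELDS = ["dc_dist", "dc_key", "dispatch_date", "dispatch_date_time", "dispatch_time",
          "hour", "location_block", "psa", "text_general_code", "ucr_general"]

records = requests.get(URL, params=PARAMS, timeout=30).json()

for rec in records:
    # Single quotes inside values must be doubled for Phoenix SQL literals.
    values = ", ".join("'" + str(rec.get(f, "")).replace("'", "''") + "'" for f in FIELDS)
    print("upsert into phillycrime values (" + values + ");")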