Posts: 1973
Kudos Received: 1225
Solutions: 124

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2486 | 04-03-2024 06:39 AM |
| | 3840 | 01-12-2024 08:19 AM |
| | 2078 | 12-07-2023 01:49 PM |
| | 3062 | 08-02-2023 07:30 AM |
| | 4195 | 03-29-2023 01:22 PM |
07-14-2018
09:29 PM
3 Kudos
Scanning Documents into Data Lakes via Tesseract, Python, OpenCV and Apache NiFi

Source: https://github.com/tspannhw/nifi-tesseract-python

There are many awesome open source tools available to integrate with your Big Data streaming flows. Take a look at these articles for installation details and for why the new version of Tesseract is different. I officially recommend Python 3.6 or newer. Please don't use Python 2.7 if you don't have to. Friends don't let friends use old Python.

Tesseract 4 with Deep Learning: https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/
GitHub: https://github.com/spmallick/learnopencv/tree/master/OCR

For installation on a Mac laptop:

brew install tesseract --HEAD
pip3.6 install pytesseract
brew install leptonica

Note: if you already have Tesseract installed, you may need to uninstall and unlink it first with brew. If you don't use brew, you can install it another way.

Summary
Execute run.sh (which runs https://github.com/tspannhw/nifi-tesseract-python/blob/master/pytesstest.py). It sends an MQTT message containing the OCR text and some other attributes in JSON format to the tesseract topic on the specified MQTT broker. The flow:

Apache NiFi reads from the topic via ConsumeMQTT.
The flow checks that it's valid JSON via RouteOnContent.
MergeRecord converts a batch of JSON records into one big Apache Avro file.
ConvertAvroToORC then produces a superfast Apache ORC file for storage.
PutHDFS stores it in HDFS.

Running the Python Script

You could hook this up to a scanner or point it at a directory, or have it scheduled to run every 30 seconds or so. I had it hooked up to a local Apache NiFi instance to schedule runs. It can also be run by the MiNiFi Java agent or MiNiFi C++ agent, or on demand if you wish.

Sending MQTT Messages From Python

# MQTT
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.username_pw_set("user", "pass")
client.connect("server.server.com", 17769, 60)
client.publish("tesseract", payload=json_string, qos=0, retain=True)

You will need to run: pip3 install paho-mqtt

Create the HDFS Directory

hdfs dfs -mkdir -p /tesseract
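Before publishing, the script has to assemble the `json_string` payload. Here is a minimal, hypothetical sketch of building that message using only the standard library; the field names follow the tesseract schema shown later in this article, and the hardware metrics are placeholders (the real script gathers them with psutil).

```python
# Hypothetical sketch of the JSON payload sent to the tesseract topic.
# Field names mirror the schema below; metric values are stdlib placeholders.
import json
import socket
import time
import uuid
from datetime import datetime

def build_payload(ocr_text, image_path):
    """Assemble the tesseract-topic message as a JSON string."""
    now = datetime.now()
    uniq = "{0}_{1}".format(now.strftime("%Y%m%d%H%M%S"), uuid.uuid4())
    return json.dumps({
        "text": ocr_text,
        "imgname": image_path,
        "host": socket.gethostname(),
        "end": str(time.time()),
        "te": "0.0",                # elapsed OCR time, filled in by the caller
        "battery": 100,             # psutil.sensors_battery() in the real script
        "systemtime": now.strftime("%m/%d/%Y %H:%M:%S"),
        "cpu": 0.0,                 # psutil.cpu_percent() in the real script
        "diskusage": "0 MB",
        "memory": 0.0,
        "id": uniq,
    })

json_string = build_payload("sample scanned text", "images/tesseract_image.jpg")
```

The `id` field doubles as a correlation key between the saved image file and the MQTT record.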
Create the External Hive Table (DDL Built by NiFi)

CREATE EXTERNAL TABLE IF NOT EXISTS tesseract (`text` STRING, imgname STRING, host STRING, `end` STRING, te STRING, battery INT, systemtime STRING, cpu DOUBLE, diskusage STRING, memory DOUBLE, id STRING) STORED AS ORC
LOCATION '/tesseract';

This DDL is a side effect: it's built by our ORC conversion and HDFS storage commands. You could run the create script in Hive View 2, Beeline or another Apache Hive JDBC/ODBC tool. I used Apache Zeppelin since I am going to be doing queries there anyway.

Let's ingest our captured images, process them with Apache Tika and TensorFlow, and grab the metadata.

Consume MQTT Records and Store in Apache Hive

Let's look at other fields in Zeppelin.

Let's look at our records in Apache Zeppelin via a SQL query (SELECT * FROM tesseract).

ConsumeMQTT: give me all the records from the tesseract topic on our MQTT broker. This isolates us from our ingest clients, which could be 100,000 devices.
MergeRecord: merge all the JSON files sent via MQTT into one big Avro file.
ConvertAvroToORC: convert our merged Avro file.
PutHDFS: store the result in HDFS.

Tesseract Example Schema in Hortonworks Schema Registry

TIP: You can generate your schema with InferAvroSchema. Do that once, copy it and paste it into Schema Registry. Then you can remove that step from your flow.

The Schema Text

{
"type": "record",
"name": "tesseract",
"fields": [
{
"name": "text",
"type": "string",
"doc": "Type inferred from '\"cgi cctong aiternacrety, pou can acces the complete Pro\nLance repesiiry from eh Provenance mens: The Provenance\n‘emu inchades the Date/Time, Actontype, the Unsque Fowie\nTD and other sata. Om the ar it is smal exci i oe:\n‘ick chs icon, and you get the flowin On the right, war\n‘cots like three inthe cic soemecaed gether Liege:\n\nLineage ts visualined as « lange direcnad sqycie graph (DAG) char\nSrones the seeps 1m she Gow where modifications oF routing ‘oot\nplace on the Aewiike. Righe-iieit « step lp the Lineage s view\nSetusls aboot the fowtle at that step ar expand the ow to ander:\nScand where & was potentially domed frum. Af the very bottom\nleft of the Lineage Oi a slider wath a play button to play the pro\n“sing flow (with scaled ame} and understand where tbe owtise\nSpent the meat Game of at whch PORN get muted\n\naide the Bowtie dealin, you cam: finn deed analy of box\n\ntern\n=\"'"
},
{
"name": "imgname",
"type": "string",
"doc": "Type inferred from '\"images/tesseract_image_20180613205132_c14779b8-1546-433e-8976-ddb5bfc5f978.jpg\"'"
},
{
"name": "host",
"type": "string",
"doc": "Type inferred from '\"HW13125.local\"'"
},
{
"name": "end",
"type": "string",
"doc": "Type inferred from '\"1528923095.3205361\"'"
},
{
"name": "te",
"type": "string",
"doc": "Type inferred from '\"3.7366552352905273\"'"
},
{
"name": "battery",
"type": "int",
"doc": "Type inferred from '100'"
},
{
"name": "systemtime",
"type": "string",
"doc": "Type inferred from '\"06/13/2018 16:51:35\"'"
},
{
"name": "cpu",
"type": "double",
"doc": "Type inferred from '22.8'"
},
{
"name": "diskusage",
"type": "string",
"doc": "Type inferred from '\"113759.7 MB\"'"
},
{
"name": "memory",
"type": "double",
"doc": "Type inferred from '69.4'"
},
{
"name": "id",
"type": "string",
"doc": "Type inferred from '\"20180613205132_c14779b8-1546-433e-8976-ddb5bfc5f978\"'"
}
]
}

The above schema was generated by InferAvroSchema in Apache NiFi.

Image Analytics Results

{
"tiffImageWidth" : "1280",
"ContentType" : "image/jpeg",
"JPEGImageWidth" : "1280 pixels",
"FileTypeDetectedFileTypeName" : "JPEG",
"tiffBitsPerSample" : "8",
"ThumbnailHeightPixels" : "0",
"label4" : "book jacket",
"YResolution" : "1 dot",
"label5" : "pill bottle",
"ImageWidth" : "1280 pixels",
"JFIFYResolution" : "1 dot",
"JPEGImageHeight" : "720 pixels",
"filecreationTime" : "2018-06-13T17:24:07-0400",
"JFIFThumbnailHeightPixels" : "0",
"DataPrecision" : "8 bits",
"XResolution" : "1 dot",
"ImageHeight" : "720 pixels",
"JPEGNumberofComponents" : "3",
"JFIFXResolution" : "1 dot",
"FileTypeExpectedFileNameExtension" : "jpg",
"JPEGDataPrecision" : "8 bits",
"FileSize" : "223716 bytes",
"probability4" : "1.74%",
"tiffImageLength" : "720",
"probability3" : "3.29%",
"probability2" : "6.13%",
"probability1" : "81.23%",
"FileName" : "apache-tika-2858986094088526803.tmp",
"filelastAccessTime" : "2018-06-13T17:24:07-0400",
"JFIFThumbnailWidthPixels" : "0",
"JPEGCompressionType" : "Baseline",
"JFIFVersion" : "1.1",
"filesize" : "223716",
"FileModifiedDate" : "Wed Jun 13 17:24:27 -04:00 2018",
"Component3" : "Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
"Component1" : "Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
"Component2" : "Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
"NumberofTables" : "4 Huffman tables",
"FileTypeDetectedFileTypeLongName" : "Joint Photographic Experts Group",
"fileowner" : "tspann",
"filepermissions" : "rw-r--r--",
"JPEGComponent3" : "Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
"JPEGComponent2" : "Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
"JPEGComponent1" : "Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
"FileTypeDetectedMIMEType" : "image/jpeg",
"NumberofComponents" : "3",
"HuffmanNumberofTables" : "4 Huffman tables",
"label1" : "menu",
"XParsedBy" : "org.apache.tika.parser.DefaultParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.jpeg.JpegParser",
"label2" : "web site",
"label3" : "crossword puzzle",
"absolutepath" : "/Volumes/seagate/opensourcecomputervision/images/",
"filelastModifiedTime" : "2018-06-13T17:24:07-0400",
"ThumbnailWidthPixels" : "0",
"filegroup" : "staff",
"ResolutionUnits" : "none",
"JFIFResolutionUnits" : "none",
"CompressionType" : "Baseline",
"probability5" : "1.12%"
}
This is built using a combination of Apache Tika, TensorFlow and other metadata analysis processors.
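The TensorFlow labels arrive as flat label1..label5 / probability1..probability5 attributes with percent strings, as in the metadata above. A small sketch of pulling them back out as ranked pairs (the sample dict here is abbreviated from the example output):

```python
# Sketch: extracting ranked TensorFlow labels from the flattened Tika metadata.
metadata = {
    "label1": "menu", "probability1": "81.23%",
    "label2": "web site", "probability2": "6.13%",
    "label3": "crossword puzzle", "probability3": "3.29%",
    "label4": "book jacket", "probability4": "1.74%",
    "label5": "pill bottle", "probability5": "1.12%",
}

def top_labels(meta, k=5):
    """Return (label, probability) pairs sorted by descending probability."""
    pairs = []
    for i in range(1, k + 1):
        label = meta.get("label%d" % i)
        prob = meta.get("probability%d" % i)
        if label is not None and prob is not None:
            pairs.append((label, float(prob.rstrip("%"))))
    return sorted(pairs, key=lambda p: p[1], reverse=True)

print(top_labels(metadata)[0])  # → ('menu', 81.23)
```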
07-11-2018
02:24 PM
2 Kudos
Capture Images from PicSum.com Free Images
Process All the Images via TensorFlow Processor, SSD Predict via MMS and SqueezeNet v1.1 via MMS
Apache Zeppelin SQL Against tblsqueeze11
Example Output from SqueezeNet v1.1
Storing Generic Data in HDFS via Schema
Example SSD Data JSON
High Level Flow From Server
Apache NiFi Server Flows to Store, Convert to Apache ORC, Extract Attributes, Convert JSON Arrays
Other Example Data Derived From TensorFlow Processor
Schemas in Schema Registry
Create Table in Zeppelin
Query Table in Zeppelin

Python Libraries

git clone https://github.com/awslabs/mxnet-model-server.git
pip install opencv-python -U
pip install scikit-learn -U
pip install easydict -U
pip install scikit-image -U
pip install numpy -U
pip install mxnet -U
pip3.6 install opencv-python -U
pip3.6 install scikit-learn -U
pip3.6 install easydict -U
pip3.6 install scikit-image -U
pip3.6 install numpy -U
pip3.6 install mxnet -U

Example Runs - SqueezeNet v1.1

mxnet-model-server --models squeezenet=squeezenet_v1.1.model --service mms/model_service/mxnet_vision_service.py --port 9999
[INFO 2018-07-10 16:50:26,840 PID:7730 /usr/local/lib/python3.6/site-packages/mms/request_handler/flask_handler.py:jsonify:159] Jsonifying the response: {'prediction': [[{'probability': 0.3365139067173004, 'class': 'n03710193 mailbox, letter box'}, {'probability': 0.1522996574640274, 'class': 'n03764736 milk can'}, {'probability': 0.08760709315538406, 'class': 'n03000134 chainlink fence'}, {'probability': 0.08103135228157043, 'class': 'n02747177 ashcan, trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin'}, {'probability': 0.04956872761249542, 'class': 'n02795169 barrel, cask'}]]}
[INFO 2018-07-10 16:50:26,842 PID:7730 /usr/local/lib/python3.6/site-packages/werkzeug/_internal.py:_log:88] 127.0.0.1 - - [10/Jul/2018 16:50:26] "POST /squeezenet/predict HTTP/1.1" 200 -
[INFO 2018-07-10 16:50:46,904 PID:7730 /usr/local/lib/python3.6/site-packages/mms/serving_frontend.py:predict_callback:467] Request input: data should be image with jpeg format.
[INFO 2018-07-10 16:50:46,960 PID:7730 /usr/local/lib/python3.6/site-packages/mms/request_handler/flask_handler.py:get_file_data:137] Getting file data from request.
[INFO 2018-07-10 16:50:47,020 PID:7730 /usr/local/lib/python3.6/site-packages/mms/serving_frontend.py:predict_callback:510] Response is text.
[INFO 2018-07-10 16:50:47,020 PID:7730 /usr/local/lib/python3.6/site-packages/mms/request_handler/flask_handler.py:jsonify:159] Jsonifying the response: {'prediction': [[{'probability': 0.1060439869761467, 'class': 'n02536864 coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch'}, {'probability': 0.06582894921302795, 'class': 'n01930112 nematode, nematode worm, roundworm'}, {'probability': 0.05008145794272423, 'class': 'n01751748 sea snake'}, {'probability': 0.03847070038318634, 'class': 'n01737021 water snake'}, {'probability': 0.03614763543009758, 'class': 'n09229709 bubble'}]]}
[INFO 2018-07-10 16:50:47,021 PID:7730 /usr/local/lib/python3.6/site-packages/werkzeug/_internal.py:_log:88] 127.0.0.1 - - [10/Jul/2018 16:50:47] "POST /squeezenet/predict HTTP/1.1" 200 -
mxnet-model-server --models SSD=resnet50_ssd_model.model --service ssd_service.py --port 9998
Apache MXNet Model Server Model Zoo https://github.com/awslabs/mxnet-model-server/blob/master/docs/model_zoo.md Connect to MMS /opt/demo/curl.sh
curl -X POST http://127.0.0.1:9998/SSD/predict -F "data=@$1" 2>/dev/null
/opt/demo/curl2.sh
curl -X POST http://127.0.0.1:9999/squeezenet/predict -F "data=@$1" 2>/dev/null
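The same call can be made from Python instead of curl. This sketch separates the parsing logic (run here on an abbreviated sample of the prediction payload shown in the server log above) from the HTTP call, which is commented out because it needs the third-party `requests` package and a running MMS instance.

```python
# Sketch: parsing an MMS prediction response; the HTTP call itself is shown
# commented out and assumes the same endpoint the curl.sh scripts use.
def top_prediction(response_json):
    """Return (class, probability) of the best guess in an MMS prediction."""
    candidates = response_json["prediction"][0]
    best = max(candidates, key=lambda c: c["probability"])
    return best["class"], best["probability"]

# To call the server (requires `requests` and a running model server):
# import requests
# with open("image.jpg", "rb") as f:
#     response_json = requests.post(
#         "http://127.0.0.1:9999/squeezenet/predict",
#         files={"data": f}).json()

sample = {"prediction": [[
    {"probability": 0.3365, "class": "n03710193 mailbox, letter box"},
    {"probability": 0.1523, "class": "n03764736 milk can"},
]]}
label, prob = top_prediction(sample)
```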
Flows

mxnetserverlocal.xml
mxnetmodelserver.xml

Reference
https://community.hortonworks.com/articles/155435/using-the-new-mxnet-model-server.html
https://community.hortonworks.com/articles/177232/apache-deep-learning-101-processing-apache-mxnet-m.html
https://mxnet.incubator.apache.org/model_zoo/
https://medium.com/apache-mxnet/mxnet-1-2-adds-built-in-support-for-onnx-e2c7450ffc28
https://mxnet.incubator.apache.org/api/python/gluon/model_zoo.html
https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data
https://github.com/onnx/models
https://github.com/awslabs/mxnet-model-server/blob/master/docs/model_zoo.md#lstm-ptb
https://github.com/awslabs/mxnet-model-server/blob/master/docs/model_zoo.md#arcface-resnet100_onnx
06-29-2018
09:27 PM
3 Kudos
Ingesting Blockchain Data from btc.com and blockchain.com (formerly blockchain.info)

Like everything in Apache NiFi, it is trivially easy to ingest all of these different feeds, process them, route them and store them for SQL access.

API: blockchain.com

https://blockchain.info/latestblock
https://api.blockchain.info/charts/transactions-per-second?timespan=5weeks&rollingAverage=8hours&format=json
https://blockchain.info/blocks/BTC.com?format=json
https://api.blockchain.info/pools?timespan=5days
https://api.blockchain.info/stats
https://blockchain.info/ticker
https://blockchain.info/tobtc?currency=USD&value=10000

Blocks For Today

https://chain.api.btc.com/v3/block/date/${now():format('yyyyMMdd')}

API: btc.com

https://chain.api.btc.com/v3/block/latest
https://chain.api.btc.com/v3/tx/unconfirmed

Blockchain Stats Call

{
"timestamp": 1.528924637E12,
"market_price_usd": 6284.846,
"hash_rate": 3.4875735626794174E10,
"total_fees_btc": 3517243180,
"n_btc_mined": 177500000000,
"n_tx": 208694,
"n_blocks_mined": 142,
"minutes_between_blocks": 9.5603,
"totalbc": 1709147500000000,
"n_blocks_total": 527318,
"estimated_transaction_volume_usd": 9.726263100037038E8,
"blocks_size": 130612249,
"miners_revenue_usd": 1.13766549673085E7,
"nextretarget": 528191,
"difficulty": 4940704885521,
"estimated_btc_sent": 15475738148615,
"miners_revenue_btc": 1810,
"total_btc_sent": 154234767570645,
"trade_volume_btc": 100767.92920045,
"trade_volume_usd": 6.333109167637314E8
}
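A few derived metrics can be computed straight from the stats payload above. This sketch assumes the *_btc fields are denominated in satoshis (1 BTC = 1e8 satoshis), which the magnitudes in the sample strongly suggest; values are copied from the example response.

```python
# Sketch: deriving fee and revenue metrics from the blockchain.info stats call.
# Assumption: total_fees_btc is in satoshis (1 BTC = 1e8 satoshis).
SATOSHI = 1e8

stats = {
    "total_fees_btc": 3517243180,
    "n_tx": 208694,
    "n_blocks_mined": 142,
    "miners_revenue_btc": 1810,
    "market_price_usd": 6284.846,
}

fees_btc = stats["total_fees_btc"] / SATOSHI              # ~35.17 BTC in fees
avg_fee_btc = fees_btc / stats["n_tx"]                    # average fee per transaction
revenue_per_block_btc = stats["miners_revenue_btc"] / stats["n_blocks_mined"]
fees_usd = fees_btc * stats["market_price_usd"]
```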
Deeper Use Case: Today's Blocks

Apache Avro Schema in JSON Format Stored in Hortonworks Schema Registry

{
"type": "record",
"name": "blocksfortoday",
"fields": [
{
"name": "height",
"type": "int",
"doc": "Type inferred from '527784'"
},
{
"name": "version",
"type": "int",
"doc": "Type inferred from '536870912'"
},
{
"name": "mrkl_root",
"type": "string",
"doc": "Type inferred from '\"c8f658ad595854f4c8c510b672447e838a2746e8724bfb26d0d127e5a4421385\"'"
},
{
"name": "timestamp",
"type": "int",
"doc": "Type inferred from '1529180826'"
},
{
"name": "bits",
"type": "int",
"doc": "Type inferred from '389609537'"
},
{
"name": "nonce",
"type": "int",
"doc": "Type inferred from '236046944'"
},
{
"name": "hash",
"type": "string",
"doc": "Type inferred from '\"00000000000000000017ca1d74bdee575dd48d4b3513eea2f7e06b313883d73d\"'"
},
{
"name": "prev_block_hash",
"type": "string",
"doc": "Type inferred from '\"00000000000000000000acf6259fffe63d36623c324f756faaf995a9e2896b87\"'"
},
{
"name": "next_block_hash",
"type": "string",
"doc": "Type inferred from '\"0000000000000000000000000000000000000000000000000000000000000000\"'"
},
{
"name": "size",
"type": "int",
"doc": "Type inferred from '189026'"
},
{
"name": "pool_difficulty",
"type": "long",
"doc": "Type inferred from '11831716619811'"
},
{
"name": "difficulty",
"type": "double",
"doc": "Type inferred from '4.940704885521827E12'"
},
{
"name": "tx_count",
"type": "int",
"doc": "Type inferred from '486'"
},
{
"name": "reward_block",
"type": "int",
"doc": "Type inferred from '1250000000'"
},
{
"name": "reward_fees",
"type": "int",
"doc": "Type inferred from '15691427'"
},
{
"name": "created_at",
"type": "int",
"doc": "Type inferred from '1529180835'"
},
{
"name": "confirmations",
"type": "int",
"doc": "Type inferred from '1'"
},
{
"name": "is_orphan",
"type": "boolean",
"doc": "Type inferred from 'false'"
},
{
"name": "curr_max_timestamp",
"type": "int",
"doc": "Type inferred from '1529180826'"
},
{
"name": "is_sw_block",
"type": "boolean",
"doc": "Type inferred from 'true'"
},
{
"name": "stripped_size",
"type": "int",
"doc": "Type inferred from '153817'"
},
{
"name": "weight",
"type": "int",
"doc": "Type inferred from '650477'"
},
{
"name": "extras",
"type": {
"type": "record",
"name": "extras",
"fields": [
{
"name": "pool_name",
"type": "string",
"doc": "Type inferred from '\"BTC.com\"'"
},
{
"name": "pool_link",
"type": "string",
"doc": "Type inferred from '\"https://pool.btc.com\"'"
}
]
},
"doc": "Type inferred from '{\"pool_name\":\"BTC.com\",\"pool_link\":\"https://pool.btc.com\"}'"
}
]
}

QueryRecord Query

SELECT *
FROM FLOWFILE
WHERE CAST(tx_count AS INT) > 0

Create an Apache Hive Table

CREATE EXTERNAL TABLE IF NOT EXISTS blocksfortoday1 (height INT, version INT, mrkl_root STRING, `timestamp` INT, bits INT, nonce INT, hash STRING, prev_block_hash STRING, next_block_hash STRING, size INT, pool_difficulty BIGINT, difficulty DOUBLE, tx_count INT, reward_block INT, reward_fees INT, created_at INT, confirmations INT, is_orphan BOOLEAN, curr_max_timestamp INT, is_sw_block BOOLEAN, stripped_size INT, weight INT, extras STRUCT<pool_name:STRING, pool_link:STRING>)
STORED AS ORC
LOCATION '/blocksfortoday1'

Example Apache Hive Query Run in Apache Zeppelin

select * from blocksfortoday1 where CAST(tx_count as INT) > 500 order by created_at desc

References:

https://btc.com/
https://www.blockchain.com/en/explorer
https://www.blockchain.com/en/api
https://github.com/blockchain
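The QueryRecord filter and the Zeppelin query above can be sanity-checked outside NiFi with plain Python. The sample records below are made up but use the same fields as the blocksfortoday schema.

```python
# Sketch: the QueryRecord filter (tx_count > 0) and the Zeppelin query
# (tx_count > 500 ORDER BY created_at DESC) expressed as plain Python.
records = [
    {"height": 527784, "tx_count": 486, "created_at": 1529180835},
    {"height": 527785, "tx_count": 0, "created_at": 1529181300},
    {"height": 527786, "tx_count": 742, "created_at": 1529181900},
]

# WHERE CAST(tx_count AS INT) > 0
non_empty = [r for r in records if int(r["tx_count"]) > 0]

# WHERE tx_count > 500 ORDER BY created_at DESC
busy = sorted((r for r in records if int(r["tx_count"]) > 500),
              key=lambda r: r["created_at"], reverse=True)
```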
06-29-2018
09:20 PM
1 Kudo
Working with Infura.io

Calling REST APIs is very easy with Apache NiFi. Let's use this to ingest a lot of data about Ethereum blockchains and transactions. We can also ingest and examine network status data. Infura provides secure, reliable, scalable access to Ethereum and IPFS. Check out the current status: https://infura.io/status

Apache NiFi Flows to Read From the Ethereum Blockchain via Infura REST APIs

An Example REST Call

We rename the file to make it unique and to note that it came from the ethbtc/full API call.

Example Data (JSON)

{"base": "ETH", "quote": "BTC", "tickers": [{"bid": 0.0750827, "ask": 0.07530999, "volume": 5159.53235584, "timestamp": 1528924408, "exchange": "bitstamp"}, {"bid": 0.07519, "ask": 0.0752, "volume": 1938.0570499, "timestamp": 1528924408, "exchange": "gemini"}, {"bid": 0.07515, "ask": 0.07516, "volume": 10310.62442822, "timestamp": 1528924408, "exchange": "gdax"}, {"bid": 0.075208, "ask": 0.075226, "volume": 45253.638, "timestamp": 1528924409, "exchange": "hitbtc"}, {"bid": 0.075156, "ask": 0.07517, "volume": 25932.60119467, "timestamp": 1528924409, "exchange": "bitfinex"}, {"bid": 0.07503287, "ask": 0.0751326, "volume": 6713.29011055, "timestamp": 1528924409, "exchange": "exmo"}, {"bid": 0.075141, "ask": 0.075191, "volume": 136176.55, "timestamp": 1528924409, "exchange": "binance"}, {"bid": 0.073687, "ask": 0.07566, "volume": 99.58262408, "timestamp": 1528924409, "exchange": "quoine"}, {"bid": 0.074975, "ask": 0.075251, "volume": 698.857662, "timestamp": 1528924409, "exchange": "cex"}, {"bid": 0.07494, "ask": 0.07517503, "volume": 12079.03438486, "timestamp": 1528924409, "exchange": "livecoin"}, {"bid": 0.07421999, "ask": 0.07574378, "volume": 106.69757, "timestamp": 1528924410, "exchange": "btc_markets"}]}

Calling Infura REST APIs

Note: for most use cases you do not need an API key. Make sure you stay under their limits and follow all of their terms of service.
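One simple thing to do with the ethbtc ticker payload is find the best bid and ask across exchanges. A small sketch, using an abbreviated three-exchange version of the sample data:

```python
# Sketch: best bid/ask across exchanges in the Infura ethbtc ticker payload.
payload = {
    "base": "ETH", "quote": "BTC",
    "tickers": [
        {"bid": 0.0750827, "ask": 0.07530999, "exchange": "bitstamp"},
        {"bid": 0.07519, "ask": 0.0752, "exchange": "gemini"},
        {"bid": 0.07515, "ask": 0.07516, "exchange": "gdax"},
    ],
}

best_bid = max(payload["tickers"], key=lambda t: t["bid"])   # highest buyer offer
best_ask = min(payload["tickers"], key=lambda t: t["ask"])   # lowest seller ask
print(best_bid["exchange"], best_ask["exchange"])
```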
API Calls

https://api.infura.io/v2/blacklist
https://api.infura.io/v1/ticker/ethbtc/full
https://api.infura.io/v1/ticker/ethbtc
https://api.infura.io/v1/ticker/symbols

Format the Files

Example Expression Language (note the date format tokens: 'yyyyMMddHHmmss' — lowercase 'mm' means minutes and uppercase 'MM' means months):

${filename:append('infurasymbols.'):append(${now():format('yyyyMMddHHmmss'):append(${md5}):append('.json')})}

References:
https://blog.infura.io/getting-started-with-infura-28e41844cc89
https://infura.io/
https://infura.io/docs
https://infura.io/status
https://infura.docs.apiary.io/
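The expression-language filename pattern above can be approximated in Python with the standard library, which also makes the timestamp tokens explicit (%Y%m%d%H%M%S corresponds to 'yyyyMMddHHmmss' in NiFi EL). The prefix "ethbtc-" is a made-up example filename.

```python
# Sketch: Python equivalent of the NiFi EL unique-filename expression.
import hashlib
from datetime import datetime

def unique_name(filename, content):
    """filename + 'infurasymbols.' + timestamp + md5(content) + '.json'"""
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    digest = hashlib.md5(content).hexdigest()
    return "{0}infurasymbols.{1}{2}.json".format(filename, stamp, digest)

name = unique_name("ethbtc-", b'{"base": "ETH"}')
```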
06-16-2018
02:38 PM
2 Kudos
Using Apache MXNet GluonCV with Apache NiFi for Deep Learning Computer Vision

Source: https://github.com/tspannhw/OpenSourceComputerVision/

Gluon and Apache MXNet have been great for deep learning, especially for newbies like me. It got even better! They added a deep learning toolkit that is easy to use and includes a number of great pre-trained models you can easily apply to some general computer vision use cases. So I took a simple, well-documented example and tweaked it to save the final image and send some JSON details via MQTT to Apache NiFi. This may sound familiar: https://community.hortonworks.com/articles/198912/ingesting-apache-mxnet-gluon-deep-learning-results.html

GluonCV makes this even easier! Let's check it out. Again we take a simple Python example, tweak it, run it via a shell script and send the results over MQTT. See: https://gluon-cv.mxnet.io/build/examples_detection/demo_ssd.html#sphx-glr-build-examples-detection-demo-ssd-py

Python Code: https://github.com/tspannhw/UsingGluonCV/tree/master

This is the saved annotated figure.

Simple Apache NiFi Flow to Ingest MQTT Data from the GluonCV Example Python and Store to Hive, Parquet and HBase

A simple flow:

ConsumeMQTT
InferAvroSchema
RouteOnContent
MergeRecord (convert batches of JSON to a single Avro file)
ConvertAvroToORC
PutHDFS
PutParquet
PutHBaseRecord

Again, Apache NiFi generates a schema for us from data examination. There's a really cool project coming out of New Jersey that has advanced schema generation looking at tables; I'll report on that later. We take the schema, save it to the Schema Registry, and are ready to merge records. One thing you may want to do is turn regular types from "type": "string" to "type": ["string","null"].

Schema

{
"type": "record",
"name": "gluoncv",
"fields": [
{
"name": "imgname",
"type": "string",
"doc": "Type inferred from '\"images/gluoncv_image_20180615203319_6e0e5f0b-d2aa-4e94-b7e9-8bb7f29c9512.jpg\"'"
},
{
"name": "host",
"type": "string",
"doc": "Type inferred from '\"HW13125.local\"'"
},
{
"name": "shape",
"type": "string",
"doc": "Type inferred from '\"(1, 3, 512, 910)\"'"
},
{
"name": "end",
"type": "string",
"doc": "Type inferred from '\"1529094800.88097\"'"
},
{
"name": "te",
"type": "string",
"doc": "Type inferred from '\"2.4256367683410645\"'"
},
{
"name": "battery",
"type": "int",
"doc": "Type inferred from '100'"
},
{
"name": "systemtime",
"type": "string",
"doc": "Type inferred from '\"06/15/2018 16:33:20\"'"
},
{
"name": "cpu",
"type": "double",
"doc": "Type inferred from '23.2'"
},
{
"name": "diskusage",
"type": "string",
"doc": "Type inferred from '\"112000.8 MB\"'"
},
{
"name": "memory",
"type": "double",
"doc": "Type inferred from '65.8'"
},
{
"name": "id",
"type": "string",
"doc": "Type inferred from '\"20180615203319_6e0e5f0b-d2aa-4e94-b7e9-8bb7f29c9512\"'"
}
]
}
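The earlier tip about widening "type": "string" into a nullable union can be automated. This sketch post-processes an inferred Avro schema (abbreviated here to one field) so every primitive field type becomes a ["type", "null"] union:

```python
# Sketch: make every primitive field in an inferred Avro schema nullable.
import json

def make_nullable(schema):
    """Rewrite primitive field types into [type, "null"] unions in place."""
    for field in schema.get("fields", []):
        if isinstance(field["type"], str):
            field["type"] = [field["type"], "null"]
    return schema

inferred = json.loads('{"type": "record", "name": "gluoncv", "fields": '
                      '[{"name": "imgname", "type": "string"}]}')
nullable = make_nullable(inferred)
```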
Example JSON

{"imgname": "images/gluoncv_image_20180615203615_c83fed6f-2ec8-4841-97e3-40985f7859ad.jpg", "host": "HW13125.local", "shape": "(1, 3, 512, 910)", "end": "1529094976.237143", "te": "1.8907802104949951", "battery": 100, "systemtime": "06/15/2018 16:36:16", "cpu": 29.3, "diskusage": "112008.6 MB", "memory": 66.5, "id": "20180615203615_c83fed6f-2ec8-4841-97e3-40985f7859ad"}

Table Generated

CREATE EXTERNAL TABLE IF NOT EXISTS gluoncv (imgname STRING, host STRING, shape STRING, `end` STRING, te STRING, battery INT, systemtime STRING, cpu DOUBLE, diskusage STRING, memory DOUBLE, id STRING) STORED AS ORC LOCATION '/gluoncv'

Note: `end` is a reserved word in Hive, so it needs backticks, as in the earlier tesseract DDL.

Parquet Table

create external table gluoncv_parquet (imgname STRING, host STRING, shape STRING, `end` STRING, te STRING, battery INT, systemtime STRING, cpu DOUBLE, diskusage STRING, memory DOUBLE, id STRING) STORED AS PARQUET LOCATION '/gluoncvpar'

Reference:

https://gluon-cv.mxnet.io/
https://gluon-cv.mxnet.io/build/examples_detection/index.html
https://medium.com/apache-mxnet/gluoncv-deep-learning-toolkit-for-computer-vision-9218a907e8da
06-15-2018
04:11 PM
Adding Parquet Output

https://cwiki.apache.org/confluence/display/Hive/Parquet

create external table gluon2_parquet (top1pct STRING, top2pct STRING, top3pct STRING, top4pct STRING, top5pct STRING, top1 STRING, top2 STRING, top3 STRING, top4 STRING, top5 STRING,
imgname STRING, host STRING, `end` STRING, te STRING, battery INT, systemtime STRING, cpu DOUBLE, diskusage STRING, memory DOUBLE, id STRING)
STORED AS PARQUET
LOCATION '/gluon2par'

select * from gluon2_parquet

Add the PutParquet Processor
06-15-2018
03:16 PM
3 Kudos
Ingesting Apache MXNet Gluon Deep Learning Results Via MQTT and Apache NiFi

Summary: we use a pre-trained model in Apache MXNet Gluon Python 3 code to classify a webcam image captured and processed with OpenCV. In our Python script, we save the image to disk and capture JSON metadata about the percentages, probabilities and device information. This JSON data is then sent via MQTT to a broker, and Apache NiFi processes it.

Example Image

Source Code

Schema: https://github.com/tspannhw/OpenSourceComputerVision/blob/master/gluon2.avsc
Python Source: https://github.com/tspannhw/OpenSourceComputerVision/blob/master/nifigluon2.py
Shell Script: https://github.com/tspannhw/OpenSourceComputerVision/blob/master/rungluon2.sh

SQL Table DDL

CREATE EXTERNAL TABLE IF NOT EXISTS gluon2 (top1pct STRING, top2pct STRING, top3pct STRING,
top4pct STRING, top5pct STRING, top1 STRING, top2 STRING, top3 STRING, top4 STRING,
top5 STRING, imgname STRING, host STRING, `end` STRING, te STRING, battery INT,
systemtime STRING, cpu DOUBLE, diskusage STRING, memory DOUBLE, id STRING)
STORED AS ORC
LOCATION '/gluon2'

Technologies: Python 3, Apache MXNet, Gluon, MQTT, Apache NiFi, OpenCV. Based on http://gluon-crash-course.mxnet.io/predict.html

Apache NiFi Overview

Steps

ConsumeMQTT: ingest MQTT data from the gluon2 topic sent from Python.
InferAvroSchema: grab the schema once, then you can remove this processor.
RouteOnContent: throw away errors.
MergeRecord: convert many JSON records into one large Apache Avro file.
ConvertAvroToORC: convert that Apache Avro file into an Apache ORC file.
PutHDFS: store the Apache ORC file in HDFS. A side effect of the process is that it produces a SQL DDL to create a new table for this schema.

Table Example
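The Python script flattens its top-5 predictions into the top1..top5 / top1pct..top5pct fields of the gluon2 table. A minimal sketch of that shaping step, with made-up placeholder predictions:

```python
# Sketch: flatten ranked (label, probability) predictions into the
# top1..top5 / top1pct..top5pct record layout used by the gluon2 table.
def to_gluon2_record(predictions, base):
    """predictions: [(label, probability in [0,1]), ...] sorted descending."""
    record = dict(base)
    for i, (label, prob) in enumerate(predictions[:5], start=1):
        record["top%d" % i] = label
        record["top%dpct" % i] = "%.2f%%" % (prob * 100)
    return record

record = to_gluon2_record(
    [("racer", 0.374601), ("sports car", 0.253521)],
    {"host": "HW13125.local", "battery": 100})
```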
06-14-2018
09:15 PM
We are analyzing this Unsplash picture: https://raw.githubusercontent.com/tspannhw/DWS-DeepLearning-CrashCourse/master/photo1.jpg
06-14-2018
06:45 PM
This is a continuation of my series on running TensorFlow and Apache MXNet applications in HDP, HDF and on edge nodes.

https://community.hortonworks.com/articles/118132/minifi-capturing-converting-tensorflow-inception-t.html
https://dzone.com/articles/integrating-tensorflow-16-image-labelling-with-hdf
https://community.hortonworks.com/articles/80339/iot-capturing-photos-and-analyzing-the-image-with.html
https://community.hortonworks.com/articles/103863/using-an-asus-tinkerboard-with-tensorflow-and-pyth.html
https://community.hortonworks.com/articles/83100/deep-learning-iot-workflows-with-raspberry-pi-mqtt.html
06-14-2018
06:30 PM
3 Kudos
Executing TensorFlow Classifications from Apache NiFi Using Apache Spark 2.3 and Apache Livy

Technology: Apache Spark 2.3 + Apache Livy + Apache NiFi 1.5 + TensorFlow + Python

Python Code: https://github.com/tspannhw/DWS-DeepLearning-CrashCourse/blob/master/tensorflowsparknifi.py

TIP: In this version of Apache NiFi, you need to use double quotes (") instead of single quotes (') in your Python code.

Python Code for NiFi ExecuteSparkInteractive: see GitHub.

Simple Apache NiFi Flow To Execute TensorFlow Python Applications via Apache Livy

I am just using Apache Livy as the transport from Apache NiFi to Apache Spark. My Apache Spark 2.3 cluster is not doing any Spark-specific processing; PySpark is just running a vanilla TensorFlow Python application in this version. We could also call TensorFlow-on-Spark code this way. My goal was to run TensorFlow on my Spark cluster, triggered from Apache NiFi, and get back results.

Results Returned in Success From the ExecuteSparkInteractive Call

{
"text/plain" : "273\tracer, race car, racing car\t37.4601334333%\n\n274\tsports car, sport car\t25.3520905972%\n\n267\tcab, hack, taxi, taxicab\t11.1182622612%\n\n268\tconvertible\t9.85431224108%\n\n271\tminivan\t3.22951599956%"
}

Apache Livy UI Showing Results of Runs

This is the ExecuteSparkInteractive processor. We can put the code in the Code property or pass it in.

Let's Configure a PySpark Apache Livy Controller

LogSearch

There is a technical preview of LogSearch, which is great for finding issues in HDF or HDP components. This is easier than searching logs manually, though I can easily write NiFi code to search logs as well.

References:

https://community.hortonworks.com/articles/177663/apache-livy-apache-nifi-apache-spark-executing-sca.html
https://community.hortonworks.com/articles/171787/hdf-31-executing-apache-spark-via-executesparkinte.html
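The text/plain result above packs each classification as "index<TAB>labels<TAB>percent", with rows separated by blank lines. A small sketch of parsing it back into structured records, using an abbreviated copy of the sample result:

```python
# Sketch: parse the tab-separated text/plain result returned by the
# ExecuteSparkInteractive call into structured rows.
result = ("273\tracer, race car, racing car\t37.4601334333%\n\n"
          "274\tsports car, sport car\t25.3520905972%\n\n"
          "267\tcab, hack, taxi, taxicab\t11.1182622612%")

def parse_predictions(text):
    rows = []
    for line in text.split("\n\n"):
        index, labels, pct = line.split("\t")
        rows.append({"index": int(index),
                     "labels": labels,
                     "percent": float(pct.rstrip("%"))})
    return rows

predictions = parse_predictions(result)
```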