Posts: 1973
Kudos Received: 1225
Solutions: 124
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2458 | 04-03-2024 06:39 AM |
| | 3806 | 01-12-2024 08:19 AM |
| | 2053 | 12-07-2023 01:49 PM |
| | 3037 | 08-02-2023 07:30 AM |
| | 4156 | 03-29-2023 01:22 PM |
10-12-2016
03:33 PM
3 Kudos
Often lines of business, individual users, or shared teams will use online Google Sheets to share spreadsheet and tabular data among teams or with outside vendors. It's quick and easy to add sheets and store your data in Google Drive as spreadsheets. Often you will want to consolidate, federate, analyze, enrich, and use this data for reporting and dashboards throughout your organization. An easy way to do that is to read in the data using Google's Sheets API, a standard SSL HTTP REST API that returns clean JSON data. I created a simple Google Sheet to test ingesting a Google Sheet with HDF. You will need to enable the Google Sheets API in the Google APIs Console, and you must be logged into Google with a Google account (use the one where you created your spreadsheets).
Google Documentation
Google provides a few quickstarts that you can use to ingest this data: https://developers.google.com/sheets/quickstart/js or https://developers.google.com/sheets/quickstart/python. I chose to ingest this data the easiest way, with a simple REST call from NiFi.
Testing Your Queries in Google's API Explorer
To test your queries and get your exact URL, go to Google's API Explorer: https://developers.google.com/apis-explorer/#p/sheets/v4/
GET https://sheets.googleapis.com/v4/spreadsheets/1sbMyDocID?includeGridData=true&key=MYKEYISFROMGOOGLE
where 1sb... is the document ID that comes from the name you see in your Google Sheet page, like so: https://docs.google.com/spreadsheets/d/1UMyDocumentId/edit#g.
Calling the API From HDF 2.0
The one thing you will need is to set up a StandardSSLContextService to read in HTTPS data. You will need to grab the cacerts truststore file from the JRE that NiFi is using to run. By default the truststore password is changeit; you really should change it. Once you have an SSL configuration set up, you can use a GetHTTP processor, adding the Google Sheets API URL that includes the sheet ID. I also set the User Agent, the Accept Content-Type, and Follow Redirects = true. Now that we have SSL enabled, we can make our call to Google (a command-line version of the same call is sketched at the end of this article). The flow itself is pretty simple.
Now that I have ingested the Google Sheet, I can store it as JSON in my data lake. You could process this in HDF many ways, including taking out fields, enriching with other data sources, converting to Avro or ORC, or storing it in a Hive table, Phoenix, or HBase. You have now ingested Google Sheet data; determining what you want to do with it and parsing out the JSON is a fun exercise. You can use an EvaluateJsonPath processor in Apache NiFi to pull out the fields you want: inside that processor you add a property and then a JsonPath value, like so: $.entities.media[0].media_url.
HDF 2.0 Diagram Overview
Reference:
https://community.hortonworks.com/articles/59349/hdf-20-flow-for-ingesting-real-time-tweets-from-st.html
http://jsonpath.com/
https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.EvaluateJsonPath/
https://community.hortonworks.com/questions/21011/how-i-extract-attribute-from-json-file-using-nifi.html
https://jsonpath.curiousconcept.com/
https://developers.google.com/sheets/guides/authorizing
https://codelabs.developers.google.com/codelabs/sheets-api/#0
https://developers.google.com/sheets/samples/
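To try the same request outside of NiFi, you can make it from the command line. This is a minimal sketch; YOUR_SHEET_ID and YOUR_API_KEY are placeholders for your own document ID and API key:
# Same GET that the GetHTTP processor issues; returns the sheet as JSON
curl "https://sheets.googleapis.com/v4/spreadsheets/YOUR_SHEET_ID?includeGridData=true&key=YOUR_API_KEY"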
10-07-2016
03:39 PM
3 Kudos
With NiFi 1.0.0 I ingested a lot of image data from drones, mostly to get metadata like geolocation. I also ingested a resized version of each image in case I wanted to use it, and I found a use for it: serving the images on a simple HTML page with Spring. So I wrote a quick Java program to pull out the fields I had stored in Phoenix (from the metadata) and display the image. I could have streamed it out of HDFS using the HDFS libraries to read the file and then stream it myself.
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// out is the page StringBuilder and connection is a Phoenix JDBC connection (setup omitted)
String sql = "select datekey, fileName, gPSAltitude, gPSLatitude, gPSLongitude, orientation, geolat, geolong, inception from dronedata1 order by datekey asc";
out.append(STATIC_HEADER);
PreparedStatement ps = connection.prepareStatement(sql);
ResultSet res = ps.executeQuery();
while (res.next()) {
    try {
        // Embed each image via WebHDFS and list its date and coordinates beside it
        out.append("<br><br>\n<table width=100%><tr><td valign=top><img src=\"");
        out.append("http://tspanndev10.field.hortonworks.com:50070/webhdfs/v1/drone/")
           .append(res.getString("fileName")).append("?op=OPEN\"></td>");
        out.append("<td valign=top>Date: ").append(res.getString("datekey"));
        out.append("\n<br>Lat: ").append(res.getString("geolat"));
        out.append("\n<br>Long: ").append(res.getString("geolong"));
        out.append("\n<br><br>\n</td></tr></table>\n");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
It was a lot easier to use the built-in WebHDFS REST API to display an image: wrapping the WebHDFS call to the image file in an HTML IMG SRC tag loads our image. http://node1:50070/webhdfs/v1/drone/Bebop2_20160920083655-0400.jpg?op=OPEN It's pretty simple, and you can use this with a MEAN application, Python Flask, or your non-JVM front end of choice. And now you have a solid distributed host for your images. I recommend this only for internal sites and public images; exposing this data publicly on the cloud is dangerous!
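Before wiring the URL into a page, you can sanity-check it from the command line. A minimal sketch using the example URL above; the -L flag matters because a WebHDFS OPEN call redirects to a datanode:
# Follow the WebHDFS redirect and save the image locally
curl -L -o test.jpg "http://node1:50070/webhdfs/v1/drone/Bebop2_20160920083655-0400.jpg?op=OPEN"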
10-05-2016
01:39 PM
5 Kudos
1. Acquire an EDI File (GetFile, GetFTP, GetHTTP, GetSFTP, Fetch...)
2. Install the open source nifi-edireader bundle on NiFi 1.0.0 (a consolidated command sketch follows these sub-steps)
Download https://github.com/BerryWorksSoftware/edireader
Maven install the BerryWorks EDIReader
Download https://github.com/mrcsparker/nifi-edireader-bundle
Maven package nifi-edireader (requires Maven 3.3 or newer; you may have to download and install it separately, since the standard Linux package is older)
Copy nifi-edireader-nar/target/nifi-edireader-nar-0.0.1.nar to your NiFi lib directory
Restart NiFi Service
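Putting step 2 together, the build and install look roughly like this. A sketch under assumptions: /opt/nifi is a placeholder for your NiFi install path, and the NAR name comes from the sub-steps above:
# Build the EDIReader parser and the NiFi bundle that wraps it
git clone https://github.com/BerryWorksSoftware/edireader
(cd edireader && mvn clean install)
git clone https://github.com/mrcsparker/nifi-edireader-bundle
(cd nifi-edireader-bundle && mvn package)
# Drop the NAR into NiFi's lib directory, then restart NiFi
cp nifi-edireader-bundle/nifi-edireader-nar/target/nifi-edireader-nar-0.0.1.nar /opt/nifi/lib/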
3. Add EdiXML Processor and connect from EDI File input
4. Add extra processing, conversion or routing (TransformXML with XSLT or EvaluateXPATH) to convert to JSON
5. Land to HDFS (PutHDFS)
6. Use the web form linked below to generate a test EDI file:
ISA*00* *00* *ZZ*SENDER ID *ZZ*RECEIVER ID *010101*0101*U*00401*000000001*0*T*!
GS*IN*SENDER ID*APP RECEIVER*01010101*01010101*1*X*004010
ST*810*0001
BIG*20021208*00001**A999
N1*ST*Timothy Spann*9*122334455
N3*115 xxx ave
N4*xxxtown*nj*08520
N1*BT*Hortonworks*9*122334455
N3*5470 GREAT AMERICA PARKWAY
N4*santa clara*CA*95054
ITD*01*3*2**30**30*****60
FOB*PP
IT1**1*EA*200**UA*EAN
PID*F****Lamp
IT1**4*EA*50**UA*EAN
PID*F****Chair
TDS*2000
CAD*****Routing
ISS*30*CA
CTT*50
SE*19*0001
GE*1*1
IEA*1*000000001
7. Converted to XML
<?xml version="1.0" encoding="UTF-8"?>
<ediroot>
<interchange Standard="ANSI X.12"
AuthorizationQual="00"
Authorization=" "
SecurityQual="00"
Security=" "
Date="010101"
Time="0101"
StandardsId="U"
Version="00401"
Control="000000001"
AckRequest="0"
TestIndicator="T">
<sender>
<address Id="SENDER ID " Qual="ZZ"/>
</sender>
<receiver>
<address Id="RECEIVER ID " Qual="ZZ"/>
</receiver>
<group GroupType="IN"
ApplSender="SENDER ID"
ApplReceiver="APP RECEIVER"
Date="01010101"
Time="01010101"
Control="1"
StandardCode="X"
StandardVersion="004010">
<transaction DocType="810" Name="Invoice" Control="0001">
<segment Id="BIG">
<element Id="BIG01">20021208</element>
<element Id="BIG02">00001</element>
<element Id="BIG04">A999</element>
</segment>
<loop Id="N1">
<segment Id="N1">
<element Id="N101">ST</element>
<element Id="N102">Timothy Spann</element>
<element Id="N103">9</element>
<element Id="N104">122334455</element>
</segment>
<segment Id="N3">
<element Id="N301">115 xxx ave</element>
</segment>
<segment Id="N4">
<element Id="N401">xxxstown</element>
<element Id="N402">nj</element>
<element Id="N403">08520</element>
</segment>
</loop>
<loop Id="N1">
<segment Id="N1">
<element Id="N101">BT</element>
<element Id="N102">Hortonworks</element>
<element Id="N103">9</element>
<element Id="N104">122334455</element>
</segment>
<segment Id="N3">
<element Id="N301">5470 GREAT AMERICA PARKWAY</element>
</segment>
<segment Id="N4">
<element Id="N401">santa clara</element>
<element Id="N402">CA</element>
<element Id="N403">95054</element>
</segment>
</loop>
<segment Id="ITD">
<element Id="ITD01">01</element>
<element Id="ITD02">3</element>
<element Id="ITD03">2</element>
<element Id="ITD05">30</element>
<element Id="ITD07">30</element>
<element Id="ITD12">60</element>
</segment>
<segment Id="FOB">
<element Id="FOB01">PP</element>
</segment>
<loop Id="IT1">
<segment Id="IT1">
<element Id="IT102">1</element>
<element Id="IT103">EA</element>
<element Id="IT104">200</element>
<element Id="IT106">UA</element>
<element Id="IT107">EAN</element>
</segment>
<loop Id="PID">
<segment Id="PID">
<element Id="PID01">F</element>
<element Id="PID05">Lamp</element>
</segment>
</loop>
</loop>
<loop Id="IT1">
<segment Id="IT1">
<element Id="IT102">4</element>
<element Id="IT103">EA</element>
<element Id="IT104">50</element>
<element Id="IT106">UA</element>
<element Id="IT107">EAN</element>
</segment>
<loop Id="PID">
<segment Id="PID">
<element Id="PID01">F</element>
<element Id="PID05">Chair</element>
</segment>
</loop>
</loop>
<segment Id="TDS">
<element Id="TDS01">2000</element>
</segment>
<segment Id="CAD">
<element Id="CAD05">Routing</element>
</segment>
<loop Id="ISS">
<segment Id="ISS">
<element Id="ISS01">30</element>
<element Id="ISS02">CA</element>
</segment>
</loop>
<segment Id="CTT">
<element Id="CTT01">50</element>
</segment>
</transaction>
</group>
</interchange>
</ediroot>
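If you just want to spot-check fields in the generated XML before building out the full flow (step 4), an XPath query from the command line works. A minimal sketch, assuming the XML above is saved as invoice.xml and xmllint is installed:
# Pull the invoice date (BIG01) out of the EDIReader XML; prints 20021208
xmllint --xpath 'string(//segment[@Id="BIG"]/element[@Id="BIG01"])' invoice.xml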
Resources
https://github.com/mrcsparker/nifi-edireader-bundle
https://github.com/BerryWorksSoftware/edireader
https://en.wikipedia.org/wiki/Electronic_data_interchange
https://en.wikipedia.org/wiki/EDIFACT
https://en.wikipedia.org/wiki/FORTRAS
http://databene.org/edifatto.html
https://sourceforge.net/projects/edifatto/
https://secure.edidev.net/edidev-ca/samples/vbNetGen/WebFrmNetGen.aspx (Generate example EDI)
10-01-2016
11:13 PM
2 Kudos
I ran the same flow myself and examined the Avro file in HDFS using the Avro CLI. Even though I didn't specify Snappy compression, it was there in the file:
java -jar avro-tools-1.8.0.jar getmeta 23568764174290.avro
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
avro.codec	snappy
avro.schema	{"type":"record","name":"people","doc":"Schema generated by Kite","fields":[{"name":"id","type":"long","doc":"Type inferred from '2'"},{"name":"first_name","type":"string","doc":"Type inferred from 'Gregory'"},{"name":"last_name","type":"string","doc":"Type inferred from 'Vasquez'"},{"name":"email","type":"string","doc":"Type inferred from 'gvasquez1@pcworld.com'"},{"name":"gender","type":"string","doc":"Type inferred from 'Male'"},{"name":"ip_address","type":"string","doc":"Type inferred from '32.8.254.252'"},{"name":"company_name","type":"string","doc":"Type inferred from 'Janyx'"},{"name":"domain_name","type":"string","doc":"Type inferred from 'free.fr'"},{"name":"file_name","type":"string","doc":"Type inferred from 'NonMauris.xls'"},{"name":"mac_address","type":"string","doc":"Type inferred from '03-FB-66-0F-20-A3'"},{"name":"user_agent","type":"string","doc":"Type inferred from '\"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7;'"},{"name":"lat","type":"string","doc":"Type inferred from ' like Gecko) Version/5.0.4 Safari/533.20.27\"'"},{"name":"long","type":"double","doc":"Type inferred from '26.98829'"}]}
Snappy is hard-coded in NiFi: https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-kite-bundle/nifi-kite-processors/src/main/java/org/apache/nifi/processors/kite/ConvertCSVToAvro.java
The processor always adds Snappy compression to every Avro file, with no option to turn it off:
writer.setCodec(CodecFactory.snappyCodec());
Make sure you have a schema set in the Record Schema property:
Record Schema: ${inferred.avro.schema}
If you can make everything strings and convert to other types later, you will be happier. A couple of commands for inspecting the output follow the references.
References:
https://www.linkedin.com/pulse/converting-csv-avro-apache-nifi-jeremy-dyer
https://community.hortonworks.com/questions/44063/nifi-avro-to-csv-or-json-to-csvnifi-convert-avro-t.html
https://community.hortonworks.com/articles/28341/converting-csv-to-avro-with-apache-nifi.html
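The same avro-tools jar can also dump the schema and the records themselves. A minimal sketch using the file name from the run above:
# Print just the schema, then the first few records as JSON
java -jar avro-tools-1.8.0.jar getschema 23568764174290.avro
java -jar avro-tools-1.8.0.jar tojson 23568764174290.avro | head
avro-tools also ships a recodec command if you ever need to rewrite a file with a different codec; check its usage output for the exact arguments.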
09-30-2016
10:40 PM
Here's the simple Zeppelin notebook file: twitter-from-strata-hadoop-processing.txt
Rename it to .json. For security, the site doesn't allow uploading or downloading .js or .json files, hence the .txt extension.
10-11-2016
08:41 PM
TensorFlow 0.11 is out:
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.11.0rc0-cp27-none-linux_x86_64.whl
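For completeness, the step that normally follows that export in the TensorFlow install instructions of that era is a pip upgrade. A sketch, assuming a Linux box with Python 2.7 to match the wheel above:
# Install/upgrade TensorFlow from the wheel URL exported above
sudo pip install --upgrade $TF_BINARY_URL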
10-02-2018
05:42 PM
Not a Kerberized cluster. Maybe this will help: https://stackoverflow.com/questions/40595332/how-to-connect-to-a-kerberos-secured-apache-phoenix-data-source-with-wildfly
09-14-2016
02:59 AM
3 Kudos
Running Spark Jobs Through Apache Beam on an HDP 2.5 YARN Cluster
Using the Spark Runner with Apache Beam
Apache Beam is still in the Apache Incubator and not yet supported on HDP 2.5 or other platforms.
sudo yum -y install git
wget http://www.gtlib.gatech.edu/pub/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
After you download Maven, move it to /opt/demo/maven or put it in your path. The Maven download mirror will change, so grab a fresh URL from http://maven.apache.org/. Installing Maven with yum will give you an older, unsupported version that may interfere with something else, so I recommend getting a new Maven just for this build. Make sure you have Java 7 or greater, which you should already have on an HDP machine; I recommend Java 8 on your new HDP 2.5 nodes if possible.
cd /opt/demo/
git clone https://github.com/apache/incubator-beam
cd incubator-beam
/opt/demo/maven/bin/mvn clean install -DskipTests
If you want to run this on Spark 2.0 rather than Spark 1.6.2, see here for changing the environment: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/spark-choose-version.html
For HDP 2.5, these are the parameters:
spark-submit --class org.apache.beam.runners.spark.examples.WordCount --master yarn-client target/beam-runners-spark-0.3.0-incubating-SNAPSHOT-spark-app.jar --inputFile=kinglear.txt --output=out --runner=SparkRunner --sparkMaster=yarn-client
Note, I had to change the parameters to get this to work in my environment. You may also need to run /opt/demo/maven/bin/mvn package from the /opt/demo/incubator-beam/runners/spark directory. This runs a Java 7 example from the built-in examples: https://github.com/apache/incubator-beam/tree/master/examples/java
These are the results of running our small Spark job:
16/09/14 02:35:08 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 34.0 KB, free 518.7 KB)
16/09/14 02:35:08 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.26.195.58:39575 (size: 34.0 KB, free: 511.1 MB)
16/09/14 02:35:08 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1008
16/09/14 02:35:08 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[14] at mapToPair at TransformTranslator.java:568)
16/09/14 02:35:08 INFO YarnScheduler: Adding task set 1.0 with 2 tasks
16/09/14 02:35:08 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, tspanndev13.field.hortonworks.com, partition 0,NODE_LOCAL, 1994 bytes)
16/09/14 02:35:08 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, tspanndev13.field.hortonworks.com, partition 1,NODE_LOCAL, 1994 bytes)
16/09/14 02:35:08 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on tspanndev13.field.hortonworks.com:36438 (size: 34.0 KB, free: 511.1 MB)
16/09/14 02:35:08 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on tspanndev13.field.hortonworks.com:36301 (size: 34.0 KB, free: 511.1 MB)
16/09/14 02:35:08 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to tspanndev13.field.hortonworks.com:52646
16/09/14 02:35:08 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 177 bytes
16/09/14 02:35:08 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to tspanndev13.field.hortonworks.com:52640
16/09/14 02:35:09 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 681 ms on tspanndev13.field.hortonworks.com (1/2)
16/09/14 02:35:09 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 1112 ms on tspanndev13.field.hortonworks.com (2/2)
16/09/14 02:35:09 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/09/14 02:35:09 INFO DAGScheduler: ResultStage 1 (saveAsNewAPIHadoopFile at TransformTranslator.java:745) finished in 1.113 s
16/09/14 02:35:09 INFO DAGScheduler: Job 0 finished: saveAsNewAPIHadoopFile at TransformTranslator.java:745, took 5.422285 s
16/09/14 02:35:09 INFO SparkRunner: Pipeline execution complete.
16/09/14 02:35:09 INFO SparkContext: Invoking stop() from shutdown hook
[root@tspanndev13 spark]# hdfs dfs -ls
Found 5 items
drwxr-xr-x - root hdfs 0 2016-09-14 02:35 .sparkStaging
-rw-r--r-- 3 root hdfs 0 2016-09-14 02:35 _SUCCESS
-rw-r--r-- 3 root hdfs 185965 2016-09-14 01:44 kinglear.txt
-rw-r--r-- 3 root hdfs 27304 2016-09-14 02:35 out-00000-of-00002
-rw-r--r-- 3 root hdfs 26515 2016-09-14 02:35 out-00001-of-00002
[root@tspanndev13 spark]# hdfs dfs -cat out-00000-of-00002
oaths: 1
bed: 7
hearted: 5
warranties: 1
Refund: 1
unnaturalness: 1
sea: 7
sham'd: 1
Only: 2
sleep: 8
sister: 29
Another: 2
carbuncle: 1
As you can see, it produced the expected two-part output file in HDFS with the word counts. Not much configuration is required to run your Apache Beam Java jobs on your HDP 2.5 YARN Spark cluster, so if you have a development cluster, this would be a great place to try it out; or try it on your own HDP 2.5 sandbox. A quick spot-check of the output follows the resource links.
Resources:
http://beam.incubator.apache.org/learn/programming-guide/
https://github.com/apache/incubator-beam/tree/master/runners/spark
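As that spot-check, you can grep both output shards for a word whose count you expect. A minimal sketch using the file names from the listing above; each output line has the form "word: count":
# sister appears in the sample output above with a count of 29
hdfs dfs -cat out-00000-of-00002 out-00001-of-00002 | grep -w sister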
09-15-2016
02:15 AM
Flow File: sensor.xml
05-01-2017
04:26 PM
Thanks a lot for this article. What are you using to run TF on Spark in this configuration?