11-01-2016
01:37 PM
For Phoenix, updates and inserts are both handled by the UPSERT command: if a row with the same primary key already exists it is updated, otherwise a new row is inserted. For deletes you have a few options: 1) call ExecuteProcess to run a command-line program that issues the delete; 1b) call ExecuteScript and let Python/Groovy/JavaScript issue the delete; 2) send a message to Spark Streaming (via Site-to-Site or Kafka) and delete through Spark SQL. DELETE follows general SQL rules. Apache Phoenix DELETE reference: https://phoenix.apache.org/language/index.html#delete
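As a minimal sketch of that upsert-then-delete behavior, assuming a Phoenix Query Server is reachable via the phoenixdb client (the URL here is a placeholder, and the table/columns are borrowed from the ingest article below):

import phoenixdb  # Python client for the Phoenix Query Server

# Placeholder URL; adjust for your cluster.
conn = phoenixdb.connect('http://localhost:8765/', autocommit=True)
cur = conn.cursor()

# The first UPSERT inserts the row; the second updates it in place,
# because both carry the same primary key (md5).
cur.execute("UPSERT INTO useraccount (md5, email) VALUES (?, ?)",
            ('9e60da6d4490cd6d102e8010ac98f283', 'lohan.marchand@example.com'))
cur.execute("UPSERT INTO useraccount (md5, email) VALUES (?, ?)",
            ('9e60da6d4490cd6d102e8010ac98f283', 'new.address@example.com'))

# DELETE follows normal SQL semantics.
cur.execute("DELETE FROM useraccount WHERE md5 = ?",
            ('9e60da6d4490cd6d102e8010ac98f283',))
conn.close()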
10-30-2016
04:15 PM
9 Kudos
INGESTING RDBMS DATA

I previously posted an article on ingesting and converting data (https://community.hortonworks.com/articles/64069/converting-a-large-json-file-into-csv.html). Once you have a SQL database loaded, you will eventually need to store your data in your one unified data lake, and that is quite simple with NiFi. If you have a specialized tool that reads from your RDBMS logs and sends the changes to Kafka or JMS, that would be easy to ingest as well. For those wishing to stay open source, NiFi works great.

If you don't have a good increasing key to use, you can add an artificial one that increases on every insert. Almost every database supports this, from MariaDB to Oracle:

ALTER TABLE `useraccount` ADD COLUMN `id` INT AUTO_INCREMENT UNIQUE FIRST;

For mine, I just added an auto-increment id column to be my trigger. For Apache NiFi, you will need connections to all your sources and sinks, so I need a DB Connection Pool for Apache Phoenix and for MySQL (DBCPConnectionPool), as well as one for Hive (HiveConnectionPool).

Tools Required:
RDBMS (I am using MySQL), HDF 2.0 (NiFi 1.0.0+), HDP 2.4+ (I am using HDP 2.5) with HBase and Phoenix enabled and running, plus HDFS, YARN and Hive running. Optional: Apache Zeppelin for quick data analysis and validation.

To build a SQL database, I needed a source of interesting and plentiful data, so I used the excellent free API at https://api.randomuser.me/. It's easy to get this URL to return 5,000 formatted JSON results via the extra parameters: ?results=5000&format=pretty. The API returns JSON in this format, which requires some basic transformation (easily done in NiFi; a sketch follows the example below):

{"results":[
{"gender":"male",
"name":{"title":"monsieur","first":"lohan","last":"marchand"},
"location":{"street":"6684 rue jean-baldassini","city":"auboranges","state":"schwyz","postcode":9591},
"email":"lohan.marchand@example.com",
"login":{"username":"biggoose202","password":"esther","salt":"QIU1HBsr","md5":"9e60da6d4490cd6d102e8010ac98f283","sha1":"3de3ea419da1afe5c83518f8b46f157895266d17","sha256":"c6750c1a5bd18cac01c63d9e58a57d75520861733666ddb7ea6e767a7460479b"},
"dob":"1965-01-28 03:56:58",
"registered":"2014-07-26 11:06:46",
"phone":"(849)-890-5523",
"cell":"(395)-127-9369",
"id":{"name":"AVS","value":"756.OUVK.GFAB.51"},
"picture":{"large":"https://randomuser.me/api/portraits/men/69.jpg","medium":"https://randomuser.me/api/portraits/med/men/69.jpg","thumbnail":"https://randomuser.me/api/portraits/thumb/men/69.jpg"},"nat":"CH"}]
Then I created a MySQL table to populate with the JSON data:

drop table useraccount;
create table useraccount(
gender varchar(200),
title varchar(200),
first varchar(200),
last varchar(200),
street varchar(200),
city varchar(200),
state varchar(200),
postcode varchar(200),
email varchar(200),
username varchar(200),
password varchar(200),
salt varchar(200),
md5 varchar(200),
sha1 varchar(200),
sha256 varchar(200),
dob varchar(200),
registered varchar(200),
phone varchar(200),
cell varchar(200),
name varchar(200),
value varchar(200),
large varchar(200),
medium varchar(200),
thumbnail varchar(200),
nat varchar(200));

I created a Phoenix table on top of HBase to hold the data:

create table useraccount(
gender varchar,
title varchar,
firstname varchar,
lastname varchar,
street varchar,
city varchar,
state varchar,
postcode varchar,
email varchar,
username varchar,
password varchar,
salt varchar,
md5 varchar not null primary key,
sha1 varchar,
sha256 varchar,
dob varchar,
registered varchar,
phone varchar,
cell varchar,
name varchar,
value2 varchar,
large varchar,
medium varchar,
thumbnail varchar,
nat varchar);

Step 1: QueryDatabaseTable. Reads from the MySQL table. This processor just needs the MySQL connection, the table name (useraccount), and the maximum-value column (id). We have two forks from this query.

Fork 1

Step 2: ConvertAvroToJSON. Use the array container option. You will get arrays of JSON records that look like this:

{
"id" : 656949,
"gender" : "female",
"title" : "madame",
"first" : "amandine",
"last" : "sanchez",
"street" : "8604 place paul-duquaire",
"city" : "savigny",
"state" : "genève",
"postcode" : "5909",
"email" : "amandine.sanchez@example.com",
"username" : "ticklishmeercat183",
"password" : "hillary",
"salt" : "Sgq7HHP1",
"md5" : "d82d6c3524f3a1118399113e6c43ed31",
"sha1" : "23ce2b372f94d39fb949d95e81e82bece1e06a4a",
"sha256" : "49d7e92a2815df1d5fd991ce9ebbbcdffee4e0e7fe398bc32f0331894cae1154",
"dob" : "1983-05-22 15:16:49",
"registered" : "2011-02-06 22:03:37",
"phone" : "(518)-683-8709",
"cell" : "(816)-306-5232",
"name" : "AVS",
"value" : "756.IYWK.GJBH.35",
"large" : "https://randomuser.me/api/portraits/women/50.jpg",
"medium" : "https://randomuser.me/api/portraits/med/women/50.jpg",
"thumbnail" : "https://randomuser.me/api/portraits/thumb/women/50.jpg",
"nat" : "CH"
}

Step 3: SplitJSON. Use $.* to split the arrays into individual JSON records.

Step 4: EvaluateJSONPath. Pull out each attribute you want and name it, for example cell for $.cell. See the JSONPath guide and testing tool at http://jsonpath.com/.

Step 5: ReplaceText. Here we format the SQL upsert from the attributes we just parsed out of the JSON:

upsert into useraccount (gender,title,firstname,lastname,street,city,state,postcode,email,
username,password,salt,md5,sha1,sha256,dob,registered,phone,cell,name,value2,large,medium,thumbnail,nat)
values ('${'gender'}','${'title'}','${'first'}','${'last'}','${'street'}','${'city'}','${'state'}','${'postcode'}',
'${'email'}','${'username'}','${'password'}','${'salt'}','${'md5'}','${'sha1'}','${'sha256'}','${'dob'}',
'${'registered'}','${'phone'}','${'cell'}','${'name'}','${'value'}','${'large'}','${'medium'}','${'thumbnail'}','${'nat'}' )
Step 6: PutSQL. With an example Batch Size of 100, we connect to our Phoenix DB connection pool.

Fork 2

Step 2: UpdateAttribute. We set orc.table to useraccount.

Step 3: ConvertAvroToORC. We set the Hive configuration file (/etc/hive/conf/hive-site.xml), a 64 MB stripe size, and, importantly, the Hive Table Name to ${orc.table}.

Step 4: PutHDFS. Set your configuration file (/etc/hadoop/conf/core-site.xml) and a directory you have write access to for storing the ORC files.

Step 5: ReplaceText. Search Value: (?s:^.*$) Replacement Value: ${hive.ddl} LOCATION '${absolute.hdfs.path}' with Replacement Strategy set to Always Replace and Evaluation Mode set to Entire text.

Step 6: PutHiveQL. You need to connect to your Hive connection pool. You will see the resulting ORC files in your HDFS directory (a sketch of the statement Step 6 executes follows the listing):
[root@tspanndev12 demo]# hdfs dfs -ls /orcdata
Found 2 items
-rw-r--r-- 3 root hdfs 246806 2016-10-29 01:24 /orcdata/2795061363634412.orc
-rw-r--r-- 3 root hdfs 246829 2016-10-29 17:25 /orcdata/2852682816977877.orc
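To make Fork 2's Steps 5 and 6 concrete: ConvertAvroToORC sets a hive.ddl attribute containing a CREATE EXTERNAL TABLE ... STORED AS ORC statement, and the ReplaceText step appends the HDFS location so PutHiveQL executes one complete DDL statement. A rough Python illustration, with example attribute values and an abbreviated column list:

# Illustration only: roughly the substitution NiFi's expression language performs.
hive_ddl = ("CREATE EXTERNAL TABLE IF NOT EXISTS useraccount "
            "(gender STRING, title STRING /* ... remaining columns ... */) "
            "STORED AS ORC")                     # from the hive.ddl attribute
absolute_hdfs_path = "/orcdata"                  # where PutHDFS wrote the files
statement = "{} LOCATION '{}'".format(hive_ddl, absolute_hdfs_path)
print(statement)  # this is the statement PutHiveQL runs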
After my first few batches of data are ingested, I check them in Apache Zeppelin. Looks good. The data has also been loaded into Apache Hive.
10-28-2016
08:45 PM
4 Kudos
Apache NiFi 1.0.0 has a lot of cool features, but no JSON-to-CSV converter yet. For my use case, I wanted to split one JSON file that had 5,000 records into 5,000 CSVs. You could also throw a MergeContent processor in there and make one file. My data came from the excellent test data source, RandomUser. They provide a free API to pull down data: https://api.randomuser.me/?results=5000&format=pretty. I chose pretty to get formatted JSON with multiple lines, which is easier to parse. This is an example of the JSON returned from that service:

{"results":[
{"gender":"male",
"name":{"title":"monsieur","first":"lohan","last":"marchand"},
"location":{"street":"6684 rue jean-baldassini","city":"auboranges","state":"schwyz","postcode":9591},
"email":"lohan.marchand@example.com",
"login":{"username":"biggoose202","password":"esther","salt":"QIU1HBsr","md5":"9e60da6d4490cd6d102e8010ac98f283","sha1":"3de3ea419da1afe5c83518f8b46f157895266d17","sha256":"c6750c1a5bd18cac01c63d9e58a57d75520861733666ddb7ea6e767a7460479b"},
"dob":"1965-01-28 03:56:58",
"registered":"2014-07-26 11:06:46",
"phone":"(849)-890-5523",
"cell":"(395)-127-9369",
"id":{"name":"AVS","value":"756.OUVK.GFAB.51"},
"picture":{"large":"https://randomuser.me/api/portraits/men/69.jpg","medium":"https://randomuser.me/api/portraits/med/men/69.jpg","thumbnail":"https://randomuser.me/api/portraits/thumb/men/69.jpg"},
"nat":"CH"}
]}
Step 1: GetFile. Read the JSON file from a directory (this can process any JSON file dropped in there).
Step 2: SplitJSON. Split on $.results to split the array of JSON records.
Step 3: EvaluateJSONPath. Pull out the attributes you want from each JSON record. Example: $.cell.
Step 4: UpdateAttribute. Update the filename to be unique: ${md5}.csv.
Step 5: ReplaceText. Format the JSON attributes into a line of comma-separated values.
Step 6: PutFile. Store the resulting file in a directory. (This could also be PutHDFS or many other sink processors.)
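For readers who want to see the transformation end to end, here is a rough Python equivalent of the flow above; the input path and the choice of columns are illustrative:

import csv
import json

# Read the randomuser.me JSON, split $.results, and write one CSV per record,
# named by the record's md5 (mirroring the UpdateAttribute step above).
with open("randomusers.json") as f:  # placeholder input path
    results = json.load(f)["results"]

for person in results:
    login = person["login"]
    row = [person["gender"], person["name"]["first"], person["name"]["last"],
           person["email"], person["cell"], login["username"]]
    with open("{}.csv".format(login["md5"]), "w") as out:
        csv.writer(out).writerow(row)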
10-24-2016
10:13 PM
1 Kudo
SysDig

Sysdig (GitHub) is an open source tool for the exploration, analysis and troubleshooting of Linux systems and containers. It is well documented and very easy to install and use. It can be used for container and Linux system diagnostics, security analysis, monitoring and basic system information capture. Remember that sysdig can produce thousands of lines of messages and, depending on the options selected, can continue doing so forever. Check out the examples and read through all the options; you can monitor a ton of data really fast and also check for security anomalies.

NiFi Ingesting SysDig

Sysdig can produce amazing amounts of logs. I chose to ingest 1-second chunks as ASCII JSON, using the options listed below. The results are arrays of JSON. I decided it's best to save them as large JSON files for now and convert them to ORC later for Hive analysis in Zeppelin. You could also split them into individual JSON rows and process those (see the sketch after the table below). I also save them to Apache Phoenix for fast queries.

ExecuteProcess Command

sysdig -A -j -M 1 --unbuffered

I just wrap that in a shell script for neatness.

HDF 2.0 / NiFi 1.0.0 Flow

Event JSON from SysDig

{"evt.cpu":6,"evt.dir":">","evt.info":"fd=7(<f>/usr/lib64/python2.7/lib-dynload/_elementtree.so) ","evt.num":111138,"evt.outputtime":1477313882635597873,"evt.type":"fstat",
"proc.name":"python","thread.tid":14602}
Apache Phoenix Table

CREATE TABLE sysdigevents
(
evtcpu varchar,
evtdir varchar,
evtinfo varchar,
evtoutputtime varchar,
evttype varchar,
procname varchar,
threadtid varchar,
evtnum varchar not null primary key
);
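As an illustration of the split-into-rows option, a minimal Python sketch that maps one captured event array onto that table's columns and builds the Phoenix UPSERT statements (the capture file name is a placeholder, and the statements are just printed; hand them to the Phoenix client of your choice):

import json

# One 1-second sysdig capture is an array of event objects like the example above.
with open("sysdig-capture.json") as f:  # placeholder path
    events = json.load(f)

COLUMNS = ["evtcpu", "evtdir", "evtinfo", "evtoutputtime",
           "evttype", "procname", "threadtid", "evtnum"]
KEYS = ["evt.cpu", "evt.dir", "evt.info", "evt.outputtime",
        "evt.type", "proc.name", "thread.tid", "evt.num"]

for event in events:
    values = [str(event.get(k, "")).replace("'", "''") for k in KEYS]
    sql = "UPSERT INTO sysdigevents ({}) VALUES ({})".format(
        ",".join(COLUMNS), ",".join("'{}'".format(v) for v in values))
    print(sql)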
Links

Sysdig Examples
Sysdig Cheatsheet mapping to legacy tools
Monitor Linux Server with sysdig

NiFi Flow: sysdig.xml
10-14-2016
07:16 PM
2 Kudos
Sometimes you have Java messages that you would like to easily ingest into HDFS as raw files, or into Phoenix, Hive and other destinations. You can do that pretty easily with Apache NiFi 1.0.0 as part of HDF 2.0. For this simple example, I also added a REST gateway for bulk loading, testing, and to provide another way to easily send JMS messages. ListenHTTP accepts HTTP POSTs on port 8099, which I made the listener port for that processor. It takes what you send and publishes it to a JMS queue. I am using ActiveMQ. I have a little Python 2.7 script that I found on GitHub that makes fake log records; I modified it to send 1,000 JSON messages via REST to our REST-to-JMS gateway in NiFi for testing (a fuller sketch follows below). You can easily do this with a shell script and curl, Apache JMeter, Java code, a Go script, and many other open source REST testers and clients.

url = 'http://server.com:8099/contentListener'
r = requests.post(url, json={"rtimestamp": timestamp, "ip": random_ip(), "country": country, "status": status})

I installed an ActiveMQ JMS broker as my example JMS server, which is very simple on CentOS 7. All you need to do is download the gzipped tar and untar it; after a chmod it's ready to run. That download also includes the client JAR that we will need on the HDF 2.0 server for accessing the message queue server. You must also have the port open; on ActiveMQ that defaults to 61616. ActiveMQ also includes a nice web console, and you may want to unblock that port as well for viewing the status of queues and messages. In my simple example, I am running JMS via: bin/activemq start > /tmp/smlog 2>&1 &

I recommend changing your HTTP Listening Port, so you can run a bunch of these processors as needed. Processors used: ConsumeJMS, MergeContent and PutHDFS. You need to set Destination Name, which is the name of the queue in this case, but could also be the name of a topic. I picked a Destination Type of QUEUE since I am using a queue in Apache ActiveMQ. It's very easy to add more output processors for sinking data into Apache Phoenix, HBase, Hive, email, Slack and other NoSQL stores. It's also easy to convert messages into Avro, ORC and other optimized big data file formats. As you can see, we get a number of jms_ attributes, including priority, message ID and other attributes associated with the JMS message.

Example Message
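Filling out the fragment above, the bulk tester might look roughly like this; it's a sketch, where random_ip and the value lists stand in for the modified GitHub script's helpers:

import random
import time

import requests

GATEWAY = 'http://server.com:8099/contentListener'  # the ListenHTTP port above
COUNTRIES = ['US', 'CH', 'FR', 'DE']                # hypothetical sample values
STATUSES = [200, 301, 404, 500]

def random_ip():
    # Hypothetical helper: build a random dotted-quad address.
    return '.'.join(str(random.randint(1, 254)) for _ in range(4))

# Send 1,000 fake JSON log records through the REST-to-JMS gateway.
for _ in range(1000):
    requests.post(GATEWAY, json={'rtimestamp': int(time.time()),
                                 'ip': random_ip(),
                                 'country': random.choice(COUNTRIES),
                                 'status': random.choice(STATUSES)})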
ActiveMQ Screens

References:
https://community.hortonworks.com/articles/59349/hdf-20-flow-for-ingesting-real-time-tweets-from-st.html
https://community.hortonworks.com/articles/59975/ingesting-edi-into-hdfs-using-hdf-20.html
http://activemq.apache.org/uri-protocols.html
http://activemq.apache.org/initial-configuration.html
http://activemq.apache.org/version-5-getting-started.html
http://www.apache.org/dyn/closer.cgi?filename=/activemq/5.14.1/apache-activemq-5.14.1-bin.tar.gz&action=download
10-12-2016
03:33 PM
3 Kudos
Often lines of business, individual users or shared teams will use online Google Sheets to share spreadsheet and tabular data among teams or with outside vendors. It's quick and easy to add sheets and store your data in Google Drive as spreadsheets. Often you will want to consolidate, federate, analyze, enrich and use this data for reporting and dashboards throughout your organization. An easy way to do that is to read in the data using Google's Sheets API. This is a standard SSL HTTP REST API that returns clean JSON data. I created a simple Google Sheet to test ingesting a Google Sheet with HDF. You will need to enable the Google Sheets API in the Google APIs Console. You must be logged into Google and have a Google account (use the one where you created your Google spreadsheets).

Google Documentation

Google provides a few quickstarts that you can use to ingest this data: https://developers.google.com/sheets/quickstart/js or https://developers.google.com/sheets/quickstart/python. I chose to ingest this data the easiest way, with a simple REST call from NiFi.

Testing Your Queries in Google's API Explorer

To test your queries and get your exact URL, go to Google's API Explorer: https://developers.google.com/apis-explorer/#p/sheets/v4/

GET https://sheets.googleapis.com/v4/spreadsheets/1sbMyDocID?includeGridData=true&key=MYKEYISFROMGOOGLE

Here 1sb… is the document ID that comes from the name you see in your Google Sheet page, like so: https://docs.google.com/spreadsheets/d/1UMyDocumentId/edit#g

Calling the API From HDF 2.0

The one thing you will need is to set up a StandardSSLContextService to read in HTTPS data. You will need to grab the truststore file cacerts for the JRE that NiFi is using to run. By default the Truststore Password is changeit. (You really should change it.) Once you have an SSL configuration set up, you can use a GetHTTP processor. Add in the Sheets Google API URL that includes the Sheet ID. I also set the User Agent, the Accept Content-Type, and Follow Redirects = true. Now that we have SSL enabled, we can make our call to Google. The flow below is pretty simple.

Now that I have ingested the Google Sheet, I can store it as JSON in my data lake. You could process this in HDF many ways, including taking out fields, enriching with other data sources, converting to Avro or ORC, or storing in a Hive table, Phoenix or HBase. You have now ingested Google Sheet data; determining what you want to do with it and parsing out the JSON is a fun exercise. You can use an EvaluateJsonPath processor in Apache NiFi to pull out the fields you want: inside that processor you add a field and then a value, such as $.entities.media[0].media_url, that runs as a JSONPath expression. (A quick way to test the same call outside NiFi is sketched after the references.)

HDF 2.0 Diagram Overview

References:
https://community.hortonworks.com/articles/59349/hdf-20-flow-for-ingesting-real-time-tweets-from-st.html
http://jsonpath.com/
https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.EvaluateJsonPath/
https://community.hortonworks.com/questions/21011/how-i-extract-attribute-from-json-file-using-nifi.html
https://jsonpath.curiousconcept.com/
https://developers.google.com/sheets/guides/authorizing
https://codelabs.developers.google.com/codelabs/sheets-api/#0
https://developers.google.com/sheets/samples/
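As promised, a minimal Python sketch of the same REST call outside NiFi; the sheet ID and API key are placeholders, and the response parsing reflects the v4 includeGridData layout, so adjust as needed:

import requests

SHEET_ID = '1sbMyDocID'        # placeholder: your document ID
API_KEY = 'MYKEYISFROMGOOGLE'  # placeholder: your API key

url = ('https://sheets.googleapis.com/v4/spreadsheets/{}'
       '?includeGridData=true&key={}').format(SHEET_ID, API_KEY)
sheet = requests.get(url).json()

# Walk the first sheet's grid and print each row's formatted cell values.
for row in sheet['sheets'][0]['data'][0].get('rowData', []):
    print([cell.get('formattedValue') for cell in row.get('values', [])])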
10-11-2016
09:55 PM
The default JKS/TLS truststore password is changeit.
10-11-2016
08:41 PM
TensorFlow 0.11 is out:

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.11.0rc0-cp27-none-linux_x86_64.whl
pip install --upgrade $TF_BINARY_URL
10-11-2016
06:25 PM
Do you have an example Zeppelin notebook to share?
10-07-2016
03:39 PM
3 Kudos
With NiFi 1.0.0 I ingested a lot of image data from drones, mostly to get metadata like geolocation. I also ingested a resized version of each image in case I wanted to use it, and I found a use for it. I am pulling the data out very simply with Spring for a simple HTML page: I wrote a quick Java program to pull out the fields I stored in Phoenix (from the metadata), and I wanted to display the image alongside them. I could have streamed the image out of HDFS using the HDFS libraries to read the file, but there is an easier way (see below).

String sql = "select datekey, fileName, gPSAltitude, gPSLatitude, gPSLongitude, orientation, geolat, geolong, inception from dronedata1 order by datekey asc";
// `connection` is a JDBC connection to Phoenix; `out` is the page writer
// (for example a StringBuilder or servlet writer); STATIC_HEADER holds the
// HTML boilerplate emitted before the table rows.
out.append(STATIC_HEADER);
PreparedStatement ps = connection.prepareStatement(sql);
ResultSet res = ps.executeQuery();
while (res.next()) {
try {
out.append("<br><br>\n<table width=100%><tr><td valign=top><img src=\"");
out.append("http://tspanndev10.field.hortonworks.com:50070/webhdfs/v1/drone/").
append(res.getString("fileName")).append("?op=OPEN\"></td>");
out.append("<td valign=top>Date: ").append(res.getString("datekey"));
out.append("\n<br>Lat: ").append(res.getString("geolat"));
out.append("\n<br>Long: ").append(res.getString("geolong"));
out.append("\n<br><br>\n</td></tr></table>\n");
} catch (Exception e) {
e.printStackTrace();
}
}
It was a lot easier to use the built-in WebHDFS REST API to display an image: wrapping the WebHDFS call to the image file in an HTML IMG SRC tag loads our image.

http://node1:50070/webhdfs/v1/drone/Bebop2_20160920083655-0400.jpg?op=OPEN

It's pretty simple, and you can use this with a MEAN application, Python Flask or your non-JVM front-end of choice (a Flask sketch follows below). And now you have a solid distributed host for your images. I recommend this only for internal sites and public images; having this data publicly available on the cloud is dangerous!
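Since Flask is mentioned as an option, here is a minimal sketch of the same idea outside the JVM; the host, file and date values follow the Java example above, and the Phoenix query is replaced by a hard-coded list to keep the sketch self-contained:

from flask import Flask

app = Flask(__name__)
WEBHDFS = "http://node1:50070/webhdfs/v1/drone/{}?op=OPEN"

# Stand-in for the Phoenix query in the Java example: (fileName, datekey) rows.
ROWS = [("Bebop2_20160920083655-0400.jpg", "20160920")]

@app.route("/")
def gallery():
    # Let WebHDFS serve each image straight out of HDFS via an <img> tag.
    return "".join(
        '<p><img src="{}"><br>Date: {}</p>'.format(WEBHDFS.format(f), d)
        for f, d in ROWS)

if __name__ == "__main__":
    app.run()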