Posts: 1973
Kudos Received: 1225
Solutions: 124

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1999 | 04-03-2024 06:39 AM |
|  | 3172 | 01-12-2024 08:19 AM |
|  | 1726 | 12-07-2023 01:49 PM |
|  | 2505 | 08-02-2023 07:30 AM |
|  | 3517 | 03-29-2023 01:22 PM |
11-01-2016
01:22 PM
2 Kudos
For C# and .NET: https://github.com/hdinsight/hbase-sdk-for-net and https://hbasenet.codeplex.com/. For C# Spark: https://github.com/Microsoft/Mobius
10-31-2016
04:57 PM
See: https://community.hortonworks.com/articles/64069/converting-a-large-json-file-into-csv.html
10-30-2016
04:15 PM
9 Kudos
INGESTING RDBMS DATA

I previously posted an article on ingesting and converting data (https://community.hortonworks.com/articles/64069/converting-a-large-json-file-into-csv.html). Once you have a SQL database loaded, you will eventually need to store your data in your unified data lake. This is quite simple with NiFi. If you have a specialized tool that reads from your RDBMS logs and sends them to Kafka or JMS, that would be easy to ingest as well. For those wishing to stay open source, NiFi works great.

If you don't have a good increasing key to use, you can add an artificial one that increases on every insert. Almost every database supports this, from MariaDB to Oracle.

ALTER TABLE `useraccount` ADD COLUMN `id` INT AUTO_INCREMENT UNIQUE FIRST;

For mine, I just added an auto-increment id column to act as my trigger. For Apache NiFi, you will need connections to all your sources and sinks, so I need a DB Connection Pool for Apache Phoenix and MySQL (DBCPConnectionPool) as well as Hive (HiveConnectionPool).

Tools Required:
- RDBMS (I am using MySQL)
- HDF 2.0 (NiFi 1.0.0+)
- HDP 2.4+ (I am using HDP 2.5) with HBase and Phoenix enabled and running, and HDFS, YARN, and Hive running
- Optional: Apache Zeppelin for quick data analysis and validation

To build a SQL database, I needed a source of interesting and plentiful data, so I used the excellent free API: https://api.randomuser.me/. It's easy to get this URL to return 5,000 formatted JSON results via the extra parameters: ?results=5000&format=pretty (see the short fetch sketch after the sample JSON below). This API returns JSON in the following format, which requires some basic transformation (easily done in NiFi): {"results":[
{"gender":"male",
"name":{"title":"monsieur","first":"lohan","last":"marchand"},
"location":{"street":"6684 rue jean-baldassini","city":"auboranges","state":"schwyz","postcode":9591},
"email":"lohan.marchand@example.com",
"login":{"username":"biggoose202","password":"esther","salt":"QIU1HBsr","md5":"9e60da6d4490cd6d102e8010ac98f283","sha1":"3de3ea419da1afe5c83518f8b46f157895266d17","sha256":"c6750c1a5bd18cac01c63d9e58a57d75520861733666ddb7ea6e767a7460479b"},
"dob":"1965-01-28 03:56:58",
"registered":"2014-07-26 11:06:46",
"phone":"(849)-890-5523",
"cell":"(395)-127-9369",
"id":{"name":"AVS","value":"756.OUVK.GFAB.51"},
"picture":{"large":"https://randomuser.me/api/portraits/men/69.jpg","medium":"https://randomuser.me/api/portraits/med/men/69.jpg","thumbnail":"https://randomuser.me/api/portraits/thumb/men/69.jpg"},"nat":"CH"}]
Then I created a MySQL table to populate with the JSON data:

drop table useraccount;

create table useraccount(
gender varchar(200),
title varchar(200),
first varchar(200),
last varchar(200),
street varchar(200),
city varchar(200),
state varchar(200),
postcode varchar(200),
email varchar(200),
username varchar(200),
password varchar(200),
salt varchar(200),
md5 varchar(200),
sha1 varchar(200),
sha256 varchar(200),
dob varchar(200),
registered varchar(200),
phone varchar(200),
cell varchar(200),
name varchar(200),
value varchar(200),
large varchar(200),
medium varchar(200),
thumbnail varchar(200),
nat varchar(200));

I created a Phoenix table on top of HBase to hold the data:

create table useraccount(
gender varchar,
title varchar,
firstname varchar,
lastname varchar,
street varchar,
city varchar,
state varchar,
postcode varchar,
email varchar,
username varchar,
password varchar,
salt varchar,
md5 varchar not null primary key,
sha1 varchar,
sha256 varchar,
dob varchar,
registered varchar,
phone varchar,
cell varchar,
name varchar,
value2 varchar,
large varchar,
medium varchar,
thumbnail varchar,
nat varchar);

Step 1: QueryDatabaseTable. Reads from the MySQL table. This processor just needs the MySQL Connection, the table name (useraccount), and the column (id). We have two forks from this query table.

Fork 1

Step 2: ConvertAvroToJSON. Use Array as the JSON container option. You will get JSON records that look like this:

{
"id" : 656949,
"gender" : "female",
"title" : "madame",
"first" : "amandine",
"last" : "sanchez",
"street" : "8604 place paul-duquaire",
"city" : "savigny",
"state" : "genève",
"postcode" : "5909",
"email" : "amandine.sanchez@example.com",
"username" : "ticklishmeercat183",
"password" : "hillary",
"salt" : "Sgq7HHP1",
"md5" : "d82d6c3524f3a1118399113e6c43ed31",
"sha1" : "23ce2b372f94d39fb949d95e81e82bece1e06a4a",
"sha256" : "49d7e92a2815df1d5fd991ce9ebbbcdffee4e0e7fe398bc32f0331894cae1154",
"dob" : "1983-05-22 15:16:49",
"registered" : "2011-02-06 22:03:37",
"phone" : "(518)-683-8709",
"cell" : "(816)-306-5232",
"name" : "AVS",
"value" : "756.IYWK.GJBH.35",
"large" : "https://randomuser.me/api/portraits/women/50.jpg",
"medium" : "https://randomuser.me/api/portraits/med/women/50.jpg",
"thumbnail" : "https://randomuser.me/api/portraits/thumb/women/50.jpg",
"nat" : "CH"
}

Step 3: SplitJSON. Use $.* to split the arrays into individual JSON records.

Step 4: EvaluateJSONPath. Pull out each attribute you want and name it, for example cell for $.cell. See the guide to JSONPath with a testing tool here.

Step 5: ReplaceText. Here we format the upsert SQL from the attributes we just parsed out of the JSON (a standalone Python sketch of the same mapping follows the template below):

upsert into useraccount (gender,title,firstname,lastname,street,city,state,postcode,email,
username,password,salt,md5,sha1,sha256,dob,registered,phone,cell,name,value2,large,medium,thumbnail,nat)
values ('${'gender'}','${'title'}','${'first'}','${'last'}','${'street'}','${'city'}','${'state'}','${'postcode'}',
'${'email'}','${'username'}','${'password'}','${'salt'}','${'md5'}','${'sha1'}','${'sha256'}','${'dob'}',
'${'registered'}','${'phone'}','${'cell'}','${'name'}','${'value'}','${'large'}','${'medium'}','${'thumbnail'}','${'nat'}' )
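For readers who want to see Steps 3 through 5 outside of NiFi, here is a minimal Python sketch of the same record flattening and upsert rendering. This is my own illustration, assuming the API response was saved as randomusers.json; in the real flow, EvaluateJSONPath and ReplaceText do this work with the expression language:

```python
import json

def record_to_upsert(rec):
    """Flatten one RandomUser record and render a Phoenix upsert statement."""
    flat = {
        "gender": rec["gender"],
        "title": rec["name"]["title"],
        "firstname": rec["name"]["first"],
        "lastname": rec["name"]["last"],
        "email": rec["email"],
        "md5": rec["login"]["md5"],
        "name": rec["id"]["name"],
        "value2": rec["id"]["value"],
        "nat": rec["nat"],
        # ... the remaining columns follow the same pattern and are omitted here.
    }
    cols = ",".join(flat)
    vals = ",".join("'" + str(v).replace("'", "''") + "'" for v in flat.values())
    return "upsert into useraccount (" + cols + ") values (" + vals + ")"

with open("randomusers.json") as f:
    for rec in json.load(f)["results"]:
        print(record_to_upsert(rec))
```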
Step 6: PutSQL. With an example Batch Size of 100, we connect to our Phoenix DB Connection Pool.

Fork 2

Step 2: UpdateAttribute. We set orc.table to useraccount.

Step 3: ConvertAvroToORC. We set our Hive configuration file (/etc/hive/conf/hive-site.xml), a 64 MB stripe size, and, importantly, the Hive Table Name to ${orc.table}.

Step 4: PutHDFS. Set your configuration file (/etc/hadoop/conf/core-site.xml) and a directory you have write access to for storing the ORC files.

Step 5: ReplaceText. Search Value: (?s:^.*$) Replacement Value: ${hive.ddl} LOCATION '${absolute.hdfs.path}'. Set the replacement strategy to Always Replace and evaluate against the entire text.

Step 6: PutHiveQL. You need to connect to your Hive Connection. You will see the resulting ORC files in your HDFS directory:
[root@tspanndev12 demo]# hdfs dfs -ls /orcdata
Found 2 items
-rw-r--r-- 3 root hdfs 246806 2016-10-29 01:24 /orcdata/2795061363634412.orc
-rw-r--r-- 3 root hdfs 246829 2016-10-29 17:25 /orcdata/2852682816977877.orc
After my first few batches of data are ingested, I check them in Apache Zeppelin. Looks good. The data has also been loaded into Apache Hive.
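For a quick sanity check outside Zeppelin, a short Python sketch like the following could count the ingested rows. This is my own illustration and assumes the Phoenix Query Server is reachable on its default port (8765) and the phoenixdb package is installed:

```python
import phoenixdb

# Connect through the Phoenix Query Server (adjust the host to your cluster).
conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()

# Count the rows ingested by the NiFi flow.
cursor.execute("SELECT COUNT(*) FROM useraccount")
print("rows in useraccount:", cursor.fetchone()[0])
conn.close()
```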
10-29-2016
04:17 PM
Can you post the logs? They should have more details. Are SOLR and Banana installed and running? Can you check those? What component is that error on? Do you have the Twitter setup? Did you create a Twitter app with Read/Write/Direct Message permissions and grab those four values?
10-28-2016
09:43 PM
I have 60,000 flow files queued for a PutSQL processor, and I know the SQL is invalid; it is failing in chunks, waiting 1 second, and moving on. I was able to empty out all the backed-up queues, but this processor is slowly working through these failed files. I want to purge them; how can I do this? It's a waste to wait when I know all of these will fail.
Labels: Apache NiFi
10-28-2016
08:45 PM
4 Kudos
Apache NiFi 1.0.0 has a lot of cool features, but no JSON-to-CSV converter yet. For my use case, I wanted to split one JSON file that had 5,000 records into 5,000 CSVs. You could also throw a MergeContent processor in there and make one file. My data came from the excellent test data source, RandomUser. They provide a free API to pull down data: https://api.randomuser.me/?results=5000&format=pretty. I chose pretty to get formatted JSON with multiple lines, which is easier to parse. This is an example of the JSON returned from that service: {"results":[
{"gender":"male",
"name":{"title":"monsieur","first":"lohan","last":"marchand"},
"location":{"street":"6684 rue jean-baldassini","city":"auboranges","state":"schwyz","postcode":9591},
"email":"lohan.marchand@example.com",
"login":{"username":"biggoose202","password":"esther","salt":"QIU1HBsr","md5":"9e60da6d4490cd6d102e8010ac98f283","sha1":"3de3ea419da1afe5c83518f8b46f157895266d17","sha256":"c6750c1a5bd18cac01c63d9e58a57d75520861733666ddb7ea6e767a7460479b"},
"dob":"1965-01-28 03:56:58",
"registered":"2014-07-26 11:06:46",
"phone":"(849)-890-5523",
"cell":"(395)-127-9369",
"id":{"name":"AVS","value":"756.OUVK.GFAB.51"},
"picture":{"large":"https://randomuser.me/api/portraits/men/69.jpg","medium":"https://randomuser.me/api/portraits/med/men/69.jpg","thumbnail":"https://randomuser.me/api/portraits/thumb/men/69.jpg"},
"nat":"CH"}
]
}
Step 1: GetFile: Read the JSON file from a directory (I can process any JSON file in there).
Step 2: SplitJSON: Split on $.results to split the array of JSON records.
Step 3: EvaluateJSONPath: Pull out attributes from each JSON record, for example $.cell.
Step 4: UpdateAttribute: Update the filename to be unique: ${md5}.csv.
Step 5: ReplaceText: Format the JSON attributes into a line of comma-separated values.
Step 6: PutFile: Store the resulting file in a directory. (This could also be PutHDFS or many other sink processors.)
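For comparison, the same flattening can be sketched in a few lines of Python. This is my own illustration, not part of the NiFi flow; it assumes the downloaded file is named randomusers.json and writes one CSV per record keyed by the md5 field, mirroring Steps 2 through 6:

```python
import csv
import json

# Fields to pull out of each record, roughly matching the EvaluateJSONPath step.
FIELDS = ["gender", "email", "dob", "registered", "phone", "cell", "nat"]

with open("randomusers.json") as f:
    records = json.load(f)["results"]

for rec in records:
    # One CSV file per record, named by the record's md5 (like ${md5}.csv).
    out_name = rec["login"]["md5"] + ".csv"
    with open(out_name, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(FIELDS)
        writer.writerow([rec[field] for field in FIELDS])

print("wrote", len(records), "CSV files")
```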
10-28-2016
06:11 PM
2 Kudos
Use EvaluateJsonPath with the destination set to flowfile-attribute, for example:

totalNonHeap => $.systemDiagnostics.aggregateSnapshot.totalNonHeap
status => $.systemDiagnostics.aggregateSnapshot.status
country => $.country

Then ReplaceText to format them as CSV:

'${'status'}','${'country'}','${'totalNonHeap'}'

Then PutFile to write a file. For JSONPath references, see http://jsonpath.com/ and http://goessner.net/articles/JsonPath/
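As a plain-Python illustration of what those JSONPath expressions pull out (my own sketch; it assumes a system-diagnostics style JSON document saved as diagnostics.json, with a country field at the top level as in the example above):

```python
import json

with open("diagnostics.json") as f:
    doc = json.load(f)

# Equivalent of the three JSONPath expressions above.
total_non_heap = doc["systemDiagnostics"]["aggregateSnapshot"]["totalNonHeap"]
status = doc["systemDiagnostics"]["aggregateSnapshot"]["status"]
country = doc["country"]

# Equivalent of the ReplaceText CSV template.
print(f"'{status}','{country}','{total_non_heap}'")
```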
10-28-2016
05:14 PM
Could something be done with JoltTransformJSON? https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.JoltTransformJSON/index.html Or could you augment the one main flow with an UpdateAttribute that reads from a DistributedMapCache if it's a small lookup? https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.PutDistributedMapCache/index.html I like the idea of a MergeJSON processor, though.
10-28-2016
02:56 PM
2 Kudos
I would like to get a list of all provenance events (https://nifi.apache.org/docs/nifi-docs/rest-api/index.html). Ideally I would like to ingest them with NiFi and store them in HDFS. Is there an example of POST /provenance?
Labels: Apache NiFi