Posts: 1973
Kudos Received: 1225
Solutions: 124

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1914 | 04-03-2024 06:39 AM |
| | 3011 | 01-12-2024 08:19 AM |
| | 1643 | 12-07-2023 01:49 PM |
| | 2420 | 08-02-2023 07:30 AM |
| | 3361 | 03-29-2023 01:22 PM |
01-28-2018 04:16 PM
1 Kudo
If you have Hortonworks support, contact them. Otherwise, download the code, build it via Maven, then grab the HBase Delete NAR and copy it to your Apache NiFi server's lib directory (on each Apache NiFi server instance). This requires some prior Apache NiFi development experience.
01-28-2018 03:57 PM
2 Kudos
Using SiteToSiteProvenanceReportingTask to Send Provenance to Apache NiFi for Processing.
Eating our own provenance food! It's almost comically easy to do. You set up a reporting task on the server you are monitoring that sends the provenance data to your receiver. On the other server, you build a simple flow to ingest and process it. I stored it in HBase as JSON, since HBase is a good place to put a lot of data fast.
Send the Data
You need to create a SiteToSiteProvenanceReportingTask in Controller Settings - Reporting Tasks. It's pretty simple: set the destination NiFi server and a port name that you have already created.
Receive the Data and Process
An Individual JSON Record
Split the JSON into Records
$.[*]
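As a rough illustration of what SplitJson does with that JsonPath, here is a minimal Python sketch; the two-event payload is hypothetical, with field names borrowed from the scan output further down:

```python
import json

# Hypothetical payload: the reporting task sends provenance events as a JSON array.
payload = """[
  {"eventType": "ROUTE", "componentName": "RouteOnAttribute", "entitySize": 16},
  {"eventType": "SEND",  "componentName": "PutHBaseJSON",     "entitySize": 59}
]"""

# SplitJson with $.[*] emits one flowfile per top-level array element.
for event in json.loads(payload):
    print(json.dumps(event))
```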
Save to HBase (PutHBaseJSON)
First I have to create a table:

hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.6.2.0-205, r5210d2ed88d7e241646beab51e9ac147a973bdcc, Sat Aug 26 09:33:50 UTC 2017
hbase(main):001:0> create 'PROVENANCE', 'event'
0 row(s) in 2.9900 seconds
=> Hbase::Table - PROVENANCE
scan 'PROVENANCE'
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:previousAttributes, timestamp=1517159115042, value={"path":"./","filename":"humidity.583225-583284.log","s2s.address":"192.168.1.197:55032","s2s.host":"192.168.1.197","mime.type":"text/plain","uuid":"9006a1bb-d755-4272-b8d3-76e666c2a7c6","tailfile.original.path":"/opt/demo/logs/humidity.log"}
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:previousContentURI, timestamp=1517159115042, value=http://192.168.1.193:8080/nifi-api/provenance-events/61825/content/input
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:previousEntitySize, timestamp=1517159115042, value=59
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:processGroupId, timestamp=1517159115042, value=01611005-4e82-1491-ae5d-ca64f59491cb
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:processGroupName, timestamp=1517159115042, value=Process MiniFi Creator
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:timestamp, timestamp=1517159115042, value=2018-01-28T00:25:30.616Z
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:timestampMillis, timestamp=1517159115042, value=1517099130616
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:updatedAttributes, timestamp=1517159115042, value={"RouteOnAttribute.Route":"humidity"}
ffde140c-3053-4b9d-89c6-14b68025384d column=event:actorHostname, timestamp=1517159114898, value=192.168.1.193
ffde140c-3053-4b9d-89c6-14b68025384d column=event:application, timestamp=1517159114898, value=NiFi Flow
ffde140c-3053-4b9d-89c6-14b68025384d column=event:childIds, timestamp=1517159114898, value=[]
ffde140c-3053-4b9d-89c6-14b68025384d column=event:componentId, timestamp=1517159114898, value=3a25cda9-0161-1000-813c-631724a10585
ffde140c-3053-4b9d-89c6-14b68025384d column=event:componentName, timestamp=1517159114898, value=RouteOnAttribute
ffde140c-3053-4b9d-89c6-14b68025384d column=event:componentType, timestamp=1517159114898, value=RouteOnAttribute
ffde140c-3053-4b9d-89c6-14b68025384d column=event:contentURI, timestamp=1517159114898, value=http://192.168.1.193:8080/nifi-api/provenance-events/61701/content/output
ffde140c-3053-4b9d-89c6-14b68025384d column=event:durationMillis, timestamp=1517159114898, value=-1
ffde140c-3053-4b9d-89c6-14b68025384d column=event:entityId, timestamp=1517159114898, value=9b017666-7ce9-45c5-9d0a-2f81e56d6fa8
ffde140c-3053-4b9d-89c6-14b68025384d column=event:entitySize, timestamp=1517159114898, value=16
ffde140c-3053-4b9d-89c6-14b68025384d column=event:entityType, timestamp=1517159114898, value=org.apache.nifi.flowfile.FlowFile
ffde140c-3053-4b9d-89c6-14b68025384d column=event:eventOrdinal, timestamp=1517159114898, value=61701
ffde140c-3053-4b9d-89c6-14b68025384d column=event:eventType, timestamp=1517159114898, value=ROUTE
ffde140c-3053-4b9d-89c6-14b68025384d column=event:lineageStart, timestamp=1517159114898, value=1517084974341
ffde140c-3053-4b9d-89c6-14b68025384d column=event:parentIds, timestamp=1517159114898, value=[]
ffde140c-3053-4b9d-89c6-14b68025384d column=event:platform, timestamp=1517159114898, value=nifi
ffde140c-3053-4b9d-89c6-14b68025384d column=event:previousAttributes, timestamp=1517159114898, value={"path":"./","filename":"uv.164064-164080.log","s2s.address":"192.168.1.197:55032","s2s.host":"192.168.1.197","mime.type":"text/plain","uuid":"9b017666-7ce9-45c5-9d0a-2f81e56d6fa8","tailfile.original.path":"/opt/demo/logs/uv.log"}
ffde140c-3053-4b9d-89c6-14b68025384d column=event:previousContentURI, timestamp=1517159114898, value=http://192.168.1.193:8080/nifi-api/provenance-events/61701/content/input
ffde140c-3053-4b9d-89c6-14b68025384d column=event:previousEntitySize, timestamp=1517159114898, value=16
ffde140c-3053-4b9d-89c6-14b68025384d column=event:processGroupId, timestamp=1517159114898, value=01611005-4e82-1491-ae5d-ca64f59491cb
ffde140c-3053-4b9d-89c6-14b68025384d column=event:processGroupName, timestamp=1517159114898, value=Process MiniFi Creator
ffde140c-3053-4b9d-89c6-14b68025384d column=event:timestamp, timestamp=1517159114898, value=2018-01-28T00:25:30.607Z
ffde140c-3053-4b9d-89c6-14b68025384d column=event:timestampMillis, timestamp=1517159114898, value=1517099130607
ffde140c-3053-4b9d-89c6-14b68025384d column=event:updatedAttributes, timestamp=1517159114898, value={"RouteOnAttribute.Route":"uv"}
1830 row(s) in 11.7680 seconds
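If you want to read the stored events back outside the shell, here is a minimal Python sketch using the happybase client; it assumes an HBase Thrift server is running on the default port, which is not part of the flow above:

```python
import happybase

# Connect through the HBase Thrift gateway (assumed running on localhost:9090).
connection = happybase.Connection('localhost')
table = connection.table('PROVENANCE')

# Scan a handful of rows; row keys are the provenance event UUIDs.
for row_key, columns in table.scan(limit=5):
    event_type = columns.get(b'event:eventType', b'').decode()
    component = columns.get(b'event:componentName', b'').decode()
    print(row_key.decode(), event_type, component)
```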
Attachment: provenancereporting.xml

Learning to Use HBase: https://hortonworks.com/hadoop-tutorial/introduction-apache-hbase-concepts-apache-phoenix-new-backup-restore-utility-hbase/
01-25-2018 05:59 PM
Apache NiFi 1.5, with all its beauty, will be in HDF 3.1, which will be available before you know it.
01-24-2018 07:56 PM
If you do a PutHDFS after ConvertAvroToORC, you get a hive.ddl attribute (set by ConvertAvroToORC) and an absolute.hdfs.path attribute (set by PutHDFS) that can be used to create a Hive table. You can also build the statement with UpdateAttribute using: ${hive.ddl} LOCATION '${absolute.hdfs.path}'
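To make the composition concrete, here is a tiny Python sketch of what that Expression Language line evaluates to; both attribute values are hypothetical:

```python
# Hypothetical attribute values, as ConvertAvroToORC and PutHDFS might set them.
attrs = {
    "hive.ddl": "CREATE EXTERNAL TABLE IF NOT EXISTS sensors (ts BIGINT, temp DOUBLE) STORED AS ORC",
    "absolute.hdfs.path": "/data/sensors",
}

# Equivalent of the Expression Language: ${hive.ddl} LOCATION '${absolute.hdfs.path}'
ddl = f"{attrs['hive.ddl']} LOCATION '{attrs['absolute.hdfs.path']}'"
print(ddl)
```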
01-18-2018 12:43 AM
6 Kudos
I have just started working on updated Apache Tika and Apache OpenNLP processors for Apache NiFi 1.5, and while testing I found an interesting workflow I would like to share. I am using a few of my processors in this flow:
https://github.com/tspannhw/nifi-attributecleaner-processor <- just updated this
https://github.com/tspannhw/nifi-nlp-processor
https://github.com/tspannhw/nifi-extracttext-processor <- this needs a major version refresh
https://github.com/tspannhw/nifi-corenlp-processor

Here is the flow that I was working on:

Step 1 - Load some PDFs.
Step 2 - Use the built-in Apache Tika processor to extract metadata from the files.
Step 3 - Pull out the text using my Apache Tika processor.
Step 4 - Split this into individual lines.
Step 5 - Extract the text of each line into an attribute ((^.*$)) as a sentence.
Step 6 - Run NLP on that sentence to analyze for names and locations.
Step 7 - Run Stanford CoreNLP sentiment analysis on the sentence.
Step 8 - Run my attribute cleaner to turn those attributes into Avro-safe names.
Step 9 - Turn all the attributes into a JSON flowfile.
Step 10 - Infer an Avro schema (I only needed this once, then I'll remove it).
Step 11 - Set the name of the schema to be looked up from the Schema Registry.
Step 12 - Run QueryRecord to route POSITIVE, NEUTRAL, and NEGATIVE sentiment to different places. Example SQL: SELECT * FROM FLOWFILE WHERE sentiment = 'NEGATIVE' (thanks, Apache Calcite!). We also convert from JSON to Avro for sending to Kafka and for easy conversion to Apache ORC for Apache Hive usage.
Steps 13-15 - Send records to Kafka 1.0; some get merged and stored as a file, and some are made into Slack messages.
Step 16 - Done.

You can see the queries in the QueryRecord processor, and the results of a run show a sentence, file metadata, and sentiment. We are now waiting for new PDFs (and other file types) to arrive in the directory for immediate processing. I have a JsonTreeReader, a Hortonworks Schema Registry, and an AvroRecordSetWriter; we set the properties and the schema registry for the reader and writer. Obviously we can use other readers and writers as needed for types like CSV. When I am done, since it's Apache NiFi 1.5, I commit my changes for versioning. Bam!

Attachment: tika.xml

For the upcoming processor I will be interfacing with:
https://wiki.apache.org/tika/TikaOCR
https://wiki.apache.org/tika/TikaAndNER
https://wiki.apache.org/tika/TikaAndNLTK
https://wiki.apache.org/tika/GrobidQuantitiesParser
https://wiki.apache.org/tika/TikaAndMITIE
https://wiki.apache.org/tika/AgeDetectionParser
https://wiki.apache.org/tika/TikaAndVision
https://wiki.apache.org/tika/TikaAndVisionDL4J
https://wiki.apache.org/tika/ImageCaption

Apache Tika has added some really cool updates, so I can't wait to dive in.
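Outside of NiFi, the same extraction step can be sanity-checked with the Apache Tika Python bindings; a minimal sketch, assuming `pip install tika` and a hypothetical local PDF:

```python
from tika import parser  # pip install tika; spins up a local Tika server on first use

# Extract metadata and plain text from a PDF, like the Tika processors do in the flow.
parsed = parser.from_file('sample.pdf')  # hypothetical input file
print(parsed['metadata'].get('Content-Type'))
print((parsed['content'] or '')[:500])   # first 500 characters of extracted text
```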
01-15-2018 09:10 PM
8 Kudos
For 2018, some awesome new in-demand features have come to my favorite Swiss Army Knife of IoT and enterprise development, Apache NiFi. Speaking of knives, for fun say "Apache NiFi" to Google Assistant. Okay, back to the awesome new release of Apache NiFi.

There are a couple of new processors that I want to highlight first. I am liking the new CountText processor, which is useful for counting elements of text documents like words and lines. My example flow uses it, and I see some useful metrics gathering there. I also think some of these counts could be used as file validation checks to feed to machine learning algorithms: my files of type X usually have this number of lines and words, but not this time. I have come across a couple of use cases on file ingest in the past that could use this. In one example, a company was reading personnel files from an SFTP server, and the first step in validation was checking that they received the proper number of lines, since there was one person per line. On another occasion, a client would sometimes receive bad files over FTP; they looked fine, but the last few records in a file would be missing, so each file needed to meet a minimum number of characters. In yet another, they were counting words for legal documents.

An example run populates the text.line.count and text.word.count attributes.
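As a rough cross-check of what CountText reports, here is a minimal Python sketch that computes the same line and word counts for a file (the file name is hypothetical):

```python
# Compute the same metrics CountText writes to text.line.count / text.word.count.
with open('personnel.txt') as f:  # hypothetical input file
    lines = f.readlines()

line_count = len(lines)
word_count = sum(len(line.split()) for line in lines)

# e.g. fail validation if a one-person-per-line feed has the wrong line count
print(f"text.line.count={line_count} text.word.count={word_count}")
```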
Another cool processor that I will talk about in greater detail in future articles is the much-requested Spark processor. The ExecuteSparkInteractive processor, with its Livy Controller Service, gives you a much better alternative to my hacky REST integration (https://community.hortonworks.com/articles/148730/integrating-apache-spark-2x-jobs-with-apache-nifi.html) for calling Apache Spark batch and machine learning jobs.
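Under the hood, the Livy Controller Service talks to Livy's REST API; here is a minimal Python sketch of that interaction (the endpoints are standard Livy, but the host/port and the Scala snippet are assumptions):

```python
import requests

LIVY = 'http://localhost:8998'  # assumed Livy server
headers = {'Content-Type': 'application/json'}

# Create an interactive Spark session.
session = requests.post(f'{LIVY}/sessions',
                        json={'kind': 'spark'}, headers=headers).json()

# Submit a statement; in practice you would poll until the session is idle first.
stmt = requests.post(f"{LIVY}/sessions/{session['id']}/statements",
                     json={'code': 'sc.parallelize(1 to 100).sum()'},
                     headers=headers).json()
print(stmt['id'], stmt['state'])
```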
There are a number of enhancements, new processors, and upgrades I'm excited about, but the main reason I am writing today is a new feature that allows for an agile SDLC with Apache NiFi. This is enabled by the new Apache NiFi Registry. It's as simple as a quick git clone or download, then using Apache Maven to install Apache NiFi Registry and start it. This process will become even easier with future Ambari integration for a CLI-free install.

To integrate the Registry with Apache NiFi, you need to add a Registry Client. It's very simple; to add the default local one, see below.

Accessing Apache NiFi Registry
By default, it will be running here: http://localhost:18080/nifi-registry/

I did a quick install and did not set any security; with the next HDF release, everything will be integrated and simple.

Accessing the Apache NiFi Flow Registry API

As is the case with Apache NiFi, there is a great REST API that comes with the new Apache NiFi Registry. This API is very well documented and easy to follow. It will allow for easy integration with all the popular DevOps automation tools, which will please all the DevOps-focused teams out there.

http://localhost:18080/nifi-registry-api/buckets
http://localhost:18080/nifi-registry-api/items
http://localhost:18080/nifi-registry-api/tenants/user-groups
http://localhost:18080/nifi-registry-api/tenants/users
http://localhost:18080/nifi-registry-api/policies
http://localhost:18080/nifi-registry-api/access

Example output:

{"identity":"anonymous","anonymous":true,"resourcePermissions":{"buckets":{"canRead":true,"canWrite":true,"canDelete":true},"tenants":{"canRead":true,"canWrite":true,"canDelete":true},"policies":{"canRead":true,"canWrite":true,"canDelete":true},"proxy":{"canRead":true,"canWrite":true,"canDelete":true},"anyTopLevelResource":{"canRead":true,"canWrite":true,"canDelete":true}}}

I added a few buckets to try out.
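Since it is just REST, scripting against it is straightforward; a minimal Python sketch listing buckets and items on the unsecured local install (the response field names are my assumption from the Registry's documented models):

```python
import requests

REGISTRY = 'http://localhost:18080/nifi-registry-api'

# List the buckets (Development, Testing, and Production in my case).
for bucket in requests.get(f'{REGISTRY}/buckets').json():
    print(bucket['identifier'], bucket['name'])

# Everything stored in the Registry: versioned flows with their bucket info.
for item in requests.get(f'{REGISTRY}/items').json():
    print(item['type'], item['name'])
```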
After you have done that, you can start using it in Apache NiFi. It could not be easier.

Step 1: Create or use an existing Process Group.

Step 2: Right-click and pick Version - Start version control. You then pick a Registry (if you have more than one) and a bucket. A bucket is a logical categorization of related flows; I created buckets for Development, Testing, and Production. You then add a name, description, and comments for this particular flow, and then save. You have just versioned a Process Group, and you can now run agile team development with Apache NiFi in your enterprise, with familiar version control, team development, and isolation.

Now you can edit your versioned flow and see that it has changed, and you can easily commit those changes or revert them. To see what changed, just pick "Show local changes" and you get a very slick display of what changed in which component.
Step 3: Now let's jump to Apache NiFi Registry and see what happened. My flow "Nifi 1.5 Test" has been stored in bucket "Tim" and has 3 saved versions.

An Example Versioned Test Flow

Now that your flow is version controlled, others can import it into their workspace (depending on security), and you can choose from any of the versions based on your needs. For teams, this part is awesome: you will know if there's a newer version, and you can pick that one if you wish. Or not. You can run many copies of the same flow with different variables and versions. My next article will be around updates to integrating with Apache Spark via Apache Livy.

Other steps: change to another version; commit your local changes (or revert them); save your flow version to any bucket or registry you have permissions to. Note that your Variable Registry is per versioned Process Group. This is the second version I am saving; add some comments.

New Sub-Project, Processors, Tasks and Services:

MoveHDFS Processor
Kafka 1.0 Processors
CSVRecordLookupService
New Graphite Reporting Task
Spark Job Executor with Apache Livy Integration
FlattenJSON Processor
DeleteMongo Processor
CountText Processor
Apache NiFi Registry

Resources:

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.5.0
https://github.com/apache/nifi
https://nifi.apache.org/download.html
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12316020&version=12341668

Apache NiFi Registry:

https://nifi.apache.org/docs/nifi-registry-docs/index.html
https://github.com/apache/nifi-registry#getting-started
https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-NiFiRegistry0.1.0
01-10-2018 05:17 AM
In Apache NiFi, use the SelectHiveQL processor and then use ConvertRecord to get JSON. The data coming out of the table is automatically Avro with an embedded schema.
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
https://community.hortonworks.com/articles/149891/handling-hl7-records-and-storing-in-apache-hive-fo.html
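Because the schema is embedded, you can inspect SelectHiveQL output directly; a minimal Python sketch with the fastavro package (the file name is hypothetical):

```python
from fastavro import reader  # pip install fastavro

# SelectHiveQL output is an Avro file with the schema embedded in the header.
with open('query_result.avro', 'rb') as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema)  # the schema derived from the Hive table
    for record in avro_reader:
        print(record)                 # each row as a Python dict
```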
01-05-2018 02:34 PM
You can hit endpoints / the NiFi REST API with JMeter. See:
https://www.xenonstack.com/blog/data-ingestion-using-apache-nifi-for-building-data-lakes-twitter-data
https://community.hortonworks.com/questions/82485/nifi-sizing-benchmark-conditions-and-number-of-sou.html
https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.0/bk_overview/content/performance-expectations-and-characteristics-of-nifi.html
Also check out this: https://gist.github.com/ijokarumawak/a0f7023225362e636f31d1376055e67c
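If you want a quick scripted probe before building the JMeter test plan, here is a minimal Python sketch hitting two standard NiFi REST endpoints; the host, port, and unsecured setup are assumptions:

```python
import requests

NIFI = 'http://localhost:8080/nifi-api'  # assumed unsecured NiFi instance

# Controller status summary: active threads, queued flowfiles, etc.
status = requests.get(f'{NIFI}/flow/status').json()
print(status['controllerStatus'])

# JVM and host metrics, useful as a benchmark baseline.
diag = requests.get(f'{NIFI}/system-diagnostics').json()
print(diag['systemDiagnostics']['aggregateSnapshot']['heapUtilization'])
```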
01-05-2018 01:46 PM
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_cluster-planning/bk_cluster-planning.pdf https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_cluster-planning/content/ch_hardware-recommendations_chapter.html
01-04-2018 08:42 PM
Can you post the logs and the error? The better way now is to get the JSON schema (you can use InferAvroSchema if you don't have one), then just do SplitJson into ConvertRecord; no manual coding needed. I have lots of new articles on this one.