Posts: 1973
Kudos Received: 1225
Solutions: 124

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1914 | 04-03-2024 06:39 AM |
| | 3011 | 01-12-2024 08:19 AM |
| | 1643 | 12-07-2023 01:49 PM |
| | 2420 | 08-02-2023 07:30 AM |
| | 3361 | 03-29-2023 01:22 PM |
01-28-2018 04:16 PM
1 Kudo
If you have Hortonworks support, contact them. Otherwise, download the code, build it via Maven, then grab the HBase Delete NAR and copy it to your Apache NiFi server's lib directory (on each Apache NiFi server instance). This requires some prior Apache NiFi development experience.
01-28-2018 03:57 PM
2 Kudos
Using SiteToSiteProvenanceReportingTask to Send Provenance to Apache NiFi for Processing.
Eating our own provenance food! It's almost comically easy to do. You set up a reporting task on the server you are monitoring that sends the provenance data to your receiver. On the other server, you build a simple flow to ingest and process it. I stored it in HBase as JSON, since HBase is a good place to put a lot of data fast.
Send the Data
You need to create a SiteToSiteProvenanceReportingTask in Controller Settings - Reporting Tasks. It's pretty simple: set the destination NiFi server and a port name that you have already created.
Receive the Data and Process
An Individual JSON Record
Split the JSON into Records
$.[*]
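As a rough illustration of what SplitJson does with that JsonPath, here is a minimal Python sketch; the two-event payload is hypothetical, with field names borrowed from the scan output further down:

```python
import json

# Hypothetical payload: the reporting task sends provenance events as a JSON array.
payload = """[
  {"eventType": "ROUTE", "componentName": "RouteOnAttribute", "entitySize": 16},
  {"eventType": "SEND",  "componentName": "PutHBaseJSON",     "entitySize": 59}
]"""

# SplitJson with $.[*] emits one flowfile per top-level array element.
for event in json.loads(payload):
    print(json.dumps(event))
```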
Save to HBase (PutHBaseJSON)
First I have to create a table:

hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.6.2.0-205, r5210d2ed88d7e241646beab51e9ac147a973bdcc, Sat Aug 26 09:33:50 UTC 2017
hbase(main):001:0> create 'PROVENANCE', 'event'
0 row(s) in 2.9900 seconds
=> Hbase::Table - PROVENANCE
scan 'PROVENANCE'
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:previousAttributes, timestamp=1517159115042, value={"path":"./","filename":"humidity.583225-583284.log","s2s.address":"192.168.1.197:55032","s2s.host":"192.168.1.197","mime.type":"text/plain","uuid":"9006a1bb-d755-4272-b8d3-76e666c2a7c6","tailfile.original.path":"/opt/demo/logs/humidity.log"}
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:previousContentURI, timestamp=1517159115042, value=http://192.168.1.193:8080/nifi-api/provenance-events/61825/content/input
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:previousEntitySize, timestamp=1517159115042, value=59
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:processGroupId, timestamp=1517159115042, value=01611005-4e82-1491-ae5d-ca64f59491cb
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:processGroupName, timestamp=1517159115042, value=Process MiniFi Creator
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:timestamp, timestamp=1517159115042, value=2018-01-28T00:25:30.616Z
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:timestampMillis, timestamp=1517159115042, value=1517099130616
ff91e204-05b0-48aa-a666-7942e3f109ab column=event:updatedAttributes, timestamp=1517159115042, value={"RouteOnAttribute.Route":"humidity"}
ffde140c-3053-4b9d-89c6-14b68025384d column=event:actorHostname, timestamp=1517159114898, value=192.168.1.193
ffde140c-3053-4b9d-89c6-14b68025384d column=event:application, timestamp=1517159114898, value=NiFi Flow
ffde140c-3053-4b9d-89c6-14b68025384d column=event:childIds, timestamp=1517159114898, value=[]
ffde140c-3053-4b9d-89c6-14b68025384d column=event:componentId, timestamp=1517159114898, value=3a25cda9-0161-1000-813c-631724a10585
ffde140c-3053-4b9d-89c6-14b68025384d column=event:componentName, timestamp=1517159114898, value=RouteOnAttribute
ffde140c-3053-4b9d-89c6-14b68025384d column=event:componentType, timestamp=1517159114898, value=RouteOnAttribute
ffde140c-3053-4b9d-89c6-14b68025384d column=event:contentURI, timestamp=1517159114898, value=http://192.168.1.193:8080/nifi-api/provenance-events/61701/content/output
ffde140c-3053-4b9d-89c6-14b68025384d column=event:durationMillis, timestamp=1517159114898, value=-1
ffde140c-3053-4b9d-89c6-14b68025384d column=event:entityId, timestamp=1517159114898, value=9b017666-7ce9-45c5-9d0a-2f81e56d6fa8
ffde140c-3053-4b9d-89c6-14b68025384d column=event:entitySize, timestamp=1517159114898, value=16
ffde140c-3053-4b9d-89c6-14b68025384d column=event:entityType, timestamp=1517159114898, value=org.apache.nifi.flowfile.FlowFile
ffde140c-3053-4b9d-89c6-14b68025384d column=event:eventOrdinal, timestamp=1517159114898, value=61701
ffde140c-3053-4b9d-89c6-14b68025384d column=event:eventType, timestamp=1517159114898, value=ROUTE
ffde140c-3053-4b9d-89c6-14b68025384d column=event:lineageStart, timestamp=1517159114898, value=1517084974341
ffde140c-3053-4b9d-89c6-14b68025384d column=event:parentIds, timestamp=1517159114898, value=[]
ffde140c-3053-4b9d-89c6-14b68025384d column=event:platform, timestamp=1517159114898, value=nifi
ffde140c-3053-4b9d-89c6-14b68025384d column=event:previousAttributes, timestamp=1517159114898, value={"path":"./","filename":"uv.164064-164080.log","s2s.address":"192.168.1.197:55032","s2s.host":"192.168.1.197","mime.type":"text/plain","uuid":"9b017666-7ce9-45c5-9d0a-2f81e56d6fa8","tailfile.original.path":"/opt/demo/logs/uv.log"}
ffde140c-3053-4b9d-89c6-14b68025384d column=event:previousContentURI, timestamp=1517159114898, value=http://192.168.1.193:8080/nifi-api/provenance-events/61701/content/input
ffde140c-3053-4b9d-89c6-14b68025384d column=event:previousEntitySize, timestamp=1517159114898, value=16
ffde140c-3053-4b9d-89c6-14b68025384d column=event:processGroupId, timestamp=1517159114898, value=01611005-4e82-1491-ae5d-ca64f59491cb
ffde140c-3053-4b9d-89c6-14b68025384d column=event:processGroupName, timestamp=1517159114898, value=Process MiniFi Creator
ffde140c-3053-4b9d-89c6-14b68025384d column=event:timestamp, timestamp=1517159114898, value=2018-01-28T00:25:30.607Z
ffde140c-3053-4b9d-89c6-14b68025384d column=event:timestampMillis, timestamp=1517159114898, value=1517099130607
ffde140c-3053-4b9d-89c6-14b68025384d column=event:updatedAttributes, timestamp=1517159114898, value={"RouteOnAttribute.Route":"uv"}
1830 row(s) in 11.7680 seconds
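If you want to read the stored events back outside the shell, here is a minimal Python sketch using the happybase client; it assumes an HBase Thrift server is running on the default port, which is not part of the flow above:

```python
import happybase

# Connect through the HBase Thrift gateway (assumed running on localhost:9090).
connection = happybase.Connection('localhost')
table = connection.table('PROVENANCE')

# Scan a handful of rows; row keys are the provenance event UUIDs.
for row_key, columns in table.scan(limit=5):
    event_type = columns.get(b'event:eventType', b'').decode()
    component = columns.get(b'event:componentName', b'').decode()
    print(row_key.decode(), event_type, component)
```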
Attachment: provenancereporting.xml

Learning to Use HBase: https://hortonworks.com/hadoop-tutorial/introduction-apache-hbase-concepts-apache-phoenix-new-backup-restore-utility-hbase/
01-25-2018 05:59 PM
Apache NiFi 1.5, with all its beauty, will be in HDF 3.1, which will be available before you know it.
01-24-2018 07:56 PM
If you do a PutHDFS after ConvertAvroToORC, you get a hive.ddl attribute (set by ConvertAvroToORC) and an absolute.hdfs.path attribute (set by PutHDFS) that can be used to create a Hive table. You can also build the statement with UpdateAttribute using: ${hive.ddl} LOCATION '${absolute.hdfs.path}'
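To make the composition concrete, here is a tiny Python sketch of what that Expression Language line evaluates to; both attribute values are hypothetical:

```python
# Hypothetical attribute values, as ConvertAvroToORC and PutHDFS might set them.
attrs = {
    "hive.ddl": "CREATE EXTERNAL TABLE IF NOT EXISTS sensors (ts BIGINT, temp DOUBLE) STORED AS ORC",
    "absolute.hdfs.path": "/data/sensors",
}

# Equivalent of the Expression Language: ${hive.ddl} LOCATION '${absolute.hdfs.path}'
ddl = f"{attrs['hive.ddl']} LOCATION '{attrs['absolute.hdfs.path']}'"
print(ddl)
```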
01-18-2018 12:43 AM
6 Kudos
I have just started working on updated Apache Tika and Apache OpenNLP processors for Apache NiFi 1.5, and while testing I found an interesting workflow I would like to share. I am using a few of my processors in this flow:
https://github.com/tspannhw/nifi-attributecleaner-processor <- just updated this
https://github.com/tspannhw/nifi-nlp-processor
https://github.com/tspannhw/nifi-extracttext-processor <- this needs a major version refresh
https://github.com/tspannhw/nifi-corenlp-processor

Here is the flow that I was working on:

Step 1 - Load some PDFs.
Step 2 - Use the built-in Apache Tika processor to extract metadata from the files.
Step 3 - Pull out the text using my Apache Tika processor.
Step 4 - Split this into individual lines.
Step 5 - Extract the text of each line into an attribute ((^.*$)) as a sentence.
Step 6 - Run NLP on that sentence to analyze for names and locations.
Step 7 - Run Stanford CoreNLP sentiment analysis on the sentence.
Step 8 - Run my attribute cleaner to turn those attributes into Avro-safe names.
Step 9 - Turn all the attributes into a JSON flowfile.
Step 10 - Infer an Avro schema (I only needed this once, then I'll remove it).
Step 11 - Set the name of the schema to be looked up from the Schema Registry.
Step 12 - Run QueryRecord to route POSITIVE, NEUTRAL, and NEGATIVE sentiment to different places. Example SQL: SELECT * FROM FLOWFILE WHERE sentiment = 'NEGATIVE' (thanks, Apache Calcite!). We also convert from JSON to Avro for sending to Kafka and for easy conversion to Apache ORC for Apache Hive usage.
Steps 13-15 - Send records to Kafka 1.0; some get merged and stored as a file, and some are made into Slack messages.
Step 16 - Done.

You can see the queries in the QueryRecord processor, and the results of a run show a sentence, file metadata, and sentiment. We are now waiting for new PDFs (and other file types) to arrive in the directory for immediate processing. I have a JsonTreeReader, a Hortonworks Schema Registry, and an AvroRecordSetWriter; we set the properties and the schema registry for the reader and writer. Obviously we can use other readers and writers as needed for types like CSV. When I am done, since it's Apache NiFi 1.5, I commit my changes for versioning. Bam!

Attachment: tika.xml

For the upcoming processor I will be interfacing with:
https://wiki.apache.org/tika/TikaOCR
https://wiki.apache.org/tika/TikaAndNER
https://wiki.apache.org/tika/TikaAndNLTK
https://wiki.apache.org/tika/GrobidQuantitiesParser
https://wiki.apache.org/tika/TikaAndMITIE
https://wiki.apache.org/tika/AgeDetectionParser
https://wiki.apache.org/tika/TikaAndVision
https://wiki.apache.org/tika/TikaAndVisionDL4J
https://wiki.apache.org/tika/ImageCaption

Apache Tika has added some really cool updates, so I can't wait to dive in.
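Outside of NiFi, the same extraction step can be sanity-checked with the Apache Tika Python bindings; a minimal sketch, assuming `pip install tika` and a hypothetical local PDF:

```python
from tika import parser  # pip install tika; spins up a local Tika server on first use

# Extract metadata and plain text from a PDF, like the Tika processors do in the flow.
parsed = parser.from_file('sample.pdf')  # hypothetical input file
print(parsed['metadata'].get('Content-Type'))
print((parsed['content'] or '')[:500])   # first 500 characters of extracted text
```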
01-15-2018 09:10 PM
8 Kudos
For 2018, some awesome new in-demand features have come to my favorite Swiss Army Knife of IoT and enterprise development, Apache NiFi. Speaking of knives, for fun say "Apache NiFi" to Google Assistant. Okay, back to the awesome new release of Apache NiFi.

There are a couple of new processors that I want to highlight first. I am liking the new CountText processor, which is useful for counting elements of text documents like words and lines. My example flow uses it, and I see some useful metrics gathering there. I also think some of these counts could be used as file validation checks to feed to machine learning algorithms: my files of type X usually have this number of lines and words, but not this time. I have come across a couple of use cases on file ingest in the past that could use this. In one example, a company was reading personnel files from an SFTP server, and the first step in validation was checking that they received the proper number of lines, since there was one person per line. On another occasion, a client would sometimes receive bad files over FTP; they looked fine, but the last few records in a file would be missing, so each file needed to meet a minimum number of characters. In yet another, they were counting words for legal documents.

An example run populates the text.line.count and text.word.count attributes.
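As a rough cross-check of what CountText reports, here is a minimal Python sketch that computes the same line and word counts for a file (the file name is hypothetical):

```python
# Compute the same metrics CountText writes to text.line.count / text.word.count.
with open('personnel.txt') as f:  # hypothetical input file
    lines = f.readlines()

line_count = len(lines)
word_count = sum(len(line.split()) for line in lines)

# e.g. fail validation if a one-person-per-line feed has the wrong line count
print(f"text.line.count={line_count} text.word.count={word_count}")
```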
Another cool processor that I will talk about in greater detail in future articles is the much-requested Spark processor. The ExecuteSparkInteractive processor, with its Livy Controller Service, gives you a much better alternative to my hacky REST integration (https://community.hortonworks.com/articles/148730/integrating-apache-spark-2x-jobs-with-apache-nifi.html) for calling Apache Spark batch and machine learning jobs.
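Under the hood, the Livy Controller Service talks to Livy's REST API; here is a minimal Python sketch of that interaction (the endpoints are standard Livy, but the host/port and the Scala snippet are assumptions):

```python
import requests

LIVY = 'http://localhost:8998'  # assumed Livy server
headers = {'Content-Type': 'application/json'}

# Create an interactive Spark session.
session = requests.post(f'{LIVY}/sessions',
                        json={'kind': 'spark'}, headers=headers).json()

# Submit a statement; in practice you would poll until the session is idle first.
stmt = requests.post(f"{LIVY}/sessions/{session['id']}/statements",
                     json={'code': 'sc.parallelize(1 to 100).sum()'},
                     headers=headers).json()
print(stmt['id'], stmt['state'])
```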
There are a number of enhancements, new processors, and upgrades I'm excited about, but the main reason I am writing today is a new feature that allows for an agile SDLC with Apache NiFi. This is enabled by the new Apache NiFi Registry. It's as simple as a quick git clone or download, then using Apache Maven to install Apache NiFi Registry and start it. This process will become even easier with future Ambari integration for a CLI-free install.

To integrate the Registry with Apache NiFi, you need to add a Registry Client. It's very simple; to add the default local one, see below.

Accessing Apache NiFi Registry
By default, it will be running here: http://localhost:18080/nifi-registry/

I did a quick install and did not set any security; with the next HDF release, everything will be integrated and simple.

Accessing the Apache NiFi Flow Registry API

As is the case with Apache NiFi, there is a great REST API that comes with the new Apache NiFi Registry. This API is very well documented and easy to follow. It will allow for easy integration with all the popular DevOps automation tools, which will please all the DevOps-focused teams out there.

http://localhost:18080/nifi-registry-api/buckets
http://localhost:18080/nifi-registry-api/items
http://localhost:18080/nifi-registry-api/tenants/user-groups
http://localhost:18080/nifi-registry-api/tenants/users
http://localhost:18080/nifi-registry-api/policies
http://localhost:18080/nifi-registry-api/access

Example output:

{"identity":"anonymous","anonymous":true,"resourcePermissions":{"buckets":{"canRead":true,"canWrite":true,"canDelete":true},"tenants":{"canRead":true,"canWrite":true,"canDelete":true},"policies":{"canRead":true,"canWrite":true,"canDelete":true},"proxy":{"canRead":true,"canWrite":true,"canDelete":true},"anyTopLevelResource":{"canRead":true,"canWrite":true,"canDelete":true}}}

I added a few buckets to try out.
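Since it is just REST, scripting against it is straightforward; a minimal Python sketch listing buckets and items on the unsecured local install (the response field names are my assumption from the Registry's documented models):

```python
import requests

REGISTRY = 'http://localhost:18080/nifi-registry-api'

# List the buckets (Development, Testing, and Production in my case).
for bucket in requests.get(f'{REGISTRY}/buckets').json():
    print(bucket['identifier'], bucket['name'])

# Everything stored in the Registry: versioned flows with their bucket info.
for item in requests.get(f'{REGISTRY}/items').json():
    print(item['type'], item['name'])
```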
After you have done that, you can start using it in Apache NiFi. It could not be easier.

Step 1: Create or use an existing Process Group.

Step 2: Right-click and pick Version - Start version control. You then pick a Registry (if you have more than one) and a bucket. A bucket is a logical categorization of related flows; I created buckets for Development, Testing, and Production. You then add a name, description, and comments for this particular flow, and then save. You have just versioned a Process Group, and you can now run agile team development with Apache NiFi in your enterprise, with familiar version control, team development, and isolation.

Now you can edit your versioned flow and see that it has changed, and you can easily commit those changes or revert them. To see what changed, just pick "Show local changes" and you get a very slick display of what changed in which component.
Step 3: Now let's jump to Apache NiFi Registry and see what happened. My flow "Nifi 1.5 Test" has been stored in bucket "Tim" and has 3 saved versions.

An Example Versioned Test Flow

Now that your flow is version controlled, others can import it into their workspace (depending on security), and you can choose from any of the versions based on your needs. For teams, this part is awesome: you will know if there's a newer version, and you can pick that one if you wish. Or not. You can run many copies of the same flow with different variables and versions. My next article will be around updates to integrating with Apache Spark via Apache Livy.

Other steps: change to another version; commit your local changes (or revert them); save your flow version to any bucket or registry you have permissions to. Note that your Variable Registry is per versioned Process Group. This is the second version I am saving; add some comments.

New Sub-Project, Processors, Tasks and Services:

MoveHDFS Processor
Kafka 1.0 Processors
CSVRecordLookupService
New Graphite Reporting Task
Spark Job Executor with Apache Livy Integration
FlattenJSON Processor
DeleteMongo Processor
CountText Processor
Apache NiFi Registry

Resources:

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.5.0
https://github.com/apache/nifi
https://nifi.apache.org/download.html
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12316020&version=12341668

Apache NiFi Registry:

https://nifi.apache.org/docs/nifi-registry-docs/index.html
https://github.com/apache/nifi-registry#getting-started
https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-NiFiRegistry0.1.0
01-10-2018 05:17 AM
In Apache NiFi, use the SelectHiveQL processor and then use ConvertRecord to get JSON. The data coming out of the table is automatically Avro with an embedded schema.
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
https://community.hortonworks.com/articles/149891/handling-hl7-records-and-storing-in-apache-hive-fo.html
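Because the schema is embedded, you can inspect SelectHiveQL output directly; a minimal Python sketch with the fastavro package (the file name is hypothetical):

```python
from fastavro import reader  # pip install fastavro

# SelectHiveQL output is an Avro file with the schema embedded in the header.
with open('query_result.avro', 'rb') as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema)  # the schema derived from the Hive table
    for record in avro_reader:
        print(record)                 # each row as a Python dict
```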
01-05-2018 02:34 PM
You can hit endpoints / the NiFi REST API with JMeter. See:
https://www.xenonstack.com/blog/data-ingestion-using-apache-nifi-for-building-data-lakes-twitter-data
https://community.hortonworks.com/questions/82485/nifi-sizing-benchmark-conditions-and-number-of-sou.html
https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.0/bk_overview/content/performance-expectations-and-characteristics-of-nifi.html
Also check out this: https://gist.github.com/ijokarumawak/a0f7023225362e636f31d1376055e67c
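If you want a quick scripted probe before building the JMeter test plan, here is a minimal Python sketch hitting two standard NiFi REST endpoints; the host, port, and unsecured setup are assumptions:

```python
import requests

NIFI = 'http://localhost:8080/nifi-api'  # assumed unsecured NiFi instance

# Controller status summary: active threads, queued flowfiles, etc.
status = requests.get(f'{NIFI}/flow/status').json()
print(status['controllerStatus'])

# JVM and host metrics, useful as a benchmark baseline.
diag = requests.get(f'{NIFI}/system-diagnostics').json()
print(diag['systemDiagnostics']['aggregateSnapshot']['heapUtilization'])
```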
01-05-2018 01:46 PM
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_cluster-planning/bk_cluster-planning.pdf https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_cluster-planning/content/ch_hardware-recommendations_chapter.html
01-04-2018 08:42 PM
Can you post the logs and the error? The better way now is to get the JSON schema (you can use InferAvroSchema if you don't have one), then just do SplitJson into ConvertRecord; no manual coding needed. I have lots of new articles on this one.