
I have just started working on updated Apache Tika and Apache OpenNLP processors for Apache NiFi 1.5, and while testing I found an interesting workflow that I would like to share.

I am using a few of my processors in this flow:

Here is the flow that I was working on.

54384-flowpart1.png

Step 1 - Load Some PDFs

Step 2 - Use the built-in Apache Tika Processor to extract metadata from the files

Step 3 - Pull Out the Text using my Apache Tika processor

Step 4 - Split this into individual lines

Step 5 - Extract the text of each line into a sentence attribute using the regular expression (^.*$)

Step 6 - Run NLP to analyze for names and locations on that sentence

Step 7 - Run Stanford CoreNLP sentiment analysis on the sentence
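
As a rough illustration of Steps 4 and 5 outside of NiFi, here is a minimal Python sketch that splits extracted text into lines and captures each line with the same (^.*$) pattern. The function name lines_to_sentences and the "sentence" key are hypothetical, and skipping blank lines is my assumption, not something the flow is confirmed to do.

```python
import re

# Same capture pattern used in Step 5: group 1 is the whole line.
LINE_PATTERN = re.compile(r"(^.*$)")


def lines_to_sentences(text):
    """Split text into lines and capture each one as a 'sentence' value.

    Blank lines are skipped (an assumption made for this sketch).
    """
    sentences = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match and match.group(1).strip():
            sentences.append({"sentence": match.group(1)})
    return sentences


if __name__ == "__main__":
    sample = "Hello world.\n\nTim lives in Princeton."
    for item in lines_to_sentences(sample):
        print(item)
```

Each resulting dictionary plays the role of one flow file carrying a single sentence attribute, ready for the NLP and sentiment steps.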

54385-flowpart2.png

Step 8 - I run my attribute cleaner to turn those attributes into Avro-safe names

Step 9 - I turn all the attributes into a JSON Flow File

Step 10 - I infer an Avro schema (I only need this once; then I'll remove it)

Step 11 - I set the name of the Schema to be looked up from the Schema Registry

Step 12 - I run QueryRecord to route POSITIVE, NEUTRAL and NEGATIVE sentiment to different places. Example SQL: SELECT * FROM FLOWFILE WHERE sentiment = 'NEGATIVE' (thanks, Apache Calcite!). We also convert from JSON to Avro for sending to Kafka and for easy conversion to Apache ORC for Apache Hive usage.

Steps 13, 14 and 15 - I send records to Kafka 1.0; some get merged to store as a file, and some get made into Slack messages.

Step 16 - Done
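
To make Steps 8 and 9 more concrete, here is a minimal Python sketch of the idea, not the actual processor code. Avro names must start with a letter or underscore and contain only letters, digits and underscores, so the sketch replaces every other character with an underscore. The helper names avro_safe_name and attributes_to_json are hypothetical.

```python
import json
import re


def avro_safe_name(name):
    """Return a name that satisfies Avro's naming rule [A-Za-z_][A-Za-z0-9_]*."""
    # Replace every character Avro does not allow with an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Avro names may not start with a digit (or be empty), so prefix one.
    if not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "_" + cleaned
    return cleaned


def attributes_to_json(attributes):
    """Clean every attribute name and serialize the result as a JSON object."""
    cleaned = {avro_safe_name(key): value for key, value in attributes.items()}
    return json.dumps(cleaned, sort_keys=True)


if __name__ == "__main__":
    print(attributes_to_json({"mime.type": "application/pdf", "sentiment": "NEGATIVE"}))
```

Note that two different attribute names can collapse to the same cleaned name (for example a.b and a_b); this sketch simply lets the later one win.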

54386-oneline.png

Here is an example of my generated JSON file.

54387-nlpvalues.png

Here are some of the attributes after the run.

54388-queryrecord.png

You can see the queries in the QueryRecord processor.
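
For readers who want to try the routing idea without a NiFi instance, here is a small Python sketch that runs one query per route against an in-memory table named FLOWFILE. It uses SQLite as a stand-in for Apache Calcite, so SQL dialect details may differ. Only the NEGATIVE query comes from the flow above; the POSITIVE and NEUTRAL queries are assumed by analogy, and the sentence and sentiment columns are a simplified, hypothetical record layout.

```python
import sqlite3

# One query per route, in the spirit of QueryRecord's dynamic properties.
ROUTES = {
    "positive": "SELECT * FROM FLOWFILE WHERE sentiment = 'POSITIVE'",
    "neutral": "SELECT * FROM FLOWFILE WHERE sentiment = 'NEUTRAL'",
    "negative": "SELECT * FROM FLOWFILE WHERE sentiment = 'NEGATIVE'",
}


def route_records(records):
    """Load records into an in-memory FLOWFILE table and run each route query."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute("CREATE TABLE FLOWFILE (sentence TEXT, sentiment TEXT)")
    conn.executemany(
        "INSERT INTO FLOWFILE (sentence, sentiment) VALUES (:sentence, :sentiment)",
        records,
    )
    routed = {
        name: [dict(row) for row in conn.execute(sql)]
        for name, sql in ROUTES.items()
    }
    conn.close()
    return routed


if __name__ == "__main__":
    sample = [
        {"sentence": "I love it", "sentiment": "POSITIVE"},
        {"sentence": "It is bad", "sentiment": "NEGATIVE"},
    ]
    for route, rows in route_records(sample).items():
        print(route, rows)
```

Each key in the result corresponds to one outgoing relationship, which is how the different sentiments end up going to Kafka, a merged file or Slack.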

54389-attributes.png

The results of a run showing a sentence, file metadata and sentiment.

54390-listfilestate.png

We are now waiting for new PDFs (and other file types) to arrive in the directory for immediate processing.

54392-readerwriterregistry.png

I have a JsonTreeReader, a Hortonworks Schema Registry and an AvroRecordSetWriter.

54393-jsontreereaderproperties.png

54394-avrorecordsetwriterproperties.png

We set the properties and the schema registry for the reader and writer. Obviously, we can use other readers and writers as needed for types like CSV.
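
The schema that the reader and writer look up is the one inferred once in Step 10. As a rough idea of what such a schema looks like, here is a minimal Python sketch that builds an Avro record schema from one flat JSON record. This is not how the NiFi processor works internally; the function name, the record name nlp_record, the sample fields and the fallback to string for unknown types are all assumptions made for illustration.

```python
import json

# Simplified mapping from Python value types to Avro primitive types.
AVRO_TYPES = {bool: "boolean", int: "long", float: "double", str: "string"}


def infer_avro_schema(record, name="nlp_record"):
    """Build a flat Avro record schema from a single JSON-like dictionary."""
    fields = []
    for key, value in record.items():
        # Anything not in the mapping (None, lists, nested objects) falls
        # back to string here; a real inference would handle those cases.
        avro_type = AVRO_TYPES.get(type(value), "string")
        fields.append({"name": key, "type": avro_type})
    return {"type": "record", "name": name, "fields": fields}


if __name__ == "__main__":
    sample = {"sentence": "Hi", "sentiment": "NEUTRAL", "page_count": 3}
    print(json.dumps(infer_avro_schema(sample), indent=2))
```

Once a schema like this is stored in the registry under a name, the inference step can be removed from the flow and the reader and writer can refer to the schema by that name.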

54395-tikawithversioning.png

When I am done, since it's Apache NiFi 1.5, I commit my changes for versioning.

Bam!

tika.xml

For the upcoming processor I will be interfacing with:

Apache Tika has added some really cool updates, so I can't wait to dive in.


Last update: 08-17-2019 09:22 AM