Step 2 - Use the built-in Apache Tika Processor to extract metadata from the files
Step 3 - Pull Out the Text using my Apache Tika processor
Step 4 - Split this into individual lines
Step 5 - Extract the text of each line into a sentence attribute using the regex (^.*$)
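Steps 4 and 5 can be sketched outside NiFi: split the extracted text into lines, then capture each whole line with the same (^.*$) regex that ExtractText uses. The sample text is invented for illustration.

```python
import re

# Hypothetical extracted text standing in for the Tika output in the flow file.
text = "First sentence from the PDF.\nSecond sentence from the PDF."

# Step 4: split the content into individual lines (as SplitText does).
lines = text.splitlines()

# Step 5: the regex (^.*$) captures the whole line, which becomes
# the "sentence" attribute on each split flow file.
pattern = re.compile(r"^.*$")
sentences = [pattern.match(line).group(0) for line in lines]

print(sentences)
```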
Step 6 - Run NLP to analyze for names and locations on that sentence
Step 7 - Run Stanford CoreNLP sentiment analysis on the sentence
Step 8 - I run my attribute cleaner to turn those attributes into Avro-safe names
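A minimal sketch of what such a cleaner has to do: Avro field names must match [A-Za-z_][A-Za-z0-9_]*, so anything else is replaced. The helper name and the sample keys are my own for illustration, not the flow's actual processor.

```python
import re

def avro_safe(name: str) -> str:
    """Replace characters Avro does not allow in field names with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # An Avro name cannot start with a digit, so prefix an underscore.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned

# Tika-style metadata keys often contain dots, colons, and dashes.
print(avro_safe("pdf:PDFVersion"))  # -> pdf_PDFVersion
print(avro_safe("X-Parsed-By"))     # -> X_Parsed_By
```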
Step 9 - I turn all the attributes into a JSON Flow File
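Step 9 amounts to serializing the cleaned attributes as the new flow-file content, which AttributesToJSON handles in the flow. The attribute values below are invented for illustration.

```python
import json

# Hypothetical flow-file attributes after cleaning (illustrative values only).
attributes = {
    "sentence": "The new office in Paris opened on time.",
    "sentiment": "POSITIVE",
    "filename": "report.pdf",
}

# Serialize the selected attributes as a JSON document, replacing the
# flow-file content, as AttributesToJSON does.
flowfile_content = json.dumps(attributes, sort_keys=True)
print(flowfile_content)
```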
Step 10 - I Infer an Avro schema ( I only needed this once, then I'll remove it)
Step 11 - I set the name of the Schema to be looked up from the Schema Registry
Step 12 - I run QueryRecord to route POSITIVE, NEUTRAL and NEGATIVE sentiment to different places. Example SQL: SELECT * FROM FLOWFILE
WHERE sentiment = 'NEGATIVE'. Thanks, Apache Calcite! We also convert from JSON to Avro for sending to Kafka, which also makes for easy conversion to Apache ORC for Apache Hive usage.
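The routing logic can be simulated in a few lines: each SELECT ... WHERE sentiment = '...' statement acts as a filter over the records, and matching records go to that relationship. The records here are made up for illustration.

```python
# Stand-in for QueryRecord: one filter per sentiment label,
# mirroring SELECT * FROM FLOWFILE WHERE sentiment = '<label>'.
records = [
    {"sentence": "Great quarter.", "sentiment": "POSITIVE"},
    {"sentence": "Results were flat.", "sentiment": "NEUTRAL"},
    {"sentence": "Costs exploded.", "sentiment": "NEGATIVE"},
]

routes = {
    label: [r for r in records if r["sentiment"] == label]
    for label in ("POSITIVE", "NEUTRAL", "NEGATIVE")
}

print({label: len(rs) for label, rs in routes.items()})
```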
Step 13-14-15 - I send records to Kafka 1.0; some get merged and stored as a file, and some get turned into Slack messages.
Step 16 - Done.
Here is an example of my generated JSON file.
Here are some of the attributes after the run.
You can see the queries in the QueryRecord processor.
The results of a run showing a sentence, file metadata and sentiment.
We are now waiting for new PDFs (and other file types) to arrive in the directory for immediate processing.
I have a JSONTreeReader, a Hortonworks Schema Registry and an AvroRecordSetWriter.
We set the properties and the schema registry for the reader and writer. Obviously we can use other readers and writers as needed for formats like CSV.
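For reference, here is the kind of Avro schema the registry would serve to the reader and writer; the record and field names are hypothetical, not the flow's actual schema.

```python
import json

# A hypothetical Avro record schema for the sentiment records
# (an Avro schema is itself a JSON document).
schema = {
    "type": "record",
    "name": "sentimentRecord",
    "fields": [
        {"name": "sentence", "type": ["null", "string"]},
        {"name": "sentiment", "type": ["null", "string"]},
        {"name": "filename", "type": ["null", "string"]},
    ],
}

print(json.dumps(schema, indent=2))
```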
When I am done, since it's Apache NiFi 1.5, I commit my changes for versioning.