Created on 05-25-2018 08:07 PM - edited 08-17-2019 07:18 AM
Detecting Language with Apache NiFi
This is a work in progress, as I am still experimenting with libraries and techniques to improve it. I originally looked at Apache OpenNLP, Apache Tika, optimaize, deep learning, and some older libraries. It turns out that most of them are either defunct or built on optimaize, so for this pass I am using the Apache Tika wrapper around optimaize.
I am testing with a few experts to see whether we can detect the main languages we need.
Add the Processor
Download the NAR from GitHub, copy it to the NiFi lib directory, and restart the NiFi server. Do not run this in production.
A Quick Flow
It produces an attribute: langdetectTika
So far I have tested with Spanish (es) and English (en).
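To try the detection step outside NiFi, the Tika wrapper around optimaize can be called directly. The following is only a minimal sketch, not the processor's actual code. It assumes Apache Tika 1.x with the tika-langdetect module on the classpath (in Tika 2.x the class lives in org.apache.tika.langdetect.optimaize instead). The class name LangDetectSketch and the helper detectLanguage are made up for illustration.

```java
import java.io.IOException;

import org.apache.tika.langdetect.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

// Hypothetical stand-alone sketch; the NiFi processor itself may be written differently.
public class LangDetectSketch {

    // Hypothetical helper: returns the language code Tika reports for the text,
    // for example "en" or "es". Assumes Tika 1.x package names.
    static String detectLanguage(String text) throws IOException {
        // Load the language profiles bundled with the optimaize wrapper.
        LanguageDetector detector = new OptimaizeLangDetector().loadModels();
        // detect(...) returns the single best match for the given text.
        LanguageResult result = detector.detect(text);
        return result.getLanguage();
    }

    public static void main(String[] args) throws IOException {
        // Longer inputs generally give more reliable results than a few words.
        System.out.println(detectLanguage(
            "This is a plain English sentence that should be long enough to classify."));
        System.out.println(detectLanguage(
            "Esta es una frase sencilla en español que debería ser suficiente para clasificar."));
    }
}
```

Loading the models on every call keeps the sketch short; a processor would more likely load them once and reuse the detector. The returned code is presumably the kind of value that ends up in the langdetectTika attribute.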