Created on 05-25-2018 08:07 PM - edited 08-17-2019 07:18 AM
Detecting Language with Apache NiFi
This is a work in progress; I am still experimenting with libraries and techniques to improve it. I originally looked at Apache OpenNLP, Apache Tika, Optimaize, deep learning and some older libraries. It turns out that most of them are either defunct or use Optimaize under the hood, so for this pass I am using the Apache Tika wrapper around Optimaize.
I am testing with a few experts to see if we can parse the main languages we need.
JUnit Test
Add the Processor (first copy the NAR to the NiFi lib directory and restart the NiFi server). Do not run this in production. Download the NAR from GitHub.
A Quick Flow
It produces an attribute: langdetectTika
So far I have tested with Spanish (es) and English (en).
Source: https://github.com/optimaize/language-detector
JUnit Results
18:30:37.386 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999962696603875]]
18:30:37.389 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999953090550733]]
18:30:37.389 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999942717275939]]
18:30:37.389 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999938953293799]]
18:30:37.390 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999991278041699]]
18:30:37.390 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999961087425597]]
18:30:37.390 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999914584331221]]
Confidence:HIGH Raw:0.99999523
Attribute:path = target
Attribute:filename = 356189695474847.mockFlowFile
Attribute:langdetectTika = es
Attribute:uuid = 401ef360-53c9-431a-a04c-43b736e3dda1

18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999913063395092]]
18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999943343981726]]
18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999997921395858]]
18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999900938658981]]
18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999981049962143]]
18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999981885752027]]
18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.999994743498219]]
Confidence:HIGH Raw:0.99999523
Attribute:path = target
Attribute:filename = 356190481937339.mockFlowFile
Attribute:langdetectTika = en
Attribute:uuid = ce142262-c4b1-4f4c-8dd1-31afd90a0645
References:
There are many SDKs that tap the REST APIs of Google and Microsoft for translation.
https://cloud.google.com/translate/docs/
Optimaize Language Detector - supports 103 language options
https://github.com/optimaize/language-detector/blob/master/README.md
This other library hasn't been updated in 4 years and has no Maven repo, so it's on the back burner for now.
https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
https://github.com/shuyo/language-detection/blob/wiki/Downloads.md
https://github.com/shuyo/language-detection/blob/wiki/Tutorial.md
Many of the other packages want you to train them on a corpus of text for all the languages you are interested in.
I downloaded the Apache OpenNLP 1.8.3 language model.
OpenNLP has its own model, but it's not great for short text.
These two have not been updated in nearly 8 years.
Download the NAR
https://github.com/tspannhw/nifi-langdetect-processor/releases/tag/1.6.0
Thanks to a commenter, we are going to investigate Facebook's fastText, which has good results in some tests: https://github.com/facebookresearch/fastText/ Thanks, Alex!
Created on 05-26-2018 01:48 AM
A good use is to route on the attribute with RouteOnAttribute, or to turn it into a JSON record in a flow file and query it with QueryRecord.
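To make the routing idea concrete, here is a minimal plain-Java sketch of what such a flow does with the langdetectTika attribute. It is not NiFi code: the class name, the route names ("spanish", "english", "unmatched") and the hand-built JSON are all assumptions for illustration. In an actual flow the routing would more likely be a RouteOnAttribute property using Expression Language, for example ${langdetectTika:equals('es')}.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration only; none of this is NiFi API.
class LangRouteSketch {

    // Mimics what RouteOnAttribute would decide: map the langdetectTika
    // value to a route name. The route names are made up for this sketch.
    static String route(Map<String, String> attributes) {
        String lang = attributes.get("langdetectTika");
        if ("es".equals(lang)) {
            return "spanish";
        }
        if ("en".equals(lang)) {
            return "english";
        }
        return "unmatched";
    }

    // Turns the attributes into a flat JSON object, one string field per
    // attribute, roughly the shape a record-based processor could query.
    static String toJsonRecord(Map<String, String> attributes) {
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            if (!first) {
                json.append(",");
            }
            json.append('"').append(escape(e.getKey())).append("\":\"")
                .append(escape(e.getValue())).append('"');
            first = false;
        }
        return json.append("}").toString();
    }

    // Escapes only backslashes and double quotes; enough for this sketch,
    // not a complete JSON string escaper.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("filename", "356189695474847.mockFlowFile");
        attributes.put("langdetectTika", "es");
        System.out.println(route(attributes));
        System.out.println(toJsonRecord(attributes));
    }
}
```

Anything that is not es or en falls through to the unmatched route, which matches the two languages tested so far.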
Created on 05-30-2018 08:09 AM
There is a difference between language-detector and OpenNLP's model: OpenNLP uses a different algorithm for training the classifier, and it should be more precise.
I evaluated different language detectors and wrote a post about it at http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html - the most precise results come from fastText-based models.
Created on 05-30-2018 11:20 AM
Thanks for the information. I'll try fastText.