Detecting Language with Apache NiFi


This is a work in progress: I am still experimenting with libraries and techniques to improve it. I originally looked at Apache OpenNLP, Apache Tika, optimaize, deep learning, and some older libraries. It turns out that most of them are either defunct or use optimaize themselves, so for this pass I am using the Apache Tika wrapper around optimaize.

I am testing with a few experts to see if we can parse the main languages we need.

JUnit Test

76410-langdetectunittest.png



Add the Processor

Download the NAR from GitHub, copy it to the NiFi lib directory, and restart the NiFi server. Do not run this in production.

76411-addthelangdetectprocessor.png

A Quick Flow

76412-langdetectflow.png

It produces an attribute: langdetectTika


So far I have tested with Spanish (es) and English (en).


76413-langdetectattributevalue.png


Source: https://github.com/optimaize/language-detector

JUnit Results

18:30:37.386 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999962696603875]]

18:30:37.389 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999953090550733]]

18:30:37.389 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999942717275939]]

18:30:37.389 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999938953293799]]

18:30:37.390 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999991278041699]]

18:30:37.390 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999961087425597]]

18:30:37.390 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999914584331221]]

Confidence:HIGH

Raw:0.99999523

Attribute:path = target

Attribute:filename = 356189695474847.mockFlowFile

Attribute:langdetectTika = es

Attribute:uuid = 401ef360-53c9-431a-a04c-43b736e3dda1

18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999913063395092]]

18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999943343981726]]

18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999997921395858]]

18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999900938658981]]

18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999981049962143]]

18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999981885752027]]

18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.999994743498219]]

Confidence:HIGH

Raw:0.99999523

Attribute:path = target

Attribute:filename = 356190481937339.mockFlowFile

Attribute:langdetectTika = en

Attribute:uuid = ce142262-c4b1-4f4c-8dd1-31afd90a0645
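
The DEBUG lines above all share the same DetectedLanguage[code:probability] shape, so they are easy to summarize when comparing test runs. The following is a minimal sketch, not part of the processor: the class and method names are hypothetical, it uses only the JDK, and it assumes the log format stays exactly as shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: summarizes optimaize DEBUG lines by language code. */
final class DetectedLanguageLogParser {

    // Matches fragments such as "DetectedLanguage[es:0.9999962696603875]".
    private static final Pattern DETECTED =
            Pattern.compile("DetectedLanguage\\[([A-Za-z-]+):([0-9]+(?:\\.[0-9]+)?)\\]");

    /** Returns the mean probability per language code, in first-seen order. */
    static Map<String, Double> meanProbabilityByLanguage(String log) {
        // Accumulator per language code: index 0 is the sum, index 1 is the count.
        Map<String, double[]> sums = new LinkedHashMap<>();
        Matcher m = DETECTED.matcher(log);
        while (m.find()) {
            double[] acc = sums.computeIfAbsent(m.group(1), k -> new double[2]);
            acc[0] += Double.parseDouble(m.group(2));
            acc[1] += 1;
        }
        Map<String, Double> means = new LinkedHashMap<>();
        sums.forEach((code, acc) -> means.put(code, acc[0] / acc[1]));
        return means;
    }

    public static void main(String[] args) {
        // Two lines copied from the JUnit output above.
        String log = String.join("\n",
            "18:30:37.386 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999962696603875]]",
            "18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999913063395092]]");
        meanProbabilityByLanguage(log).forEach(
            (code, mean) -> System.out.println(code + " -> " + mean));
    }
}
```

Lines that do not contain a DetectedLanguage fragment, such as the Attribute and Confidence lines, are simply skipped.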

References:

There are many SDKs that tap the REST APIs of Google and Microsoft for translation.

https://cloud.google.com/translate/docs/

Optimaize Language Detector - supports 103 language options

https://github.com/optimaize/language-detector/blob/master/README.md

This other library hasn't been updated in 4 years and has no Maven repo, so it's on the back burner for now.

https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md

https://github.com/shuyo/language-detection/blob/wiki/Downloads.md

https://github.com/shuyo/language-detection/blob/wiki/Tutorial.md

Many of the other packages want you to train them on a corpus of text for every language you are interested in.

I downloaded the Apache OpenNLP 1.8.3 language model.

OpenNLP has its own model, but it's not great for small text.

These two have not been updated in nearly 8 years.


Download the NAR

https://github.com/tspannhw/nifi-langdetect-processor/releases/tag/1.6.0

Thanks to a commenter, we are going to investigate Facebook's fastText, which has good results in some tests: https://github.com/facebookresearch/fastText/ Thanks, Alex!

Comments
Master Guru

A good use is with RouteOnAttribute, or you can turn the result into a JSON record in a flow file and query it with QueryRecord.
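
To illustrate the RouteOnAttribute idea: in NiFi itself this would be configured on the processor, for example with a property whose value is an Expression Language test such as ${langdetectTika:equals('es')}. The plain-Java sketch below only mimics that decision outside NiFi. The class name, the lang.xx and undetected route names, and the supported-language set are all assumptions made for illustration; only the langdetectTika attribute name comes from the article.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a routing decision based on the langdetectTika attribute. */
final class LanguageRouter {

    // The two languages the article reports testing; extend as needed.
    private static final Set<String> SUPPORTED = Set.of("en", "es");

    /** Returns an illustrative route name for a flow file's attributes. */
    static String routeFor(Map<String, String> attributes) {
        String lang = attributes.get("langdetectTika");
        if (lang == null || lang.isEmpty()) {
            return "undetected"; // the attribute was never set
        }
        // Known languages get their own route; everything else falls through.
        return SUPPORTED.contains(lang) ? "lang." + lang : "unmatched";
    }

    public static void main(String[] args) {
        System.out.println(routeFor(Map.of("langdetectTika", "es")));
        System.out.println(routeFor(Map.of("filename", "356190481937339.mockFlowFile")));
    }
}
```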

Explorer

There is a difference between language-detector and OpenNLP's model: OpenNLP uses a different algorithm for training the classifier, and it should be more precise.

I evaluated different language detectors and wrote a post about it at http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html. The most precise results came from fastText-based models.

Master Guru

Thanks for the information. I'll try fastText.