Community Articles

TimothySpann · ‎05-25-2018

Detecting Language with Apache NiFi

This is a work in progress; I am still experimenting with libraries and techniques to improve it. I originally looked at Apache OpenNLP, Apache Tika, optimaize, deep learning and some older libraries. It turns out that most of them are defunct or use optimaize, so for this pass I am using the Apache Tika wrapper of optimaize.

I am testing with a few experts to see if we can parse the main languages we need.

JUnit Test

Add the Processor: download the NAR from GitHub, copy it to the NiFi lib directory, and restart the NiFi server. Do not run this in production.

A Quick Flow

It produces an attribute: langdetectTika

So far I have tested with Spanish (es) and English (en).

Source: https://github.com/optimaize/language-detector

JUnit Results

18:30:37.386 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999962696603875]]

18:30:37.389 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999953090550733]]

18:30:37.389 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999942717275939]]

18:30:37.389 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999938953293799]]

18:30:37.390 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999991278041699]]

18:30:37.390 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999961087425597]]

18:30:37.390 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999914584331221]]

Confidence:HIGH

Raw:0.99999523

Attribute:path = target

Attribute:filename = 356189695474847.mockFlowFile

Attribute:langdetectTika = es

Attribute:uuid = 401ef360-53c9-431a-a04c-43b736e3dda1

18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999913063395092]]

18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999943343981726]]

18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999997921395858]]

18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999900938658981]]

18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999981049962143]]

18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999981885752027]]

18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.999994743498219]]

Confidence:HIGH

Raw:0.99999523

Attribute:path = target

Attribute:filename = 356190481937339.mockFlowFile

Attribute:langdetectTika = en

Attribute:uuid = ce142262-c4b1-4f4c-8dd1-31afd90a0645
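
The debug lines above all share one shape, DetectedLanguage[code:probability], so they are easy to check mechanically when comparing test runs. Here is a minimal sketch that uses only the JDK. The class and method names are hypothetical and are not part of the processor, Tika, or optimaize, and it assumes the probabilities are printed as ordinary Java doubles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts language/probability pairs from the debug output. */
final class DetectedLanguageLogParser {

    /** One parsed entry, for example language "es" with probability 0.9999962696603875. */
    static final class Entry {
        final String language;
        final double probability;

        Entry(String language, double probability) {
            this.language = language;
            this.probability = probability;
        }
    }

    // Matches the "DetectedLanguage[es:0.9999...]" fragment of each log line.
    // The number group also accepts exponent notation such as 1.0E-5.
    private static final Pattern ENTRY =
            Pattern.compile("DetectedLanguage\\[([A-Za-z-]+):([0-9.eE+-]+)\\]");

    /** Returns every entry found in the text, in order of appearance. */
    static List<Entry> parse(String logText) {
        List<Entry> entries = new ArrayList<>();
        Matcher matcher = ENTRY.matcher(logText);
        while (matcher.find()) {
            // Assumes a well-formed number; parseDouble throws NumberFormatException otherwise.
            entries.add(new Entry(matcher.group(1), Double.parseDouble(matcher.group(2))));
        }
        return entries;
    }

    /** True when there is at least one entry and all entries name the language at or above the threshold. */
    static boolean allMatch(List<Entry> entries, String language, double minProbability) {
        if (entries.isEmpty()) {
            return false;
        }
        for (Entry entry : entries) {
            if (!entry.language.equals(language) || entry.probability < minProbability) {
                return false;
            }
        }
        return true;
    }
}
```

Feeding the Spanish block of the log to parse and then calling allMatch with "es" and a threshold such as 0.99 gives a quick pass/fail check for a test document.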

References:

There are many SDKs that tap the REST APIs of Google and Microsoft for translation.

https://cloud.google.com/translate/docs/

Optimaize Language Detector - supports 103 language options

https://github.com/optimaize/language-detector/blob/master/README.md

This other library hasn't been updated in 4 years and has no Maven repo, so it's on the back burner for now.

https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md

https://github.com/shuyo/language-detection/blob/wiki/Downloads.md

https://github.com/shuyo/language-detection/blob/wiki/Tutorial.md

Many of the other packages want you to train them on a corpus of text for all the languages you are interested in.

I downloaded the Apache OpenNLP 1.8.3 language model.

OpenNLP has its own model, but it's not great for short text.

These two have not been updated in nearly 8 years.

Download the NAR

https://github.com/tspannhw/nifi-langdetect-processor/releases/tag/1.6.0

Thanks to a commenter, we are going to investigate Facebook's fastText, which has good results in some tests: https://github.com/facebookresearch/fastText/ Thanks, Alex!

TimothySpann · ‎05-26-2018

A good use is to route on the attribute with RouteOnAttribute, or to turn it into a JSON record in a flow file and query it with QueryRecord.
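
In a real flow both ideas are configuration rather than code. A RouteOnAttribute property could hold an Expression Language test such as ${langdetectTika:equals('es')}, and a QueryRecord property could hold SQL such as SELECT * FROM FLOWFILE WHERE langdetectTika = 'es'. Those values are illustrative and untested against this processor. The plain-Java sketch below only mirrors the logic so it can be tried outside NiFi; the class name, route names, and the minimal JSON escaping are all assumptions.

```java
import java.util.Map;

/** Hypothetical sketch of the routing decision and the JSON record; not NiFi API code. */
final class LanguageRouting {

    /** Returns a route name for the detected language, or "unmatched" when absent or unknown. */
    static String routeFor(Map<String, String> attributes) {
        String language = attributes.get("langdetectTika");
        if (language == null) {
            return "unmatched";
        }
        switch (language) {
            case "es":
                return "spanish";
            case "en":
                return "english";
            default:
                return "unmatched";
        }
    }

    /** Builds a one-line JSON object from the requested attributes, skipping missing ones. */
    static String toJsonRecord(Map<String, String> attributes, String... keys) {
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (String key : keys) {
            String value = attributes.get(key);
            if (value == null) {
                continue; // the flow file does not carry this attribute
            }
            if (!first) {
                json.append(',');
            }
            json.append('"').append(escape(key)).append("\":\"").append(escape(value)).append('"');
            first = false;
        }
        return json.append('}').toString();
    }

    // Minimal escaping: backslashes and double quotes only, control characters are not handled.
    private static String escape(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

With the attributes from the Spanish test run above, routeFor returns "spanish", and toJsonRecord with the keys filename and langdetectTika yields a record that a record reader could hand to QueryRecord.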

AlexOtt · ‎05-30-2018

There is a difference between language-detector and OpenNLP's model: OpenNLP uses a different algorithm for training the classifier, and it should be more precise.

I evaluated different language detectors and wrote a post about it at http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html. The most precise results came from the fastText-based models.

TimothySpann · ‎05-30-2018

Thanks for the information. I'll try fastText.
