
Detecting Language with Apache NiFi


This is a work in progress, as I am still experimenting with some libraries and techniques to improve this. I originally looked at Apache OpenNLP, Apache Tika, optimaize, deep learning, and some older libraries. It turns out that most of them are defunct or use optimaize, so I am using the Apache Tika wrapper of optimaize for this pass.

I am testing with a few experts to see if we can parse the main languages we need.

JUnit Test

76410-langdetectunittest.png



Add the Processor (first copy the NAR to the NiFi lib directory and restart the NiFi server). Do not run this in production. Download the NAR from GitHub.

76411-addthelangdetectprocessor.png

A Quick Flow

76412-langdetectflow.png

It produces an attribute: langdetectTika


So far I have tested with Spanish (es) and English (en).


76413-langdetectattributevalue.png


Source: https://github.com/optimaize/language-detector

JUnit Results

18:30:37.386 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999962696603875]]

18:30:37.389 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999953090550733]]

18:30:37.389 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999942717275939]]

18:30:37.389 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999938953293799]]

18:30:37.390 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999991278041699]]

18:30:37.390 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999961087425597]]

18:30:37.390 [pool-1-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[es:0.9999914584331221]]

Confidence:HIGH

Raw:0.99999523

Attribute:path = target

Attribute:filename = 356189695474847.mockFlowFile

Attribute:langdetectTika = es

Attribute:uuid = 401ef360-53c9-431a-a04c-43b736e3dda1

18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999913063395092]]

18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999943343981726]]

18:30:37.907 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999997921395858]]

18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999900938658981]]

18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999981049962143]]

18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.9999981885752027]]

18:30:37.908 [pool-2-thread-1] DEBUG com.optimaize.langdetect.LanguageDetectorImpl - ==> [DetectedLanguage[en:0.999994743498219]]

Confidence:HIGH

Raw:0.99999523

Attribute:path = target

Attribute:filename = 356190481937339.mockFlowFile

Attribute:langdetectTika = en

Attribute:uuid = ce142262-c4b1-4f4c-8dd1-31afd90a0645
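
The DEBUG lines above all carry the same fragment, DetectedLanguage[code:probability]. If you want to pull the language code and probability out of such lines, for example to compare test runs, a minimal sketch could look like the following. The class name DetectedLanguageLine is hypothetical and is not part of optimaize, Tika, or the processor; it only assumes the log shape shown above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of optimaize or Tika): extracts the language
// code and probability from a log line containing "DetectedLanguage[es:0.99...]".
final class DetectedLanguageLine {
    // Group 1: language code (e.g. "es", "zh-CN"); group 2: decimal probability,
    // optionally with an exponent.
    private static final Pattern DETECTED = Pattern.compile(
            "DetectedLanguage\\[([A-Za-z-]+):([0-9]+(?:\\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)\\]");

    final String language;
    final double probability;

    private DetectedLanguageLine(String language, double probability) {
        this.language = language;
        this.probability = probability;
    }

    // Returns an empty Optional when the line has no DetectedLanguage fragment,
    // such as the "Confidence:HIGH" or "Attribute:..." lines.
    static Optional<DetectedLanguageLine> parse(String logLine) {
        Matcher m = DETECTED.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(
                new DetectedLanguageLine(m.group(1), Double.parseDouble(m.group(2))));
    }

    public static void main(String[] args) {
        String line = "18:30:37.386 [pool-1-thread-1] DEBUG "
                + "com.optimaize.langdetect.LanguageDetectorImpl - "
                + "==> [DetectedLanguage[es:0.9999962696603875]]";
        parse(line).ifPresent(d ->
                System.out.println(d.language + " " + d.probability));
    }
}
```

Only the first DetectedLanguage fragment on a line is read, which matches the single-candidate lines in this test output.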

References:

There are many SDKs that tap the REST APIs of Google and Microsoft for translation.

https://cloud.google.com/translate/docs/

Optimaize Language Detector - supports 103 language options

https://github.com/optimaize/language-detector/blob/master/README.md

This other library hasn't been updated in 4 years and has no Maven repo, so it's on the back burner for now.

https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md

https://github.com/shuyo/language-detection/blob/wiki/Downloads.md

https://github.com/shuyo/language-detection/blob/wiki/Tutorial.md

Many of the other packages want you to train them on a corpus of text for all the languages you are interested in.

I downloaded the Apache OpenNLP 1.8.3 language model.

OpenNLP has its own model, but it's not great for small text.

These two have not been updated in nearly 8 years.


Download the NAR

https://github.com/tspannhw/nifi-langdetect-processor/releases/tag/1.6.0

Thanks to a commenter, we are going to investigate Facebook's fastText, which has good results in some tests: https://github.com/facebookresearch/fastText/ Thanks, Alex!

Comments
Super Guru

A good use is with RouteOnAttribute, or you can turn the result into a JSON record in a flow file and query it with QueryRecord.
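
As a rough illustration of that idea: in RouteOnAttribute you would add a dynamic property whose value is an Expression Language test such as ${langdetectTika:equals('es')}, and in QueryRecord a query along the lines of SELECT * FROM FLOWFILE WHERE langdetectTika = 'es'. The plain Java sketch below mimics both steps outside NiFi so the logic is easy to see. The class name, the relationship names (spanish, english, unmatched), and the two-field record layout are assumptions made for this example; only the langdetectTika attribute name comes from the processor.

```java
import java.util.Map;

// Hypothetical sketch of what the flow does with the langdetectTika attribute.
// It does not use the NiFi API; it only mirrors the routing decision and the
// JSON record you might hand to QueryRecord.
final class LanguageRouteSketch {
    // Attribute name written by the processor in the article.
    static final String LANG_ATTRIBUTE = "langdetectTika";

    // Mirrors RouteOnAttribute: pick a relationship name from the attribute value.
    // The relationship names are assumptions for this example.
    static String route(Map<String, String> attributes) {
        String lang = attributes.get(LANG_ATTRIBUTE);
        if ("es".equals(lang)) {
            return "spanish";
        }
        if ("en".equals(lang)) {
            return "english";
        }
        return "unmatched";
    }

    // Builds a one-line JSON record carrying the filename and detected language.
    static String toJsonRecord(String filename, String lang) {
        return "{\"filename\":\"" + escape(filename) + "\","
                + "\"" + LANG_ATTRIBUTE + "\":\"" + escape(lang) + "\"}";
    }

    // Minimal escaping of backslashes and quotes only; control characters are
    // not handled, so use a real JSON library for untrusted input.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Map<String, String> attrs =
                java.util.Collections.singletonMap(LANG_ATTRIBUTE, "es");
        System.out.println(route(attrs));
        System.out.println(toJsonRecord("356189695474847.mockFlowFile", "es"));
    }
}
```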

New Contributor

There is a difference between language-detector and OpenNLP's model: OpenNLP uses a different algorithm for training the classifier, and it should be more precise.

I evaluated different language detectors and wrote a post about it at http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html. The most precise results came from fastText-based models.

Super Guru

Thanks for the information. I'll try fastText.

Last update: 08-17-2019 07:18 AM (revision 2 of 2)