Support Questions


What is the recommended NLP solution on top of the HDP stack for text analytics?

Cloudera Employee

What is the recommended NLP solution on top of the HDP stack for text analytics? I know Tika/Stanbol etc. can be used for this, but are they the recommended technologies? Is there anything better, especially using Spark?

The use case at hand is to scan comments (free text) and generate insights in the form of recommendations.

1 ACCEPTED SOLUTION

Guru

There is a range of common NLP systems that work well on the platform. OpenNLP is a Java-native library that integrates well with, for example, MapReduce, and NLTK, being a Python system, works well with PySpark. There are also native Spark components relevant to NLP tasks: Latent Dirichlet Allocation (LDA) for topic detection is one example. The NLTK components also work well with Hive for tasks such as tokenisation and part-of-speech tagging.
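As a minimal sketch of the tokenisation step mentioned above, here is a pure-Python stand-in for what an NLTK tokenizer (e.g. `nltk.word_tokenize`) would do; the `tokenize` function and the commented RDD call are illustrative assumptions, not a specific HDP recipe:

```python
import re

def tokenize(comment: str) -> list[str]:
    """Lowercase and split free text into word tokens.

    Crude stand-in for a real NLP tokenizer such as nltk.word_tokenize;
    the regex here is an illustrative assumption.
    """
    return re.findall(r"[a-z0-9']+", comment.lower())

# In PySpark this function would typically be applied per record, e.g.:
#   tokens = sc.textFile("hdfs:///comments").map(tokenize)
# Here we just run it locally on one sample comment.
print(tokenize("The delivery was late, but support was helpful!"))
```

The same per-record function shape works whether it is called from a PySpark `map`, a Hive UDF, or a plain Python script, which is what makes NLTK-style tokenisation easy to distribute.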

Stanford CoreNLP also provides a good toolkit of NLP functions, and there is a Spark package that integrates it with Spark ML pipelines.

Solr provides a number of useful tools that apply in the NLP space as well, such as stemming and synonym handling, as part of its indexing and querying, so it offers some building blocks for simple NLP analysis.
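To make the Solr building blocks concrete, here is a toy sketch of two analysis steps a Solr analyzer chain performs at index/query time: suffix stemming and synonym expansion. The suffix rules and synonym map below are made-up examples for illustration, not Solr's actual configuration.

```python
# Made-up synonym map; in Solr this would live in a synonyms.txt file
# consumed by a SynonymGraphFilter.
SYNONYMS = {"tv": "television", "laptop": "notebook"}

def stem(token: str) -> str:
    """Strip a few common suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(tokens: list[str]) -> list[str]:
    """Apply stemming, then map any known synonyms to a canonical form."""
    return [SYNONYMS.get(stem(t), stem(t)) for t in tokens]

print(analyze(["shipping", "delayed", "tv"]))
```

Even this crude version shows why such steps matter: queries for "tv" and documents mentioning "television" end up matching the same canonical token.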

There are also a number of commercial and partner solutions which handle NLP tasks.

We are also looking to build tools for Entity Resolution on Spark, which will add to this.


3 REPLIES

Master Mentor

@Ofer Mendelevith Please see this.


Contributor

@Simon Elliston Ball is right: there is a huge variety of options because there are many niches within natural language processing. Keep in mind that NLP libraries rarely solve business problems directly. Rather, they give you the tools to build a solution: segmenting free text into chunks suitable for analysis (e.g. sentence disambiguation), annotating free text (e.g. part-of-speech tagging), or converting free text to a more structured form (e.g. vectorization). All of these are useful in processing text but are insufficient by themselves. They help you convert free, unstructured text into a form suitable as input to a normal machine learning or analysis pipeline (classification, etc.). The one exception I can think of is sentiment analysis; that is a properly valuable analytic in and of itself.
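The vectorization step mentioned above can be sketched with a toy bag-of-words encoder: it turns free-text comments into fixed-length count vectors, the structured form a downstream classifier (Spark ML, scikit-learn, etc.) expects. The sample comments and whitespace splitting are illustrative assumptions.

```python
from collections import Counter

# Illustrative free-text comments (stand-ins for the real data).
comments = [
    "great product great price",
    "late delivery poor support",
]

# Build a fixed vocabulary from every word seen in the corpus.
vocab = sorted({w for c in comments for w in c.split()})

def vectorize(text: str) -> list[int]:
    """Map free text to a fixed-length vector of word counts over vocab."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

# Each comment becomes one row of a feature matrix for a classifier.
vectors = [vectorize(c) for c in comments]
print(vocab)
print(vectors)
```

This is the sense in which NLP tooling feeds, rather than replaces, the analysis pipeline: the vectors, not the raw text, are what a classifier or clustering algorithm consumes.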

Also, keep in mind that the licenses for some of these libraries are not as permissive as Apache's (e.g. CoreNLP is GPL, with the option to purchase a license for commercial use).