What is recommended NLP solution on top of HDP stack for Text Analytics

Cloudera Employee

What is the recommended NLP solution on top of the HDP stack for text analytics? I know Tika, Stanbol, etc. can be used for this, but are they the recommended technologies? Is there anything better, especially something that uses Spark?

The use case at hand is to scan comments (free text) and generate insights in the form of recommendations.

Accepted Solutions

Re: What is recommended NLP solution on top of HDP stack for Text Analytics

Guru

A range of common NLP systems work well on the platform. OpenNLP is a Java-native library that integrates well with, for example, MapReduce, and NLTK, being a Python library, works well with PySpark. Spark also has native components related to NLP tasks; Latent Dirichlet Allocation for topic detection is one example. NLTK components also work well with Hive for tasks such as tokenisation and part-of-speech tagging.

Stanford CoreNLP also provides a good toolkit of NLP functions, and there is a Spark package that integrates it with Spark ML pipelines.

Solr also provides a number of useful tools that apply in the NLP space, such as stemming and synonym handling, as part of its indexing and querying, so it offers some building blocks for simple NLP analysis.
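To give a rough idea of what stemming and synonym handling do to a comment at index or query time, here is a minimal pure-Python sketch. It is not Solr's implementation or configuration: the synonym map, the suffix list, and the function names (naive_stem, analyze) are hypothetical and only illustrate the idea of an analysis chain.

```python
import re

# Hypothetical synonym map; a real deployment would maintain its own list.
SYNONYMS = {"pricey": "expensive", "costly": "expensive"}

# Deliberately crude suffix list; real stemmers use far more careful rules.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase and tokenize, map synonyms to one term, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return [naive_stem(t) for t in tokens]


print(analyze("Shipping was costly"))
```

Because both the indexed text and the query would pass through the same chain, a search for "pricey shipping" would match a comment that says "costly shipping".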

There are also a number of commercial and partner solutions which handle NLP tasks.

We are also looking to build tools for Entity Resolution on Spark, which will add to this.

Re: What is recommended NLP solution on top of HDP stack for Text Analytics

@Ofer Mendelevith Please see this.

Re: What is recommended NLP solution on top of HDP stack for Text Analytics

Explorer

@Simon Elliston Ball is right, there's a huge variety of options for NLP, as there are many niches within natural language processing. Keep in mind that NLP libraries rarely solve business problems directly. Rather, they give you the tools to build a solution. Often this means segmenting free text into chunks suitable for analysis (e.g. sentence boundary disambiguation), annotating free text (e.g. part-of-speech tagging), or converting free text to a more structured form (e.g. vectorization). All of these tools are useful in processing text, but are insufficient by themselves. They help you convert free, unstructured text into a form suitable as input to a normal machine learning or analysis pipeline (e.g. classification). The one exception I can think of is sentiment analysis, which is a valuable analytic in and of itself.

Also, keep in mind that the licenses for some of these libraries are not as permissive as Apache's (e.g. CoreNLP is GPL-licensed, with the option to purchase a license for commercial use).
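As a rough illustration of segmenting and vectorizing, here is a minimal standard-library Python sketch. The function names, the naive sentence-splitting rule, and the sample comments are all assumptions made for the example; a real pipeline would use one of the libraries discussed in this thread instead.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive segmentation: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(docs):
    """Sorted list of every distinct token seen across the documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def vectorize(doc, vocabulary):
    """Bag-of-words vector: one count per vocabulary term."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


# Made-up sample comments, for illustration only.
comments = ["Great service. Will order again!", "Slow service"]
vocabulary = build_vocabulary(comments)
vectors = [vectorize(c, vocabulary) for c in comments]
print(split_sentences(comments[0]))
print(vocabulary)
print(vectors)
```

The resulting count vectors are the kind of structured input a classifier or clustering step can consume, which is the point at which the actual business logic starts.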
