Support Questions

ssanku · ‎07-28-2016

Hi,

I am looking for a reference and demos to show Text & Data mining capabilities on our platform.

I am trying to answer one of the RFP questions.

Any help is highly appreciated.

Thanks,

Sujitha

dzaratsian · ‎07-29-2016

I've placed a few pyspark scripts on my github: https://github.com/zaratsian/pyspark. You can demo/show these projects by copying the note.json link into Zeppelin Hub Viewer.

When working with text / unstructured data, there are a few things to keep in mind:

Cleaning the text is important (remove stopwords, remove punctuation, typically you will want to lowcase/upcase all words, account for stemming, tag the part-of-speech, etc.). Tagging part-of-speech is an advanced option, but can enhance the accuracy if use on the right use case.
Most text analytics projects involve creating a term-document matrix (TFIDF, which is a term frequency, inverse document frequency matrix). This is typically done within spark using the HashingTF function.
From here, you can use the TFIDF vectors and feed them into a clustering algorithm, such as kmeans, LDA, or a really good option would be to use SVD (singular value decomposition).
You could also use the TFIDF matrix paired with structured data and use it within a classification (or regression) algorithm such as Naive Bayes, a Decision Tree model, Random Forest, etc.

This process will help you understand your text by (1) finding data-driven topics using the matrix reduction / clustering techniques or by (2) using the term-document matrix to predict an outcome (probability failure, likelihood to churn, etc.)

You may also want to check out Word2Vec (I have an example in my github).

Hope this helps!

View solution in original post

dzaratsian · ‎07-29-2016

I've placed a few pyspark scripts on my github: https://github.com/zaratsian/pyspark. You can demo/show these projects by copying the note.json link into Zeppelin Hub Viewer.

When working with text / unstructured data, there are a few things to keep in mind:

Cleaning the text is important (remove stopwords, remove punctuation, typically you will want to lowcase/upcase all words, account for stemming, tag the part-of-speech, etc.). Tagging part-of-speech is an advanced option, but can enhance the accuracy if use on the right use case.
Most text analytics projects involve creating a term-document matrix (TFIDF, which is a term frequency, inverse document frequency matrix). This is typically done within spark using the HashingTF function.
From here, you can use the TFIDF vectors and feed them into a clustering algorithm, such as kmeans, LDA, or a really good option would be to use SVD (singular value decomposition).
You could also use the TFIDF matrix paired with structured data and use it within a classification (or regression) algorithm such as Naive Bayes, a Decision Tree model, Random Forest, etc.

This process will help you understand your text by (1) finding data-driven topics using the matrix reduction / clustering techniques or by (2) using the term-document matrix to predict an outcome (probability failure, likelihood to churn, etc.)

You may also want to check out Word2Vec (I have an example in my github).

Hope this helps!

Cloudera Community

Support Questions

Text and Data Mining

Spark Text Analytics - Uncovering Data-Driven Topi...

Extracting data from unstructured logs text from m...

Methodology to apply Data Mining in Big Data

Comparing sql data with text file using Spark.

Spark Mllib - Frequent Pattern Mining - strange r...

Using OpenNLP for Identifying Names From Text

Cannot Insert Data from Text File Format Table to ...

PyCharm and Spark Connect Quickstart in Cloudera D...

How to save dataframe as text file

Turn Text into Attributes