Support Questions

ssanku · ‎07-28-2016

Hi,

I am looking for a reference and demos to show Text & Data mining capabilities on our platform.

I am trying to answer one of the RFP questions.

Any help is highly appreciated.

Thanks,

Sujitha

dzaratsian · ‎07-29-2016

I've placed a few pyspark scripts on my github: https://github.com/zaratsian/pyspark. You can demo/show these projects by copying the note.json link into Zeppelin Hub Viewer.

When working with text / unstructured data, there are a few things to keep in mind:

Cleaning the text is important (remove stopwords, remove punctuation, typically you will want to lowcase/upcase all words, account for stemming, tag the part-of-speech, etc.). Tagging part-of-speech is an advanced option, but can enhance the accuracy if use on the right use case.
Most text analytics projects involve creating a term-document matrix (TFIDF, which is a term frequency, inverse document frequency matrix). This is typically done within spark using the HashingTF function.
From here, you can use the TFIDF vectors and feed them into a clustering algorithm, such as kmeans, LDA, or a really good option would be to use SVD (singular value decomposition).
You could also use the TFIDF matrix paired with structured data and use it within a classification (or regression) algorithm such as Naive Bayes, a Decision Tree model, Random Forest, etc.

This process will help you understand your text by (1) finding data-driven topics using the matrix reduction / clustering techniques or by (2) using the term-document matrix to predict an outcome (probability failure, likelihood to churn, etc.)

You may also want to check out Word2Vec (I have an example in my github).

Hope this helps!

View solution in original post

dzaratsian · ‎07-29-2016

I've placed a few pyspark scripts on my github: https://github.com/zaratsian/pyspark. You can demo/show these projects by copying the note.json link into Zeppelin Hub Viewer.

When working with text / unstructured data, there are a few things to keep in mind:

Cleaning the text is important (remove stopwords, remove punctuation, typically you will want to lowcase/upcase all words, account for stemming, tag the part-of-speech, etc.). Tagging part-of-speech is an advanced option, but can enhance the accuracy if use on the right use case.
Most text analytics projects involve creating a term-document matrix (TFIDF, which is a term frequency, inverse document frequency matrix). This is typically done within spark using the HashingTF function.
From here, you can use the TFIDF vectors and feed them into a clustering algorithm, such as kmeans, LDA, or a really good option would be to use SVD (singular value decomposition).
You could also use the TFIDF matrix paired with structured data and use it within a classification (or regression) algorithm such as Naive Bayes, a Decision Tree model, Random Forest, etc.

This process will help you understand your text by (1) finding data-driven topics using the matrix reduction / clustering techniques or by (2) using the term-document matrix to predict an outcome (probability failure, likelihood to churn, etc.)

You may also want to check out Word2Vec (I have an example in my github).

Hope this helps!

Cloudera Community

Support Questions

Text and Data Mining

Spark Text Analytics - Uncovering Data-Driven Topi...

Dataviz text search filter behaves strangely

Methodology to apply Data Mining in Big Data

Hive table compression: bz2 vs Text vs Orc vs Parq...

Using OpenNLP for Identifying Names From Text

Spark Mllib - Frequent Pattern Mining - strange r...

Counting lines in text files with NiFi - part 1

Spark (PySpark) for ETL to join text files with My...

Enterprise Data Quality at Scale with Spark and Gr...

How to save dataframe as text file