- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Text and Data Mining
Created ‎07-28-2016 10:54 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I am looking for a reference and demos to show Text & Data mining capabilities on our platform.
I am trying to answer one of the RFP questions.
Any help is highly appreciated.
Thanks,
Sujitha
Created ‎07-29-2016 03:10 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I've placed a few pyspark scripts on my github: https://github.com/zaratsian/pyspark. You can demo/show these projects by copying the note.json link into Zeppelin Hub Viewer.
When working with text / unstructured data, there are a few things to keep in mind:
- Cleaning the text is important (remove stopwords, remove punctuation, typically you will want to lowcase/upcase all words, account for stemming, tag the part-of-speech, etc.). Tagging part-of-speech is an advanced option, but can enhance the accuracy if use on the right use case.
- Most text analytics projects involve creating a term-document matrix (TFIDF, which is a term frequency, inverse document frequency matrix). This is typically done within spark using the HashingTF function.
- From here, you can use the TFIDF vectors and feed them into a clustering algorithm, such as kmeans, LDA, or a really good option would be to use SVD (singular value decomposition).
- You could also use the TFIDF matrix paired with structured data and use it within a classification (or regression) algorithm such as Naive Bayes, a Decision Tree model, Random Forest, etc.
This process will help you understand your text by (1) finding data-driven topics using the matrix reduction / clustering techniques or by (2) using the term-document matrix to predict an outcome (probability failure, likelihood to churn, etc.)
You may also want to check out Word2Vec (I have an example in my github).
Hope this helps!
Created ‎07-29-2016 03:10 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I've placed a few pyspark scripts on my github: https://github.com/zaratsian/pyspark. You can demo/show these projects by copying the note.json link into Zeppelin Hub Viewer.
When working with text / unstructured data, there are a few things to keep in mind:
- Cleaning the text is important (remove stopwords, remove punctuation, typically you will want to lowcase/upcase all words, account for stemming, tag the part-of-speech, etc.). Tagging part-of-speech is an advanced option, but can enhance the accuracy if use on the right use case.
- Most text analytics projects involve creating a term-document matrix (TFIDF, which is a term frequency, inverse document frequency matrix). This is typically done within spark using the HashingTF function.
- From here, you can use the TFIDF vectors and feed them into a clustering algorithm, such as kmeans, LDA, or a really good option would be to use SVD (singular value decomposition).
- You could also use the TFIDF matrix paired with structured data and use it within a classification (or regression) algorithm such as Naive Bayes, a Decision Tree model, Random Forest, etc.
This process will help you understand your text by (1) finding data-driven topics using the matrix reduction / clustering techniques or by (2) using the term-document matrix to predict an outcome (probability failure, likelihood to churn, etc.)
You may also want to check out Word2Vec (I have an example in my github).
Hope this helps!
