I've placed a few pyspark scripts on my github: https://github.com/zaratsian/pyspark. You can demo/show these projects by copying the note.json link into Zeppelin Hub Viewer.
When working with text / unstructured data, there are a few things to keep in mind:
- Cleaning the text is important (remove stopwords, remove punctuation, typically you will want to lowcase/upcase all words, account for stemming, tag the part-of-speech, etc.). Tagging part-of-speech is an advanced option, but can enhance the accuracy if use on the right use case.
- Most text analytics projects involve creating a term-document matrix (TFIDF, which is a term frequency, inverse document frequency matrix). This is typically done within spark using the HashingTF function.
- From here, you can use the TFIDF vectors and feed them into a clustering algorithm, such as kmeans, LDA, or a really good option would be to use SVD (singular value decomposition).
- You could also use the TFIDF matrix paired with structured data and use it within a classification (or regression) algorithm such as Naive Bayes, a Decision Tree model, Random Forest, etc.
This process will help you understand your text by (1) finding data-driven topics using the matrix reduction / clustering techniques or by (2) using the term-document matrix to predict an outcome (probability failure, likelihood to churn, etc.)
You may also want to check out Word2Vec (I have an example in my github).
Hope this helps!