Hi, let me briefly describe my problem. The initial task was to try some NLP practices on close-to-real-world problems. We decided to start with a simple classification problem: predicting the genre of a song lyric.
We had a strong requirement to use Java and Spark 2.0.0 with the ML library (not MLlib). The ML library has a limited number of algorithms, so we started with binary classification for 2 genres and a simple pipeline with Word2Vec and Logistic Regression. It showed acceptable results. Then we decided to add one more genre, so we had to switch to some other algorithm, because Logistic Regression in Spark 2.0.0 ML works only for binary problems. We've tried 3 approaches:
1. Bag of words + Naive Bayes - around 82% precision
2. Word2Vec + Logistic Regression + One vs Rest - around 65% precision
3. Word2Vec + MinMaxScaler + Naive Bayes - around 58% precision
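For context on approach 3: Word2Vec vectors contain negative components, while multinomial Naive Bayes expects non-negative feature values, which is why a min-max scaling step sits between them. Below is a minimal plain-Java sketch of that rescaling. It is an illustration only, not the Spark MinMaxScaler API; the class name `MinMaxSketch`, the method `scale`, and the choice to map a constant column to 0.5 are assumptions of this sketch.

```java
// Plain-Java illustration only; this is NOT the Spark MinMaxScaler API.
// Class and method names are hypothetical.
class MinMaxSketch {

    /** Rescales every column of {@code rows} to the range [0, 1]. */
    static double[][] scale(double[][] rows) {
        int dims = rows[0].length;
        double[] min = new double[dims];
        double[] max = new double[dims];
        for (int d = 0; d < dims; d++) {
            min[d] = Double.POSITIVE_INFINITY;
            max[d] = Double.NEGATIVE_INFINITY;
        }
        // First pass: per-column minimum and maximum.
        for (double[] row : rows) {
            for (int d = 0; d < dims; d++) {
                min[d] = Math.min(min[d], row[d]);
                max[d] = Math.max(max[d], row[d]);
            }
        }
        // Second pass: (x - min) / (max - min).
        // Assumption of this sketch: a constant column maps to 0.5.
        double[][] out = new double[rows.length][dims];
        for (int i = 0; i < rows.length; i++) {
            for (int d = 0; d < dims; d++) {
                double range = max[d] - min[d];
                out[i][d] = range == 0.0 ? 0.5 : (rows[i][d] - min[d]) / range;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy 2-dimensional "document vectors" with a negative component.
        double[][] vectors = { {-1.0, 2.0}, {1.0, 2.0}, {0.0, 2.0} };
        for (double[] row : scale(vectors)) {
            System.out.println(row[0] + " " + row[1]);
        }
    }
}
```

After scaling, every value is non-negative, but each dimension is stretched independently, so distances between the original vectors are not preserved.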
The first approach showed good results, but our concern is that Bag of Words is a bit dated and doesn't handle similar words well. So we're still searching for a better solution based on Word2Vec.
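To make the "similar words" concern concrete, here is a small plain-Java sketch comparing cosine similarity for two words under a bag-of-words encoding and under dense embeddings. The dense vectors are made-up toy values, not output from a trained Word2Vec model, and the class name `SimilaritySketch` is hypothetical.

```java
// Plain-Java illustration; the dense vectors below are made-up toy values,
// NOT output from a trained Word2Vec model.
class SimilaritySketch {

    /** Cosine similarity of two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Bag of words over the vocabulary {happy, glad}:
        // each word gets its own axis, so the vectors are orthogonal.
        double[] bowHappy = {1.0, 0.0};
        double[] bowGlad  = {0.0, 1.0};

        // Hypothetical dense embeddings where related words point
        // in a similar direction.
        double[] embHappy = {0.8, 0.6};
        double[] embGlad  = {0.6, 0.8};

        System.out.println("bag of words: " + cosine(bowHappy, bowGlad));
        System.out.println("embeddings:   " + cosine(embHappy, embGlad));
    }
}
```

In the bag-of-words case the similarity is zero no matter how related the words are, whereas the toy embeddings score close to 1.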
We thought about trying Decision Tree or Random Forest, but we're not sure how well they will perform with large vectors (100, 200, 300 dimensions) and large datasets. I read that tree-based approaches work best when you have a small number of features.
Maybe you could recommend some other approach based on your experience? Any help is much appreciated. I have to remind you that we're strongly tied to Spark 2.0.0 + the ML library due to our DevOps infrastructure.