Created 06-22-2017 09:00 AM
Hi,
I am using the Spark ML classifiers to build a model that predicts whether a user will buy a product or not, using both continuous and categorical features. The model predicts the outcome with reasonable accuracy. Is there a way to see how much each feature contributes to the prediction, i.e., the weight of each feature in the model learned from the training data? I am using Spark 2.0.0 and all of the classifiers available in it.
Thanks in advance
Created 06-27-2017 09:39 PM
@Rajasekaran Dhanasekaran How to do this varies a lot from model to model. For example, RandomForestClassificationModel (like RandomForestRegressionModel) exposes .featureImportances, whereas LogisticRegressionModel only gives you the raw coefficient vector .coefficients (called .weights in the old spark.mllib API).
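For instance, here's a minimal sketch against the Spark 2.0 ML Scala API. The DataFrame `train` and the "features"/"label" column names are placeholders standing in for your own pipeline:

```scala
import org.apache.spark.ml.classification.{LogisticRegression, RandomForestClassifier}

// `train` is assumed to be a DataFrame with a vector column "features"
// (e.g. built with VectorAssembler) and a numeric "label" column.
val rfModel = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .fit(train)
// One importance per feature, normalized to sum to 1.
println(rfModel.featureImportances)

val lrModel = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .fit(train)
// Raw coefficients; magnitudes are only comparable across features
// if the features are on the same scale.
println(lrModel.coefficients)
```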
To make things worse, there are fairly specific conditions under which model coefficients can be interpreted at all: if any features are correlated, multicollinearity arises, and a coefficient can come out larger or smaller, significant or non-significant, depending on the feature it's correlated with. So the coefficients aren't necessarily meaningful unless you've ruled multicollinearity out, and since a model can still predict well when it's present, you have to be careful. If you're confident you've eliminated such correlations, then in regression settings the coefficients can usually be interpreted fairly directly. Since it sounds like you're doing binary classification (buy or not buy), you'll want to look into how to interpret logistic regression coefficients.
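As a quick illustration of that interpretation: in binary logistic regression, a coefficient beta is the change in the log-odds of the positive class for a one-unit increase in that feature, so exp(beta) is the corresponding odds ratio. A sketch, reusing the `lrModel` from above with made-up feature names:

```scala
import scala.math.exp

// Hypothetical feature names, in the same order the features were
// assembled into the feature vector.
val featureNames = Array("age", "income", "pagesViewed")

featureNames.zip(lrModel.coefficients.toArray).foreach { case (name, beta) =>
  // exp(beta): multiplicative change in the odds of buying for a
  // one-unit increase in this feature, holding the others fixed.
  println(f"$name%-12s beta = $beta%+.4f  odds ratio = ${exp(beta)}%.4f")
}
```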
If you're throwing hundreds or thousands of features at your problem, interpreting individual coefficients isn't going to work well anyway. In that case, if you'd like to better understand how your features explain your data, I'd recommend looking at a feature selector like ChiSqSelector or a dimensionality-reduction technique such as PCA. You can use these at development time to answer questions like "which features best explain my data?" more directly.
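For example, a minimal ChiSqSelector sketch (same assumed `train` DataFrame and column names as above; note the chi-squared test expects categorical features):

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// Keep the 10 features with the strongest chi-squared association
// with the label.
val selector = new ChiSqSelector()
  .setNumTopFeatures(10)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val selectorModel = selector.fit(train)
// Indices (into the original feature vector) of the selected features.
println(selectorModel.selectedFeatures.mkString(", "))
val reduced = selectorModel.transform(train)
```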