Is there a way to find the weights of every featur...

Created 06-22-2017 09:00 AM

Hi,

I am using a Spark ML classifier to build a model that predicts whether a user will buy a product, using both continuous and categorical features. The model predicts the outcome with some accuracy. Is there a way to see how much each feature contributes to the prediction, i.e., the weight of each feature in the model created from the training data? I am using Spark 2.0.0 and the classifiers available in it.

Thanks in Advance

1 ACCEPTED SOLUTION

Guru

Created 06-27-2017 09:39 PM

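@Rajasekaran Dhanasekaran How to do this is going to vary a lot from model to model. So, e.g., RandomForestRegressionModel has .featureImportances, whereas LogisticRegressionModel only exposes its feature weights. In the spark.ml API of Spark 2.0 that attribute is .coefficients; .weights is the name used in older releases and in the RDD-based spark.mllib model.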
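
As a rough illustration of that difference, the helper below reads whichever attribute a fitted PySpark model exposes and pairs the values with feature names. The function name `ranked_feature_weights` is made up for this sketch and is not part of Spark. It assumes `feature_names` lists the slots of the assembled feature vector in order (after one-hot encoding this is longer than the list of original columns) and that the model is either a tree ensemble or a binary linear model.

```python
def ranked_feature_weights(model, feature_names):
    """Pair each feature name with the model's weight, largest magnitude first.

    `model` is assumed to be a fitted pyspark.ml model. Tree ensembles expose
    `featureImportances`; linear models such as LogisticRegressionModel expose
    `coefficients`. Both are Spark vectors with a `toArray()` method.
    """
    if hasattr(model, "featureImportances"):
        vector = model.featureImportances
    elif hasattr(model, "coefficients"):
        vector = model.coefficients
    else:
        raise TypeError("model exposes neither featureImportances nor coefficients")

    values = [float(v) for v in vector.toArray()]
    if len(values) != len(feature_names):
        raise ValueError("feature_names must match the assembled feature vector")

    # Sort by absolute value so large negative coefficients are not buried.
    return sorted(zip(feature_names, values), key=lambda p: abs(p[1]), reverse=True)
```

Importances and coefficients are not on the same scale, so a ranking from this helper is only comparable within a single model.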

To make things worse, model coefficients can only be interpreted under very specific conditions. If any features are correlated, multicollinearity arises, and a coefficient might be bigger or smaller, significant or non-significant, depending on the feature it is correlated with. So the coefficients are not necessarily meaningful unless you ensure there's no multicollinearity. A model can still perform well overall even when this is the case, so good accuracy doesn't tell you the coefficients can be trusted, and you have to be careful. If you're certain that you've eliminated such correlations, then in regression settings the model coefficients can usually be interpreted somewhat directly. Since it sounds like you're doing binary classification (buy or not buy), you'll want to look into "how to interpret logistic regression coefficients".
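
A minimal sketch of both checks in plain Python, outside Spark: flag strongly correlated feature pairs before reading anything into the coefficients, then convert logistic regression coefficients to odds ratios. The function names, the 0.9 threshold, and the dict-of-columns input are assumptions made for illustration. Pairwise correlation also only catches the simplest form of multicollinearity.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


def odds_ratios(coefficients):
    """Map each logistic regression coefficient beta to exp(beta).

    exp(beta) is the multiplicative change in the odds of the positive class
    for a one-unit increase in that feature, holding the others fixed.
    """
    return {name: math.exp(beta) for name, beta in coefficients.items()}
```

An odds ratio above 1 means the feature raises the odds of "buy", below 1 means it lowers them. The size of the ratio depends on the feature's units, so coefficients are only comparable across features that have been standardized.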

If you're throwing hundreds or thousands of features at your problem, interpreting the coefficients isn't going to work well though. So, if you'd like to better understand how your features explain your data, I'd recommend looking at things like the ChiSqSelector feature selector or dimensionality reduction techniques such as PCA. You can use these at development time to more perspicuously answer questions like "what features best explain my data?".
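
ChiSqSelector ranks categorical features using a chi-squared test of independence against the label. The statistic behind that test can be sketched in plain Python as below. This is not Spark's implementation, and the function names and the dict-of-columns input are assumptions made for illustration.

```python
from collections import Counter


def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic for one categorical feature vs. the label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)

    statistic = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            # Expected cell count if the feature and label were independent.
            expected = f_count * l_count / n
            statistic += (observed.get((f, l), 0) - expected) ** 2 / expected
    return statistic


def rank_by_chi_squared(columns, labels):
    """Return feature names ordered from most to least label-dependent."""
    scores = {name: chi_squared(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A statistic near zero suggests the feature tells you little about the label. Raw statistics are only directly comparable between features with the same number of categories, since the degrees of freedom differ otherwise.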
