## Spark ML confidence interval

Guru
Created 06-09-2016 10:49 PM

When running a Spark ML model, the precision of the model was low, around 65%. Could this be due to the small training set we are working with? Does the size of the training data influence the confidence interval?
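
For reference, a minimal sketch of how a precision figure like the 65% above is usually computed with the Spark ML DataFrame API. It assumes Spark 2.x, a hypothetical `data` DataFrame that already has `label` and `features` columns, and logistic regression as a stand-in classifier; adapt it to your own pipeline.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Hypothetical input: `data` with "label" and "features" columns
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

val model = new LogisticRegression().fit(train)
val predictions = model.transform(test)

// "weightedPrecision" is the Spark 2.x metric name; older releases used "precision"
val precision = new MulticlassClassificationEvaluator()
  .setMetricName("weightedPrecision")
  .evaluate(predictions)

println(s"held-out precision = $precision")
```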


2 REPLIES

## Re: Spark ML confidence interval

Rising Star
Created 06-10-2016 07:12 AM

Yes, of course: the more data you have, the more precision you gain, because the algorithm has more data to learn from. That said, a small data set cannot be the only issue, but with very little data you cannot expect a good model.
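
One way to check whether the small training set really is the bottleneck is a simple learning curve: train on growing fractions of the data and watch held-out precision. The sketch below assumes Spark 2.x, hypothetical `train`/`test` DataFrames with `label`/`features` columns, and logistic regression as a placeholder model.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("weightedPrecision")

// If precision is still climbing at fraction 1.0, more data would likely help;
// if the curve has already flattened out, the problem is probably elsewhere.
for (fraction <- Seq(0.1, 0.25, 0.5, 0.75, 1.0)) {
  val sample = train.sample(withReplacement = false, fraction, seed = 42)
  val model = new LogisticRegression().fit(sample)
  val precision = evaluator.evaluate(model.transform(test))
  println(f"train fraction = $fraction%.2f -> test precision = $precision%.3f")
}
```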


## Re: Spark ML confidence interval

Guru

Created 06-10-2016 08:22 AM


:-) It is complicated. Confidence per se has nothing to do with the amount of training data, although volume does play a part. A small set of high-quality, well-sampled data is often better than billions of rows.

It essentially depends on the complexity of the function you want to predict. If the underlying function is essentially linear (crime decreasing with age, for example), you only need a couple of data points to predict it. More may result in something called overtraining, because the algorithm can try to fit the data with a much more complex (and incorrect) function.
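
A toy illustration of that point (hypothetical code, assuming Spark 2.x and a SparkSession `spark` in scope, as in spark-shell): on a small, essentially linear data set, a plain linear fit will usually generalize better than a needlessly complex degree-8 polynomial fit of the same feature.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression

// Hypothetical data: y ~ 2x plus noise, i.e. an essentially linear function
val rnd = new scala.util.Random(7)
val rows = (1 to 30).map { i =>
  val x = i / 10.0
  (2.0 * x + rnd.nextGaussian(), Vectors.dense(x))
}
val df = spark.createDataFrame(rows).toDF("label", "features")
val Array(train, test) = df.randomSplit(Array(0.7, 0.3), seed = 42)

val evaluator = new RegressionEvaluator().setMetricName("rmse")

// Simple hypothesis: a plain linear fit
val linRmse = evaluator.evaluate(new LinearRegression().fit(train).transform(test))

// Much more complex hypothesis: degree-8 polynomial expansion of the same feature
val poly = new PolynomialExpansion()
  .setInputCol("features").setOutputCol("polyFeatures").setDegree(8)
val polyLr = new LinearRegression().setFeaturesCol("polyFeatures")
val polyRmse = evaluator.evaluate(
  polyLr.fit(poly.transform(train)).transform(poly.transform(test)))

// The complex model tends to chase the noise and score a worse test RMSE
println(s"test RMSE: linear = $linRmse, degree-8 polynomial = $polyRmse")
```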

The main reason for low confidence is noise in your data or bad attributes. If all data points closely follow the underlying function, confidence will be high; if they are only loosely associated with the attributes you provide, the model will not fit the data points well and confidence will be low.
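
If you suspect weak attributes, one quick check (a hypothetical sketch, assuming a `train` DataFrame with `label`/`features` columns) is to fit a random forest and inspect its per-feature importances; attributes with near-zero importance carry little signal about the target.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Hypothetical: `train` has "label" and "features" columns
val rfModel = new RandomForestClassifier().setNumTrees(100).fit(train)

// A vector with one weight per feature, summing to 1.0;
// near-zero entries point at attributes that barely help the prediction.
println(rfModel.featureImportances)
```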

It's a really complex topic, and the following class was invaluable for understanding it. It's free as well.

http://online.stanford.edu/course/statistical-learning-Winter-16

To answer your question: increasing the amount of data CAN help, but it doesn't have to. Other reasons for low confidence can include:

- too much noise
- bad attributes that don't really predict the target
- too much data for simple underlying models (as counterproductive as it sounds)

There are also other problems, like overtraining, which will result in very high confidence but wrong results. Some algorithms, like associations and clustering (for outlier detection), benefit hugely from large data sets. But for classic classification/regression models you ideally have an amount of randomly sampled data in proportion to the complexity of the function you want to predict; more doesn't help much.
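
When the data set is small, cross-validation is the usual way to pick a model complexity that generalizes instead of overtraining. A sketch with the Spark ML tuning API, assuming Spark 2.x and a hypothetical `train` DataFrame with `label`/`features` columns:

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val dt = new DecisionTreeClassifier()

// Candidate complexities: shallow trees are simpler hypotheses, deep trees more complex
val grid = new ParamGridBuilder()
  .addGrid(dt.maxDepth, Array(2, 5, 10))
  .build()

val cv = new CrossValidator()
  .setEstimator(dt)
  .setEstimatorParamMaps(grid)
  .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("f1"))
  .setNumFolds(5)

// Picks the depth that does best on held-out folds rather than on the training data
val cvModel = cv.fit(train)
```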
