Spark ML confidence interval

When running a Spark ML model, the precision of the model was low, around 65%. Could this be due to the small training set we are working with? Does the size of the training data influence the confidence interval?


Re: Spark ML confidence interval


Yes, in general the more data you have, the more precision you gain, because the algorithm has more examples to learn from. That said, this is probably not the only issue, but with very little data you can't expect to build a good model.
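
If you want to test that on your own data, a simple learning curve can help: train the same model on growing fractions of the training set and see how precision on a fixed held-out test set changes. A minimal sketch, assuming labeled data in libsvm format and a plain logistic regression as a placeholder model (the path, model choice and column names are assumptions, not something from your setup):

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val spark = SparkSession.builder().appName("LearningCurve").getOrCreate()

// Hypothetical path; any DataFrame with "features" and "label" columns works.
val labeled = spark.read.format("libsvm").load("data/training.libsvm")
val Array(trainFull, test) = labeled.randomSplit(Array(0.8, 0.2), seed = 42L)

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("weightedPrecision")

// Train the same model on growing fractions of the training data,
// always measuring precision on the same held-out test set.
Seq(0.1, 0.25, 0.5, 1.0).foreach { fraction =>
  val sample = trainFull.sample(withReplacement = false, fraction = fraction, seed = 42L)
  val model = new LogisticRegression().setMaxIter(100).fit(sample)
  val precision = evaluator.evaluate(model.transform(test))
  println(f"train fraction $fraction%.2f -> weighted precision $precision%.3f")
}

If the curve is still climbing steeply at the full fraction, more data will probably help; if it has flattened out, the bottleneck is likely somewhere else.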

Re: Spark ML confidence interval

:-) It is complicated. Confidence per se has little to do with the amount of training data, although the amount does play a part. A small set of high-quality, well-sampled data is often better than billions of rows.

It essentially depends on the complexity of the function you want to predict. If the underlying function is essentially linear (crime decreasing with age, for example), you only need a couple of data points to predict it. More data may lead to something called overtraining, because the algorithm may try to fit the points with a much more complex, and incorrect, function.
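
One way to see this effect is to compare a simple model with a much more flexible one on the same small training set and look at the gap between training and test precision. A rough sketch, using decision trees of different depths purely as stand-ins (the data path and columns are again assumptions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val spark = SparkSession.builder().appName("ModelComplexity").getOrCreate()

// Hypothetical labeled data with "features" and "label" columns.
val labeled = spark.read.format("libsvm").load("data/training.libsvm")
val Array(train, test) = labeled.randomSplit(Array(0.8, 0.2), seed = 42L)

val evaluator = new MulticlassClassificationEvaluator().setMetricName("weightedPrecision")

// A shallow tree stands in for a simple hypothesis, a deep tree for a complex one.
Seq(2, 20).foreach { depth =>
  val model = new DecisionTreeClassifier().setMaxDepth(depth).fit(train)
  val onTrain = evaluator.evaluate(model.transform(train))
  val onTest  = evaluator.evaluate(model.transform(test))
  // A big gap between train and test precision for the deep tree is a sign of overtraining.
  println(f"maxDepth=$depth%2d  train=$onTrain%.3f  test=$onTest%.3f")
}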

The main reasons for low confidence are noise in your data and bad attributes. If all data points closely follow the underlying function, confidence will be high; if they are only loosely associated with the attributes you provide, the model will not fit the data points well and confidence will be low.
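
A cheap way to spot attributes that don't really predict the target is to look at the feature importances of a tree ensemble. It is only a rough signal, but attributes with near-zero importance are worth a second look. A sketch, reusing the "train" DataFrame from the snippet above:

import org.apache.spark.ml.classification.RandomForestClassifier

// Reusing the "train" DataFrame from the previous snippet.
val rf = new RandomForestClassifier().setNumTrees(100).fit(train)

// Importances near zero suggest attributes that contribute little to predicting the label.
rf.featureImportances.toArray.zipWithIndex
  .sortBy { case (importance, _) => -importance }
  .foreach { case (importance, idx) => println(f"feature $idx%3d  importance=$importance%.4f") }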

It's a really complex topic, and the following class was invaluable for understanding it. It's free as well.

http://online.stanford.edu/course/statistical-learning-Winter-16

To answer your question: increasing the amount of data CAN help, but it doesn't have to. Other reasons for low confidence include the following (a cross-validation sketch for getting a more stable read on your metric follows the list):

- too much noise

- bad attributes that don't really predict the target

- too much data for a simple underlying model (as counterproductive as that sounds)
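
Before blaming any of these, it is also worth checking that the 65% figure itself is trustworthy: a single train/test split on a small data set can be quite noisy. A k-fold cross-validation sketch (logistic regression and the regularization grid are just placeholders, and "train" is the DataFrame from the earlier snippets):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Reusing the "train" DataFrame from the earlier snippets.
val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0, 0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("weightedPrecision"))
  .setEstimatorParamMaps(grid)
  .setNumFolds(5)

val cvModel = cv.fit(train)

// Average precision per parameter setting across the five folds, much more stable than one split.
cvModel.avgMetrics.zip(grid).foreach { case (metric, params) =>
  println(f"precision=$metric%.3f for $params")
}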

There are also other problems, like overtraining, which will result in very high confidence on the training data but wrong results on new data. Some algorithms, like association rules and clustering (for outlier detection), benefit hugely from large data sets. But for classic classification/regression models you ideally want a good amount of randomly sampled data, in proportion to the complexity of the function you want to predict. More than that doesn't help much.
