Support Questions

Find answers, ask questions, and share your expertise


Spark Machine Learning for performance prediction


Solved


Expert Contributor

Created 06-26-2017 01:42 PM


Hi,

I am totally new to Spark ML. I capture batch processing information for Spark Streaming and write it to a file. I record the following fields per batch:

(FYI: each batch in Spark Streaming is a jobset, i.e., a set of jobs.)

- BatchTime
- BatchStarted
- FirstJobStartTime
- LastJobCompletionTime
- FirstJobSchedulingDelay
- TotalJobProcessingTime (time to process all jobs in a batch)
- NumberOfRecords
- SubmissionTime
- TotalDelay (total execution time for a batch, from the time it is submitted through scheduling and processing)

Let's say I want to predict what the **total delay** will be when the number of records in a batch is X. Can anyone suggest which machine learning algorithm would be applicable in this scenario (linear regression, classification, etc.)?

Of course, the most important parameters would be scheduling delay, total delay, number of records, and job processing time.

Thanks

1 ACCEPTED SOLUTION


Guru

Created 06-26-2017 02:10 PM


@Arsalan Siddiqi The standard answer for delay modeling is to model the delay times using an exponential distribution. There's an analytical Bayesian solution (i.e., no MCMC) to this, or you can use GeneralizedLinearRegression from MLlib with the "gamma" family (since the exponential is a special case of the gamma with alpha = 1).

There's probably an alternative way to think about the problem in terms of the number of delayed batches, which could be analyzed with Poisson models. Without knowing more about the goals it's hard to say which makes more sense.

You can always squint at something and turn it into a classification problem (different classes or buckets for different ranges of data; i.e., quantizing), but there is an order to things like delays and total counts, and your standard classification regimes don't take this into account (there is an ordered logit, but AFAICT MLlib doesn't have this out of the box). That said, such an approach often works (what counts as adequate performance is an empirical matter); so if the aforementioned approach is beyond your current reach, classification could be acceptable since for a business application (vs. scientific inquiry) it's as important to understand what you've done as it is to be correct.
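As a toy illustration of the quantizing idea above (the bucket edges and class labels here are made up purely for the example; real buckets would come from your SLA thresholds or the observed delay distribution):

```python
import bisect

# Hypothetical delay buckets (seconds) and ordered class labels.
EDGES = [0.5, 1.0, 2.0, 5.0]
LABELS = ["tiny", "short", "medium", "long", "very_long"]

def delay_class(delay_seconds):
    """Quantize a continuous delay into an ordered class label."""
    return LABELS[bisect.bisect_right(EDGES, delay_seconds)]

print(delay_class(0.3))   # tiny
print(delay_class(1.5))   # medium
print(delay_class(7.0))   # very_long
```

Note that the resulting classes are ordered, which is exactly the information a vanilla multiclass classifier will ignore.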

Note: For these regression approaches using non-gaussian error distributions, you will likely need to transform the fitted parameters for them to be interpretable.
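To make the "analytical Bayesian solution" above concrete: with an exponential likelihood for the delay times and a gamma prior on the rate, the posterior is again a gamma and the update is closed-form. A minimal pure-Python sketch (the prior values alpha0, beta0 are placeholders, not a recommendation):

```python
def gamma_exponential_posterior(delays, alpha0=1.0, beta0=1.0):
    """Conjugate update: Gamma(alpha0, beta0) prior on the exponential
    rate lambda, updated with observed delay times.

    Posterior is Gamma(alpha0 + n, beta0 + sum(delays))."""
    n = len(delays)
    alpha_post = alpha0 + n           # shape grows with the sample count
    beta_post = beta0 + sum(delays)   # rate grows with the total observed delay
    lam_mean = alpha_post / beta_post # posterior mean of the rate lambda
    return alpha_post, beta_post, lam_mean

# Example: observed batch delays in seconds (made-up numbers)
delays = [1.2, 0.8, 2.5, 1.1, 0.9]
a, b, lam = gamma_exponential_posterior(delays)
print(a, b, round(1.0 / lam, 3))  # posterior params and the implied mean delay
```

No MCMC needed: each new batch's delay just increments the two posterior parameters, which also makes this easy to maintain incrementally as streaming batches complete.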

2 REPLIES 2

Re: Spark Machine Learning for performance prediction

Expert Contributor

Created 06-26-2017 02:43 PM


@jfrazee Thanks for the reply. I am using Spark Streaming, which processes data in batches. I want to know how long it takes to process a batch for a given application (keeping factors like the number of nodes in the cluster constant) at a given data rate (records/batch). I eventually want to check against an SLA, to make sure the end-to-end delay still falls within it, so I want to gather historic data from application runs and predict the time to process a batch. Before starting a new batch, you could then already predict whether it would violate the SLA. I will have a look into your suggestions.
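One minimal way to sketch the "predict before the batch starts" check described above is an ordinary least-squares fit of total delay against records per batch. This is pure Python with made-up history data, just to show the shape of the idea; a real setup would fit a model (e.g., MLlib's GeneralizedLinearRegression) on the captured batch logs:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (pure Python)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def violates_sla(records, a, b, sla_seconds):
    """Predict the delay for a planned batch size and flag an SLA breach."""
    return a + b * records > sla_seconds

# Hypothetical history: (records per batch, total delay in seconds)
history = [(100, 1.0), (200, 1.9), (300, 3.1), (400, 4.0)]
xs, ys = zip(*history)
a, b = fit_line(xs, ys)
print(violates_sla(500, a, b, sla_seconds=4.5))  # True: predicted ~5.05s
```

The check runs on the predicted batch size before the batch is submitted, which matches the goal of catching an SLA violation ahead of time rather than observing it afterwards.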

Thanks
