
Spark Machine Learning for performance prediction

Expert Contributor

Hi

I am totally new to SparkML. I capture batch processing information for Spark Streaming and write it to a file. I capture the following information per batch:

(FYI: each batch in Spark is a jobset, i.e., a set of jobs.)

BatchTime

BatchStarted

FirstJobStartTime

LastJobCompletionTime

FirstJobSchedulingDelay

TotalJobProcessingTime (time to process all jobs in a batch)

NumberOfRecords

SubmissionTime

TotalDelay (total execution time for a batch, from submission through scheduling and processing)

Let's say I want to predict what the total delay will be when the number of records in a batch is X. Can anyone suggest which machine learning algorithm is applicable in this scenario (linear regression, classification, etc.)?

Of course, the most important parameters would be scheduling delay, total delay, number of records, and job processing time.

Thanks


2 REPLIES


@Arsalan Siddiqi The standard answer for delay modeling is to model the delay times with an exponential distribution. There's an analytical Bayesian solution (i.e., no MCMC) to this, or you can use GeneralizedLinearRegression from MLlib with the "gamma" family (the exponential is a special case of the gamma with shape alpha = 1).
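
For concreteness, here's a rough sketch of the GLM route in Scala (Spark 2.x, run in spark-shell where `spark` is the SparkSession). The column names are the ones you listed; the file path, CSV format, and the log link are my assumptions:

```scala
// Fit TotalDelay as a gamma GLM of NumberOfRecords.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val batches = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/batch_metrics.csv")   // hypothetical path and format

// MLlib expects the predictors assembled into a single vector column.
val features = new VectorAssembler()
  .setInputCols(Array("NumberOfRecords"))
  .setOutputCol("features")
  .transform(batches)
  .filter("TotalDelay > 0")            // the gamma family requires a positive label

val glr = new GeneralizedLinearRegression()
  .setFamily("gamma")                  // exponential = gamma with shape alpha = 1
  .setLink("log")                      // log link keeps predicted delays positive
  .setLabelCol("TotalDelay")
  .setFeaturesCol("features")

val model = glr.fit(features)
println(s"coefficients=${model.coefficients} intercept=${model.intercept}")
```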

There's probably an alternative way to think about the problem in terms of the number of delayed batches, which could be analyzed with a Poisson model. Without knowing more about the goals it's hard to say which makes more sense.
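
If you went the count route, the same estimator handles it; a sketch, assuming you first aggregated a per-window count of late batches into a DataFrame (nothing you listed captures that directly, so `windowedBatches` and its `delayedBatches` label are hypothetical):

```scala
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// "windowedBatches" is an assumed DataFrame with a "features" vector column
// and a "delayedBatches" count of late batches per time window.
val poisson = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setLabelCol("delayedBatches")
  .setFeaturesCol("features")

val countModel = poisson.fit(windowedBatches)
```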

You can always squint at something and turn it into a classification problem (different classes or buckets for different ranges of data, i.e., quantizing), but there is an order to things like delays and total counts, and standard classification regimes don't take that into account (there is an ordered logit, but AFAICT MLlib doesn't have it out of the box). That said, such an approach often works (what counts as adequate performance is an empirical matter), so if the aforementioned approaches are beyond your current reach, classification could be acceptable: for a business application (vs. scientific inquiry), it's as important to understand what you've done as it is to be correct.
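
To make the quantizing idea concrete, a sketch that buckets TotalDelay into quartiles and classifies on top of that, reusing the `features` DataFrame from the first sketch (the bucket count is an arbitrary choice):

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.ml.classification.LogisticRegression

// Discretize the continuous delay into 4 quantile-based buckets (0.0 .. 3.0).
val bucketed = new QuantileDiscretizer()
  .setInputCol("TotalDelay")
  .setOutputCol("delayBucket")
  .setNumBuckets(4)
  .fit(features)
  .transform(features)

// Multinomial logistic regression over the delay buckets.
val clfModel = new LogisticRegression()
  .setFamily("multinomial")
  .setLabelCol("delayBucket")
  .setFeaturesCol("features")
  .fit(bucketed)
```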

Note: for these regression approaches with non-Gaussian error distributions, you will likely need to transform the fitted parameters for them to be interpretable.
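
For example, with the log link used in the sketch above, the raw coefficients live on the log scale:

```scala
// Exponentiate a log-link coefficient to read it as a multiplicative
// effect on the expected delay per unit increase in the predictor.
val perRecordMultiplier = math.exp(model.coefficients(0))
println(f"each extra record multiplies expected delay by $perRecordMultiplier%.6f")
```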

Expert Contributor

@jfrazee Thanks for the reply. I am using Spark Streaming, which processes data in batches. I want to know how long it takes to process a batch for a given application (keeping factors like the number of nodes in the cluster constant) at a given data rate (records/batch). I eventually want to check an SLA to make sure that the end-to-end delay still falls within it, so I want to gather historic data from application runs and predict the time to process a batch. Before starting a new batch, you could then already predict whether it would violate the SLA. I will have a look at your suggestions.
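
Something like this rough pre-batch check is what I have in mind (the SLA bound and record count are made-up numbers, and `model` is the fitted GLM from your sketch):

```scala
import org.apache.spark.ml.linalg.Vectors

val slaMillis = 2000.0                      // hypothetical SLA bound
val upcoming = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(50000.0))            // expected NumberOfRecords for the next batch
)).toDF("features")

// Score the upcoming batch and compare the predicted delay to the SLA.
val predictedDelay = model.transform(upcoming)
  .select("prediction").head.getDouble(0)
if (predictedDelay > slaMillis)
  println(s"predicted delay $predictedDelay ms would violate the $slaMillis ms SLA")
```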

Thanks