
How to find the best StepSize in a Spark ML LinearRegression Model

Expert Contributor

After playing with the Spark 1.6 LinearRegression model, I found it is very sensitive to the stepSize parameter. What is the best practice for tuning it? The mean squared error of the model I build varies greatly depending on this input.

// Building the model
import java.text.NumberFormat
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val numIterations = 30
val stepSize = 0.0001
val linearModel = LinearRegressionWithSGD.train(trainingDataRDD, numIterations, stepSize)

// Evaluate the model on the training examples and compute the training error
val valuesAndPreds = trainingDataRDD.map { point =>
  val prediction = linearModel.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("training Mean Squared Error = " + NumberFormat.getInstance().format(MSE))
1 ACCEPTED SOLUTION

First, I'm assuming you are essentially following the MLlib examples here: https://spark.apache.org/docs/latest/mllib-linear-methods.html

stepSize is a hyper-parameter, i.e. an input chosen by the data scientist rather than learned from the data; it is not an aspect of the dataset itself. To select the best hyper-parameter values, you can perform a grid search: take a list of candidate stepSize values, for instance stepSizeList = {0.1, 0.001, 0.000001}, and cycle through each one to see which yields the best model. Here is an article describing hyper-parameter tuning and grid search:

http://blog.dato.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning

Quote: "For regularization parameters, it’s common to use exponential scale: 1e-5, 1e-4, 1e-3, … 1. Some guess work is necessary to specify the minimum and maximum values."


3 REPLIES



Hi @Kirk Haslbeck,

I'd like to add some information to Paul's excellent answer.

First, tuning ML parameters is one of the hardest tasks for a data scientist and an active research area. In your particular case (LinearRegressionWithSGD), stepSize is one of the hardest parameters to tune, as stated on the MLlib optimization page (https://spark.apache.org/docs/latest/mllib-optimization.html):

Step-size. The parameter γ is the step-size, which in the default implementation is chosen decreasing with the square root of the iteration counter, i.e. γ := s/√t in the t-th iteration, with the input parameter s = stepSize. Note that selecting the best step-size for SGD methods can often be delicate in practice and is a topic of active research.
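To make that schedule concrete, here is a small illustration, assuming the stepSize from the question (0.0001), of how the effective step size decays over the first few iterations:

// MLlib's default SGD schedule: the effective step size in iteration t is s / sqrt(t)
val s = 0.0001                                   // the stepSize input parameter
val gammas = (1 to 5).map(t => s / math.sqrt(t))
// gammas ≈ Vector(1.0e-4, 7.07e-5, 5.77e-5, 5.0e-5, 4.47e-5): the steps shrink over time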

In a general ML problem, you want to build a data pipeline that combines several data transformations, to clean the data and build features, with one or more algorithms, in order to achieve the best performance. This is a repetitive task where you try several options for each step. You also want to test several parameter values and choose the best ones. Each candidate combination of algorithms and parameters in your pipeline then needs to be evaluated, for instance with cross-validation.

Testing these combinations manually is hard and time-consuming. The spark.ml package helps make this process fluent. Spark.ml uses concepts such as transformers, estimators, and params. Params let you automatically test several values for a parameter and choose the one that gives you the best model: you provide a ParamGridBuilder with the different values you want to consider for each param in your pipeline. In your case, an example could be:

val lr = new LinearRegressionWithSGD()
  .setNumIterations(30)

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.stepSize, Array(0.1, 0.01))
  .build()

Even if your ML problem is simple, I highly recommend looking into the spark.ml library. It can reduce your development time considerably.
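As a rough sketch of that spark.ml workflow: LinearRegressionWithSGD belongs to the older RDD-based mllib API and does not expose spark.ml params, so this example substitutes org.apache.spark.ml.regression.LinearRegression, which does. Since that estimator is not SGD-based, it has no stepSize param, so the grid below tunes regParam instead. A DataFrame trainingDF with "features" and "label" columns is assumed:

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// A spark.ml estimator exposes params that ParamGridBuilder can address
val lr = new LinearRegression()
  .setMaxIter(30)

// Candidate regularization strengths on an exponential scale
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01, 0.001))
  .build()

// 3-fold cross-validation, scoring each candidate by RMSE
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingDF)   // trainingDF: DataFrame with "features" and "label"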

I hope this helps.

New Contributor

@Abdelkrim Hadjidj

That code snippet is broken and does not work: LinearRegressionWithSGD does not expose those parameters, so they cannot be set in ParamGridBuilder.

Do you have any suggestions for cross-validating when working with linear regression SGD models? The approach shown in your snippet and in the Spark documentation online does not work:

AttributeError: 'LinearRegressionWithSGD' object has no attribute 'stepSize'

<console>:21: error: not found: type ParamGridBuilder
       val paramGrid = new ParamGridBuilder()
         .addGrid(lr.stepSize, Array(0.1, 0.01))
         .build()
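One possible explanation, assuming a standard Spark shell session: the second error only means ParamGridBuilder was never imported, while the first reflects that LinearRegressionWithSGD lives in the RDD-based org.apache.spark.mllib package, which predates the params system, so it has no stepSize param for addGrid to reference. The usual workaround is a spark.ml estimator, as in the cross-validation sketch above; the missing import is:

import org.apache.spark.ml.tuning.ParamGridBuilder   // resolves "not found: type ParamGridBuilder"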