<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to find the best StepSize in a Spark ML LinearRegression Model in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-find-the-best-StepSize-in-a-Spark-ML-LinearRegression/m-p/118987#M26385</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2056/ahadjidj.html" nodeid="2056"&gt;@Abdelkrim Hadjidj&lt;/A&gt;
&lt;/P&gt;&lt;P&gt; That code snippet is broken and does not work. LinearRegressionWithSGD cannot set those parameters through ParamGridBuilder.&lt;/P&gt;&lt;P&gt;Do you have any suggestions for cross-validating Linear Regression SGD models? The approach shown in the snippet and in the online Spark documentation does not work:&lt;/P&gt;&lt;PRE&gt;AttributeError: 'LinearRegressionWithSGD' object has no attribute 'stepSize'
&lt;/PRE&gt;&lt;PRE&gt;&amp;lt;console&amp;gt;:21: error: not found: type ParamGridBuilder
       val paramGrid = new ParamGridBuilder() .addGrid(lr.stepSize, Array(0.1, 0.01)) .build()&lt;/PRE&gt;</description>
    <pubDate>Mon, 30 Oct 2017 09:15:27 GMT</pubDate>
    <dc:creator>robjarvis92</dc:creator>
    <dc:date>2017-10-30T09:15:27Z</dc:date>
    <item>
      <title>How to find the best StepSize in a Spark ML LinearRegression Model</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-find-the-best-StepSize-in-a-Spark-ML-LinearRegression/m-p/118984#M26382</link>
      <description>&lt;P&gt;
	After playing with the Spark 1.6 LinearRegression model, I found it is very sensitive to the StepSize. What is the best practice for tuning this parameter? The mean squared error of the model I build varies greatly depending on this input. &lt;/P&gt;&lt;PRE&gt;import java.text.NumberFormat
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Building the model (trainingDataRDD is assumed to be an RDD[LabeledPoint])
val numIterations = 30
val stepSize = 0.0001
val linearModel = LinearRegressionWithSGD.train(trainingDataRDD, numIterations, stepSize)

// Evaluate model on training examples and compute training error
val valuesAndPreds = trainingDataRDD.map { point =&amp;gt;
  val prediction = linearModel.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) =&amp;gt; math.pow(v - p, 2) }.mean()
println("training Mean Squared Error = " + NumberFormat.getInstance().format(MSE))
&lt;/PRE&gt;</description>
      <pubDate>Wed, 27 Apr 2016 22:07:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-find-the-best-StepSize-in-a-Spark-ML-LinearRegression/m-p/118984#M26382</guid>
      <dc:creator>khaslbeck</dc:creator>
      <dc:date>2016-04-27T22:07:09Z</dc:date>
    </item>
    <item>
      <title>Re: How to find the best StepSize in a Spark ML LinearRegression Model</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-find-the-best-StepSize-in-a-Spark-ML-LinearRegression/m-p/118985#M26383</link>
      <description>&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;First, I'm assuming you are essentially following the MLlib examples here:
&lt;A href="https://spark.apache.org/docs/latest/mllib-linear-methods.html"&gt;https://spark.apache.org/docs/latest/mllib-linear-methods.html&lt;/A&gt; 

StepSize is a hyper-parameter: an input chosen by the data scientist rather than a property of the dataset. To select the best hyper-parameter values, you can perform a grid search. For instance, you can take a list of stepSize values, stepSizeList = {0.1, 0.001, 0.000001}, and cycle through each one to see which yields the best model. Here is an article describing hyper-parameter tuning and grid search:


&lt;P&gt;&lt;A href="http://blog.dato.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning"&gt;http://blog.dato.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning&lt;/A&gt;&lt;/P&gt;
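&lt;P&gt;As a rough illustration of the grid search described above, here is a minimal, Spark-free Scala sketch. Everything in it is an assumption made for illustration and is not MLlib code: the object name StepSizeGridSearch, the one-dimensional model y = w * x, the synthetic data, the full-batch updates, and the hold-out split. It trains once per candidate in stepSizeList, using a step that shrinks as stepSize / sqrt(t) at iteration t, and keeps the candidate with the lowest validation error:&lt;/P&gt;
&lt;PRE&gt;import scala.util.Random

// Hypothetical, Spark-free sketch: fit y = w * x by gradient descent and
// pick the step size with the lowest validation error.
object StepSizeGridSearch {
  type Point = (Double, Double) // (feature, label)

  // Mean squared error of weight w on a data set.
  def mse(w: Double, data: Seq[Point]): Double =
    data.map { case (x, y) => math.pow(y - w * x, 2) }.sum / data.size

  // Full-batch gradient descent; the step at iteration t is stepSize / sqrt(t).
  def train(data: Seq[Point], stepSize: Double, numIterations: Int): Double =
    (1 to numIterations).foldLeft(0.0) { (w, t) =>
      val gradient = data.map { case (x, y) => 2.0 * (w * x - y) * x }.sum / data.size
      w - (stepSize / math.sqrt(t)) * gradient
    }

  // Train once per candidate and return (stepSize, validation MSE) of the best.
  def gridSearch(training: Seq[Point], validation: Seq[Point],
                 candidates: Seq[Double], numIterations: Int): (Double, Double) =
    candidates
      .map(s => (s, mse(train(training, s, numIterations), validation)))
      .reduceLeft((a, b) => if (a._2 > b._2) b else a)

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // Synthetic data: true weight 3.0 plus small Gaussian noise (an assumption).
    val data = Seq.fill(200) {
      val x = rng.nextDouble()
      (x, 3.0 * x + 0.01 * rng.nextGaussian())
    }
    val (training, validation) = data.splitAt(150)

    val stepSizeList = Seq(0.1, 0.001, 0.000001) // candidates from the text above
    val (bestStep, bestMse) = gridSearch(training, validation, stepSizeList, 30)
    println(s"best stepSize = $bestStep, validation MSE = $bestMse")
  }
}
&lt;/PRE&gt;
&lt;P&gt;Scoring each candidate on held-out data rather than on the training set helps avoid picking a step size that only looks good on the examples it was fitted to. In principle the same loop could wrap a call to LinearRegressionWithSGD.train in place of the hypothetical train function.&lt;/P&gt;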
&lt;P&gt;Quote: &lt;EM&gt;"For regularization parameters, it’s common to use exponential scale: 1e-5, 1e-4, 1e-3, … 1. Some guess work is necessary to specify the minimum and maximum values."&lt;/EM&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
      <pubDate>Wed, 27 Apr 2016 22:47:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-find-the-best-StepSize-in-a-Spark-ML-LinearRegression/m-p/118985#M26383</guid>
      <dc:creator>phargis</dc:creator>
      <dc:date>2016-04-27T22:47:20Z</dc:date>
    </item>
    <item>
      <title>Re: How to find the best StepSize in a Spark ML LinearRegression Model</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-find-the-best-StepSize-in-a-Spark-ML-LinearRegression/m-p/118986#M26384</link>
      <description>&lt;P&gt;
	Hi &lt;A rel="user" href="https://community.cloudera.com/users/2977/khaslbeck.html" nodeid="2977"&gt;@Kirk Haslbeck&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;
	I want to add some information to Paul's excellent answer. &lt;/P&gt;&lt;P&gt;
	First, tuning ML parameters is one of the hardest tasks for a data scientist, and it is an active research area. In your specific case (LinearRegressionWithSGD), stepSize is one of the hardest parameters to tune, as stated on the MLlib optimization page &lt;A href="https://spark.apache.org/docs/latest/mllib-optimization.html"&gt;here&lt;/A&gt;:&lt;/P&gt;&lt;BLOCKQUOTE&gt;Step-size. The parameter γ is the step-size, which in the default implementation is chosen decreasing with the square root of the iteration counter, i.e. γ := s/√t in the t-th iteration, with the input parameter s = stepSize. Note that selecting the best step-size for SGD methods can often be delicate in practice and is a topic of active research.&lt;/BLOCKQUOTE&gt;&lt;P&gt;In a general ML problem, you want to build a data pipeline that combines several data transformations to clean data and build features, as well as several algorithms, to achieve the best performance. This is a repetitive task where you try several options for each step. You also want to test several parameter values and choose the best one. For each pipeline, you need to evaluate the combination of algorithms and parameters that you have chosen. For the evaluation you can use techniques such as cross-validation. &lt;/P&gt;&lt;P&gt;Testing the combinations of these variables manually can be hard and time-consuming. Spark.ml is a package that can help make this process smoother. Spark.ml uses concepts such as transformers, estimators and params. The "params" help you automatically test several values for a parameter and choose the value that gives you the best model. This works by providing a ParamGridBuilder with the different values that you want to consider for each param in your pipeline. An example of this in your case could be:&lt;/P&gt;&lt;PRE&gt;val lr = new LinearRegressionWithSGD()
  .setNumIterations(30)

val paramGrid = new ParamGridBuilder()
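  // Note: addGrid expects a spark.ml Param; the spark.mllib class
  // LinearRegressionWithSGD does not define one, so treat this as illustrative.
  .addGrid(lr.stepSize, Array(0.1, 0.01))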
  .build()&lt;/PRE&gt;&lt;P&gt;Even if your ML problem is simple, I highly recommend looking into the &lt;A target="_blank" href="http://spark.apache.org/docs/latest/ml-guide.html"&gt;Spark.ml library&lt;/A&gt;. It can reduce your development time considerably.&lt;/P&gt;&lt;P&gt;I hope this helps.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Apr 2016 03:16:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-find-the-best-StepSize-in-a-Spark-ML-LinearRegression/m-p/118986#M26384</guid>
      <dc:creator>ahadjidj</dc:creator>
      <dc:date>2016-04-28T03:16:13Z</dc:date>
    </item>
    <item>
      <title>Re: How to find the best StepSize in a Spark ML LinearRegression Model</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-find-the-best-StepSize-in-a-Spark-ML-LinearRegression/m-p/118987#M26385</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2056/ahadjidj.html" nodeid="2056"&gt;@Abdelkrim Hadjidj&lt;/A&gt;
&lt;/P&gt;&lt;P&gt; That code snippet is broken and does not work. LinearRegressionWithSGD cannot set those parameters through ParamGridBuilder.&lt;/P&gt;&lt;P&gt;Do you have any suggestions for cross-validating Linear Regression SGD models? The approach shown in the snippet and in the online Spark documentation does not work:&lt;/P&gt;&lt;PRE&gt;AttributeError: 'LinearRegressionWithSGD' object has no attribute 'stepSize'
&lt;/PRE&gt;&lt;PRE&gt;&amp;lt;console&amp;gt;:21: error: not found: type ParamGridBuilder
       val paramGrid = new ParamGridBuilder() .addGrid(lr.stepSize, Array(0.1, 0.01)) .build()&lt;/PRE&gt;</description>
      <pubDate>Mon, 30 Oct 2017 09:15:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-find-the-best-StepSize-in-a-Spark-ML-LinearRegression/m-p/118987#M26385</guid>
      <dc:creator>robjarvis92</dc:creator>
      <dc:date>2017-10-30T09:15:27Z</dc:date>
    </item>
  </channel>
</rss>

