<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Run RDD operations on SQL Dataframe in 1.3.1 in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Run-RDD-operations-on-SQL-Dataframe-in-1-3-1/m-p/98180#M11657</link>
    <description>&lt;P&gt;Archived support thread on running RDD operations on a Spark SQL DataFrame in 1.3.1, including converting RDD[Row] to RDD[LabeledPoint] for MLlib regression. The full posts appear as items below.&lt;/P&gt;</description>
    <pubDate>Thu, 10 Dec 2015 03:39:19 GMT</pubDate>
    <dc:creator>vjain</dc:creator>
    <dc:date>2015-12-10T03:39:19Z</dc:date>
    <item>
      <title>Run RDD operations on SQL Dataframe in 1.3.1</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Run-RDD-operations-on-SQL-Dataframe-in-1-3-1/m-p/98178#M11655</link>
      <description>&lt;P&gt;I am trying to run regression on a dataset, but I ran into two issues:&lt;/P&gt;&lt;P&gt;1. When I try to split the dataset that I imported from a text file, I get the following error:&lt;/P&gt;&lt;P&gt;java.lang.NumberFormatException: For input string: "[34"&lt;/P&gt;&lt;P&gt;That's because the text file has the data in the format: [x, y, z ....] [a, b, c ....]&lt;/P&gt;&lt;P&gt;2. So I tried to use Spark SQL to create a DataFrame that I can then convert to an RDD using xRDD = x.rdd, but I get a type mismatch error: &lt;/P&gt;&lt;P&gt; found   : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
 required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]&lt;/P&gt;&lt;P&gt;How should I resolve this?&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2015 05:50:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Run-RDD-operations-on-SQL-Dataframe-in-1-3-1/m-p/98178#M11655</guid>
      <dc:creator>vjain</dc:creator>
      <dc:date>2015-12-09T05:50:30Z</dc:date>
    </item>
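An editor's sketch for issue 1 in the question above: the NumberFormatException on "[34" occurs because a number is parsed while its bracket is still attached, so stripping the brackets before calling toDouble avoids it. This is plain Scala with no Spark dependency; parseBracketed is a hypothetical helper, and it assumes each field is one bracketed group of comma-separated numbers:

```scala
// Hypothetical helper: turn a bracketed text field like "[34, 2.0, 5.5]"
// into an Array[Double]. Stripping the "[" and "]" before toDouble is
// what prevents java.lang.NumberFormatException on input such as "[34".
def parseBracketed(field: String): Array[Double] =
  field.trim
    .stripPrefix("[")
    .stripSuffix("]")
    .split(",")
    .map(_.trim.toDouble)
```

In a Spark job this helper would be applied inside a map over the lines of the text file before building MLlib inputs.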
    <item>
      <title>Re: Run RDD operations on SQL Dataframe in 1.3.1</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Run-RDD-operations-on-SQL-Dataframe-in-1-3-1/m-p/98179#M11656</link>
      <description>&lt;P&gt;Can you please post the full code and the error log?&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2015 06:23:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Run-RDD-operations-on-SQL-Dataframe-in-1-3-1/m-p/98179#M11656</guid>
      <dc:creator>ofermend</dc:creator>
      <dc:date>2015-12-09T06:23:03Z</dc:date>
    </item>
    <item>
      <title>Re: Run RDD operations on SQL Dataframe in 1.3.1</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Run-RDD-operations-on-SQL-Dataframe-in-1-3-1/m-p/98180#M11657</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/233/omendelevitch.html" nodeid="233"&gt;@Ofer Mendelevith&lt;/A&gt; &lt;/P&gt;&lt;P&gt;I think it's an issue with LabeledPoint: the algorithm expects labeled data but is not getting it. &lt;/P&gt;&lt;P&gt;val examples = MLUtils.loadLabeledData(sc,"hdfs:///user/zeppelin/las_demo/part-00000").cache() &lt;/P&gt;&lt;P&gt;val splits = examples.randomSplit(Array(0.8, 0.2)) &lt;/P&gt;&lt;P&gt;val training = splits(0).cache()
val test = splits(1).cache() &lt;/P&gt;&lt;P&gt;val numTraining = training.count() &lt;/P&gt;&lt;P&gt;val numTest = test.count() &lt;/P&gt;&lt;P&gt;println(s"Training: $numTraining, test: $numTest.") &lt;/P&gt;&lt;P&gt; val updater = new SquaredL2Updater()
     val model = {
        val algorithm = new LogisticRegressionWithSGD()
        algorithm.optimizer.setNumIterations(200).setStepSize(1.0).setUpdater(updater).setRegParam(0.1)
        algorithm.run(training).clearThreshold()
     } &lt;/P&gt;&lt;P&gt; 
    val rprediction = model.predict(test.map(_.features))
    val rpredictionAndLabel = rprediction.zip(test.map(_.label)) &lt;/P&gt;&lt;P&gt;    val rmetrics = new BinaryClassificationMetrics(rpredictionAndLabel)&lt;/P&gt;&lt;P&gt;The error is as follows:&lt;/P&gt;&lt;P&gt;warning: there were 1 deprecation warning(s); re-run with -deprecation for details
examples: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[52] at map at MLUtils.scala:214
splits: Array[org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]] = Array(PartitionwiseSampledRDD[53] at randomSplit at &amp;lt;console&amp;gt;:72, PartitionwiseSampledRDD[54] at randomSplit at &amp;lt;console&amp;gt;:72)
training: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = PartitionwiseSampledRDD[53] at randomSplit at &amp;lt;console&amp;gt;:72
test: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = PartitionwiseSampledRDD[54] at randomSplit at &amp;lt;console&amp;gt;:72
numTraining: Long = 19589
numTest: Long = 4889
Training: 19589, test: 4889.
updater: org.apache.spark.mllib.optimization.SquaredL2Updater = org.apache.spark.mllib.optimization.SquaredL2Updater@3b9284cd
org.apache.spark.SparkException: Input validation failed.
	at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:210)
	at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:190)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:81)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:87)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:89)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:91)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:93)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:95)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:97)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:99)
	at $iwC$$iwC$$iwC$$iwC$$iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:101)
	at $iwC$$iwC$$iwC$$iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:103)
	at $iwC$$iwC$$iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:105)
	at $iwC$$iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:107)
	at $iwC.&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:109)
	at &amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:111)
	at .&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:115)
	at .&amp;lt;clinit&amp;gt;(&amp;lt;console&amp;gt;)
	at .&amp;lt;init&amp;gt;(&amp;lt;console&amp;gt;:7)
	at .&amp;lt;clinit&amp;gt;(&amp;lt;console&amp;gt;)
	at $print(&amp;lt;console&amp;gt;)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:655)
	at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:620)
	at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:613)
	at org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:276)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:170)
	at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:118)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)&lt;/P&gt;</description>
      <pubDate>Thu, 10 Dec 2015 03:39:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Run-RDD-operations-on-SQL-Dataframe-in-1-3-1/m-p/98180#M11657</guid>
      <dc:creator>vjain</dc:creator>
      <dc:date>2015-12-10T03:39:19Z</dc:date>
    </item>
    <item>
      <title>Re: Run RDD operations on SQL Dataframe in 1.3.1</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Run-RDD-operations-on-SQL-Dataframe-in-1-3-1/m-p/98181#M11658</link>
      <description>&lt;P&gt;It looks like the expected format for labeled points is different from what you have. I ran the statement below to understand the format; each row should be in the form highlighted below. &lt;/P&gt;&lt;P&gt;scala&amp;gt; val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))&lt;/P&gt;&lt;P&gt;pos: org.apache.spark.mllib.regression.LabeledPoint = &lt;STRONG&gt;(1.0,[1.0,0.0,3.0])&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2015 05:06:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Run-RDD-operations-on-SQL-Dataframe-in-1-3-1/m-p/98181#M11658</guid>
      <dc:creator>nalini69kasturi</dc:creator>
      <dc:date>2015-12-29T05:06:48Z</dc:date>
    </item>
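The reply above shows the string form a labeled point prints as: (label,[f1,f2,...]). As a Spark-free illustration of that shape, a small decoder can be sketched; decodePoint is a hypothetical helper written only to make the expected layout concrete:

```scala
// Hypothetical decoder for the printed form "(label,[f1,f2,...])":
// the first comma separates the label from the feature vector.
def decodePoint(s: String): (Double, Array[Double]) = {
  val inner = s.trim.stripPrefix("(").stripSuffix(")")
  val comma = inner.indexOf(',')
  val label = inner.substring(0, comma).toDouble
  val feats = inner.substring(comma + 1)
    .stripPrefix("[")
    .stripSuffix("]")
    .split(",")
    .map(_.trim.toDouble)
  (label, feats)
}
```

If the rows loaded from text do not decode into this label-plus-vector shape, MLlib's input validation fails, which matches the SparkException reported earlier in the thread.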
    <item>
      <title>Re: Run RDD operations on SQL Dataframe in 1.3.1</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Run-RDD-operations-on-SQL-Dataframe-in-1-3-1/m-p/98182#M11659</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/420/vjain.html" nodeid="420"&gt;@Vedant Jain&lt;/A&gt; can you accept the best answer to close this thread or post your solution?&lt;/P&gt;</description>
      <pubDate>Tue, 02 Feb 2016 10:04:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Run-RDD-operations-on-SQL-Dataframe-in-1-3-1/m-p/98182#M11659</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-02-02T10:04:31Z</dc:date>
    </item>
  </channel>
</rss>

