How to cast variable of type ML sparse vector type to MLlib sparse vector type ?


When I execute the code below,

val realout = output.select("label", "features").rdd.map(row => LabeledPoint(
  row.getAs[Double]("label"),
  row.getAs[org.apache.spark.mllib.linalg.SparseVector]("features")
))

I am getting the error below:

[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13, localhost): java.lang.ClassCastException: org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.mllib.linalg.Vector
[error] 	at DataCleaning$$anonfun$1.apply(DataCleaning.scala:107)
[error] 	at DataCleaning$$anonfun$1.apply(DataCleaning.scala:105)
[error] 	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
[error] 	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
[error] 	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213)
[error] 	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
[error] 	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
[error] 	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
[error] 	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
[error] 	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
[error] 	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
[error] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
[error] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
[error] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
[error] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
[error] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
[error] 	at org.apache.spark.scheduler.Task.run(Task.scala:86)
[error] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
[error] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error] 	at java.lang.Thread.run(Thread.java:745)
[error] 
[error] Driver stacktrace

Kindly help me out.


Re: How to cast variable of type ML sparse vector type to MLlib sparse vector type ?


I had the same problem. It turns out the MLlib `Vectors` object has a helper function to convert an ML vector to an MLlib vector. Use it like this:

val realout = output.select("label", "features").rdd.map(row => LabeledPoint(
  row.getAs[Double]("label"),
  org.apache.spark.mllib.linalg.Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"))
))

Note that this function is available in Spark 2.0.0 and later. You can also convert a whole vector column at once with

MLUtils.convertVectorColumnsFromML(df)

or, for the other direction (MLlib to ML):

MLUtils.convertVectorColumnsToML(df)
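As a minimal, self-contained sketch of the `fromML` helper (this only needs the spark-mllib jar on the classpath, no SparkSession; the demo object name and the example vector values are made up for illustration):

```scala
import org.apache.spark.ml.linalg.{Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{SparseVector, Vectors => MLlibVectors}

object FromMLDemo {
  def main(args: Array[String]): Unit = {
    // Build an ML (DataFrame-based API) sparse vector: size 3, nonzeros at indices 0 and 2
    val mlVec = MLVectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

    // Convert it to the older MLlib (RDD-based API) vector type
    val mllibVec = MLlibVectors.fromML(mlVec)

    // The conversion preserves sparsity and values
    assert(mllibVec.isInstanceOf[SparseVector])
    assert(mllibVec(2) == 3.0)
    println(mllibVec)
  }
}
```

The same call is what the `row.getAs` fix above performs per row: read the column as the ML type it actually holds, then convert, instead of casting directly (which throws the `ClassCastException` you saw).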