When I call SparkR:::toRDD to convert a DataFrame to an RDD, it looks like a single R process runs on one datanode and all the data is passed through that process, so the conversion takes a long time. Is there a way to parallelize this operation?
I think this transformation happens on the driver, and hence it is taking time.
Found: https://issues.apache.org/jira/browse/SPARK-8277