07-25-2017 12:17 PM
Our CDSW is installed on an edge node which has a minimal configuration compared to the main Hadoop cluster where the data resides. When I run R jobs, they run on the Docker/edge node instead of on the cluster.
Is there any way I can make them run on the cluster (which has more muscle power than the edge node)?
Also note that SparkR is not available for R 3.3.0 (the version of R on CDSW), so Spark might not be an option for me unless there's a workaround I am not aware of.
07-25-2017 12:30 PM
Thanks for your question.
Standalone R and Python jobs run only on the CDSW edge nodes, where we have more control over dependency management using Docker. However, these jobs can push workloads into the cluster using tools like PySpark, Sparklyr, Impala, and Hive. This allows you to get full dependency management for R and Python in the edge environment while still scaling specific workloads into the cluster. There is currently no way to run the R and Python jobs themselves under YARN.
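As a rough illustration of pushing a workload into the cluster, here is a minimal PySpark sketch. It assumes PySpark is available in the CDSW session and that the cluster's YARN configuration is already visible to it. The helper names (yarn_session_options, count_by_column), the application name, and the executor sizing are hypothetical and chosen only for this example; the table and column you pass in are your own.

```python
def yarn_session_options(app_name, executors=4, executor_memory="4g"):
    """Build Spark options that place the executors on the YARN cluster.

    The driver stays in the CDSW session (client deploy mode); only the
    executors run on the cluster. The default sizes are placeholders.
    """
    return {
        "spark.app.name": app_name,
        "spark.master": "yarn",
        "spark.submit.deployMode": "client",
        "spark.executor.instances": str(executors),
        "spark.executor.memory": executor_memory,
    }


def count_by_column(table_name, group_col):
    """Run a group-by count on the cluster and return small results locally."""
    # Imported here so the options helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in yarn_session_options("cdsw-pushdown-example").items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    try:
        # The aggregation executes on the cluster executors.
        counts = spark.table(table_name).groupBy(group_col).count()
        # Only the aggregated rows are brought back to the edge node.
        return [(row[group_col], row["count"]) for row in counts.collect()]
    finally:
        spark.stop()
```

Calling count_by_column with a table and column that exist in your metastore would run the aggregation on the cluster and return only the summary rows to the edge node, which is the pattern described above: heavy lifting on the cluster, dependency-managed code on CDSW.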
As for SparkR, we recommend using Sparklyr instead, although we do not directly support it.
I hope that is helpful.