Currently, I can access HDFS from inside my application, but instead of running my local Spark I would also like to use Cloudera's Spark, since it is enabled in Cloudera Manager.
Right now I have HDFS defined in core-site.xml, and I run my app with YARN as the --master, so I don't need to put the machine address in my HDFS file paths. Set up this way, my Spark job runs locally and not in the "cluster." That is not what I want for now. When I try to set --master to [namenode]:[port], it does not connect. I wonder whether I'm pointing at the correct port, whether I have to map this port in the Docker container, or whether I'm missing something about the YARN setup.
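To narrow down the port question, here is a minimal sketch of the check I have in mind. As far as I understand, with a YARN master Spark does not take a host:port at all; it reads the ResourceManager address from the Hadoop configuration directory (HADOOP_CONF_DIR or YARN_CONF_DIR), and the NameNode port only serves HDFS. The script looks up `yarn.resourcemanager.address` in yarn-site.xml and tries a plain TCP connection to it from wherever it runs (for example, inside the container). The default path `/etc/hadoop/conf/yarn-site.xml` and the helper names are my own assumptions, not something taken from my actual setup.

```python
import socket
import sys
import xml.etree.ElementTree as ET


def read_hadoop_property(root, name):
    """Return the value of a Hadoop-style <property> entry, or None if absent."""
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            value = prop.findtext("value")
            return value.strip() if value else None
    return None


def split_host_port(address):
    """Split 'host:port' into (host, port) with the port as an int."""
    host, _, port = address.rpartition(":")
    return host, int(port)


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed location of the client configuration; pass another path as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/hadoop/conf/yarn-site.xml"
    root = ET.parse(path).getroot()
    address = read_hadoop_property(root, "yarn.resourcemanager.address")
    if address is None:
        print("yarn.resourcemanager.address is not set in", path)
    else:
        host, port = split_host_port(address)
        print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

If the address is not reachable from inside the container, the problem is more likely networking or port mapping than the Spark arguments themselves. If the property is missing, the ResourceManager is probably configured through some other property or file, and this check does not tell me much.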
Additionally, I've been testing the SnappyData (Inc) solution as a Spark SQL in-memory database. My goal is to run the Snappy JVMs locally while redirecting Spark jobs to the VM cluster. The whole idea is to test performance against a Hadoop implementation. This setup is not a final product: if Snappy is local and Spark is "really" remote, I believe it won't be efficient, but in that scenario I would bring the Snappy JVMs to the same cluster.