Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3030 | 01-26-2018 04:02 AM |
| | 6379 | 12-22-2017 09:18 AM |
| | 3063 | 12-05-2017 06:13 AM |
| | 3321 | 10-16-2017 07:55 AM |
| | 9501 | 10-04-2017 08:08 PM |
02-16-2017 05:50 AM
Yes, you can use Spark and all of the services in the VM; none of them depend on Cloudera Manager at all.
01-15-2017 10:13 AM
No, you definitely do not want to take this directory away from the hdfs user! In general, I'd never change HDFS permissions on key directories like this. Instead, the hdfs superuser needs to create a home directory for your user. This kind of thing happens automatically via Hue.
01-15-2017 02:34 AM
1 Kudo
That's the general error you get when you run as user foo but haven't set up /user/foo in HDFS. The usual way that's done is through Hue, or by syncing with something like Active Directory.
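If neither applies and you need to do it by hand, it's a couple of commands as the HDFS superuser. A minimal sketch, assuming the user is literally named foo:

```bash
# Create the missing HDFS home directory for user "foo" and hand ownership to them
sudo -u hdfs hdfs dfs -mkdir -p /user/foo
sudo -u hdfs hdfs dfs -chown foo:foo /user/foo
```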
01-06-2017 03:26 AM
Generally, you won't be able to run R on your laptop/workstation and connect it remotely to the cluster. It's possible, but it would require more setup and configuration, so I would avoid that deployment for now. Instead, run R on a cluster gateway node. You are also using a standalone master, which isn't supported anyway; you want YARN. Although you should be able to use your own copy of SparkR 1.6 with the cluster, I don't know whether it works, and it's not supported. sparklyr is another option, and at least it is supported by RStudio.
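As a rough illustration, this is the shape of a sparklyr session run on a gateway node against YARN. A sketch only: the spark_home path is an assumption about a parcel-based CDH install, not something I've verified on your cluster.

```r
library(sparklyr)

# Connect through YARN rather than a standalone master.
# spark_home below assumes a parcel-based CDH layout; adjust for yours.
sc <- spark_connect(master = "yarn-client",
                    spark_home = "/opt/cloudera/parcels/CDH/lib/spark")

iris_tbl <- copy_to(sc, iris)   # push a small local data frame to the cluster

spark_disconnect(sc)
```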
01-04-2017 04:30 AM
Generally speaking, you will need connectivity from your laptop to at least one machine in the cluster (the gateway), and some local configuration for sparklyr that indicates where the cluster is. I haven't tried this with sparklyr, but for other R-Hadoop libraries like rhdfs, it means having a local copy of the HADOOP_CONF_DIR files from the cluster. It also probably means having the same version of the Spark binaries locally as on the cluster. This is challenging, so you may be better off running sparklyr directly on the edge/gateway node of the cluster. See https://blog.cloudera.com/blog/2016/09/introducing-sparklyr-an-r-interface-for-apache-spark/ Instead of installing Spark, point it to a non-local master like "yarn-client" to use the cluster.

SparkR is also something you can try to get working. You would probably need an upstream SparkR version that matches the CDH Spark you're using (1.x vs. 2.x), and then just try to run ./bin/sparkR from its distribution, as sketched below. Standalone mode isn't supported.

None of these (SparkR, sparklyr) are supported by Cloudera, so they have no relationship to CM. You should not modify your existing Spark service, and you shouldn't have to.
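If you do experiment with upstream SparkR, it would look roughly like this from the gateway node. The download directory name is hypothetical; match its version to the cluster's Spark (1.x here):

```bash
# Assumes the cluster's Hadoop client configs are already on the gateway node
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Hypothetical upstream Spark 1.6 download, matching a CDH Spark 1.x cluster
cd spark-1.6.3-bin-hadoop2.6
./bin/sparkR --master yarn-client   # go through YARN; standalone mode isn't supported
```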
01-02-2017 05:27 AM
It sounds like you only finished step 2. You need to finish all of the steps.
01-02-2017 01:06 AM
There is no CDH 5.12. Spark 2 is available as a CSD. Please follow the documented steps for installing it, which never include manually copying JARs around: http://www.cloudera.com/documentation/spark2/latest/topics/spark2_installing.html
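For orientation, the CSD route looks roughly like the sketch below; the filename is an assumption for a Spark 2 CSD, and the linked documentation is authoritative. Note that the one JAR involved here is the CSD descriptor for Cloudera Manager, not Spark's own JARs.

```bash
# Hedged sketch: place the CSD (assumed filename) where Cloudera Manager looks for it
sudo cp SPARK2_ON_YARN-2.0.0.cloudera1.jar /opt/cloudera/csd/
sudo chown cloudera-scm:cloudera-scm /opt/cloudera/csd/SPARK2_ON_YARN-2.0.0.cloudera1.jar
sudo service cloudera-scm-server restart   # then add the Spark 2 service through CM
```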
12-16-2016 09:08 AM
It won't be terribly different, since a maintenance release generally contains only a small number of fixes, but yes, you will want to update in general. You will also need the GA version if you want production support.
11-16-2016 03:09 AM
No, the repo I'm referencing is the single Cloudera repository where all artifacts are hosted.
11-16-2016 02:36 AM
1 Kudo
See https://github.com/OryxProject/oryx/tree/master/deploy/bin. In general, you should start from https://github.com/OryxProject/oryx