Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71

My Accepted Solutions
Views | Posted
---|---
6190 | 03-09-2016 01:21 AM
5031 | 03-07-2016 01:52 AM
15066 | 02-29-2016 04:40 AM
4741 | 02-22-2016 03:08 PM
5750 | 01-19-2016 02:13 PM
08-04-2014 11:39 AM
Where does the out of memory exception occur: in your driver, or in an executor? I assume it is an executor. Yes, you are using the default of 512MB per executor. You can raise that with properties like spark.executor.memory, or flags like --executor-memory if using spark-shell. It sounds like your workers are allocating 2GB for executors, so you could use up to 2GB per executor, and your one executor per machine would then consume all of your Spark cluster memory. But more memory doesn't necessarily help if you're performing some operation that inherently allocates a great deal of memory; I'm not sure what your operations are. Keep in mind too that if you are caching RDDs in memory, that caching takes memory away from what's available for computations.
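As a hedged sketch, one way to raise the setting when constructing the context in Scala (the "2g" value and app name are placeholders; match them to what your workers actually offer):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: raise executor memory from the 512m default.
// "2g" and the app name are placeholders, not recommendations.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)
```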
08-03-2014 02:54 AM
1 Kudo
The method is "textFile", not "textfile": https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.SparkContext
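For reference, a minimal call with the correct capitalization (the path is just a placeholder):

```scala
// Note the capital F in textFile; "textfile" will not compile.
val lines = sc.textFile("hdfs:///path/to/input.txt")
lines.take(5).foreach(println)
```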
07-29-2014 03:59 AM
1 Kudo
Bad news: not directly. The design goal here is real-time scoring. You could write a process that queries an embedded Serving Layer, or makes calls to one over HTTP. It's a bit more overhead, but it certainly works. The bulk recommend function is really a hold-over from the older code base, and there wasn't an equivalent for classification. Good news: since the output is a PMML model, and libraries like openscoring exist, you could fairly easily wire up a Mapper that loads a model and scores data.
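As a rough sketch of the HTTP approach, a batch process could simply GET the Serving Layer's recommend endpoint per user. The host, port, and path below are assumptions, not the documented API; check your Serving Layer's REST docs for the actual endpoints.

```scala
import scala.io.Source

// Hypothetical endpoint and port; adjust to your Serving Layer's actual API.
def recommendationsFor(userID: String): String = {
  val url = s"http://serving-layer-host:8091/recommend/$userID"
  Source.fromURL(url).mkString
}
```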
07-28-2014 10:30 AM
It sounds like you did not add a dependency on Spark SQL to your project. The artifact ID is "spark-sql_2.10".
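For example, in an SBT build (the version shown is illustrative and should match the Spark on your cluster):

```scala
// %% appends the Scala version, yielding spark-sql_2.10
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"
```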
07-11-2014 06:49 AM
This might help a lot: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

Yes, you want Spark executors to end up colocated with datanodes, or else data has to be accessed over the network a lot. It works, but ideally workers all process only local data. You should get that if YARN nodemanagers are colocated with datanodes, since when running on YARN, it is YARN that runs Spark's executors in its containers.

Things get confusing for at least two reasons. First, there are two different types of YARN deployment, although I don't think they affect how you think about placing services. Second, there is also "standalone" mode, the default in 0.9.0 and what you are currently using, in which you control where Spark workers run separately from YARN. I suppose the thing that matters is: datanode, nodemanager, and Spark worker services are present on all machines doing work.
07-11-2014 05:28 AM
You might wait for CDH 5.1.0, which will be released very soon; it deploys Spark 1.0.0 plus patches on YARN for you.

"Node" means a machine on which you want to run Spark. "Namenode" is, for example, an HDFS concept; it is not directly related to Spark. You may choose to run a Spark process on a machine that happens to host the namenode, or not. This is why Spark is not described in terms of, say, HDFS roles. You do not need to start the Spark master on the HDFS namenode, just as you didn't have to start the MR jobtracker on the namenode. On a cluster I ran, I put the master on the namenode just because it's a simple default choice. But any machine that can see HDFS and YARN would be fine; it need not even be running other Hadoop services. You can easily choose which machines are the Spark workers and which is the master in Cloudera Manager.

The Spark master is not the same thing as a client; its role is really like that of the jobtracker, and it would not be run outside the cluster. You may be thinking of a driver for your specific app.

The Apache distro is indeed a tarball, and it's up to you to deploy and run it. The role of CDH is to package, deploy, and run things for you. The packaging is not at all the same, although the contents (scripts, binaries) are of course the same. You would not try to paste the raw tarball onto CDH nodes. If you want to get adventurous, you can go to all machines, dig into /opt/cloudera/parcels/CDH/lib/spark, and replace binaries with a newer compiled version. That's a manual process, and I suppose not 100% guaranteed to work, but you can try it.
07-03-2014 03:08 AM
There is no equivalent; Spark MLlib has just the bare bones of model-building algorithms, so you would write it yourself. It would not be too much Spark code to write, though calculating all pairs is always potentially too huge to contemplate.
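To give a sense of the scale, a naive all-pairs computation over an RDD is only a few lines, but cartesian() generates on the order of n^2 pairs. The names and the cosine function below are purely illustrative, not an MLlib API:

```scala
import org.apache.spark.rdd.RDD

// Illustrative only: naive all-pairs cosine similarity.
// cartesian() produces ~n^2 pairs, which is why this blows up for large n.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
  dot / (norm(a) * norm(b))
}

def allPairs(vectors: RDD[(Long, Array[Double])]): RDD[((Long, Long), Double)] =
  vectors.cartesian(vectors)
    .filter { case ((i, _), (j, _)) => i < j }   // keep each unordered pair once
    .map { case ((i, a), (j, b)) => ((i, j), cosine(a, b)) }
```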
06-30-2014 07:38 AM
OK, I meant: what are the Maven/SBT dependencies? In any event, I think Hadoop 0.20.2 is the problem. CDH 5 is Hadoop 2.3, and the supplied Spark works with that. The classpath on your cluster shows you've also got Hadoop 0.20.2 classes in the mix somehow. I don't know where those are coming from, but that is the problem.
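One common way this happens is an old Hadoop dependency leaking into the application jar. A hedged SBT sketch of keeping the cluster-provided libraries out of the assembly (versions are illustrative; use what matches your CDH release):

```scala
// Mark cluster-provided libraries as "provided" so the app jar does not
// bundle its own (possibly old) Hadoop classes. Versions are illustrative.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.0.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.3.0" % "provided"
)
```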
06-30-2014 07:30 AM
How did you compile your jar file, and against which Spark and Hadoop dependencies? It seems like something is missing from the classpath. Try executing this first and then re-running:

export SPARK_PRINT_LAUNCH_COMMAND=1

That ought to make it print the launch command, including the classpath. The error is from the local driver, not the app on the cluster, right?
06-30-2014 07:09 AM
(I think you have a typo in "WorkCountJob", but that's not the issue yet.) Did you run the following?

source /etc/spark/conf/spark-env.sh