Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71

My Accepted Solutions
Views | Posted
---|---
6190 | 03-09-2016 01:21 AM
5031 | 03-07-2016 01:52 AM
15066 | 02-29-2016 04:40 AM
4741 | 02-22-2016 03:08 PM
5750 | 01-19-2016 02:13 PM
08-04-2014 11:39 AM
Where does the out of memory exception occur: in your driver, or in an executor? I assume it is an executor. Yes, you are using the default of 512MB per executor. You can raise that with properties like spark.executor.memory, or flags like --executor-memory if using spark-shell. It sounds like your workers are allocating 2GB for executors, so you could use up to 2GB per executor, and your one executor per machine would then consume all of your Spark cluster memory. But more memory doesn't necessarily help if you're performing some operation that inherently allocates a great deal of memory; I'm not sure what your operations are. Keep in mind too that if you are caching RDDs in memory, that caching takes memory away from what's available for computations.
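As a hedged sketch, one way to raise the setting when constructing the context in Scala (the "2g" value and app name are placeholders; match them to what your workers actually offer):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: raise executor memory from the 512m default.
// "2g" and the app name are placeholders, not recommendations.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)
```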
08-03-2014 02:54 AM
1 Kudo
The method is "textFile", not "textfile": https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.SparkContext
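For reference, a minimal call with the correct capitalization (the path is just a placeholder):

```scala
// Note the capital F in textFile; "textfile" will not compile.
val lines = sc.textFile("hdfs:///path/to/input.txt")
lines.take(5).foreach(println)
```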
07-29-2014 03:59 AM
1 Kudo
Bad news: not directly. The design goal here is real-time scoring. You could write a process that queries an embedded Serving Layer, or makes calls to one over HTTP. It's a bit more overhead, but it certainly works. The bulk recommend function is really a hold-over from the older code base, and there wasn't an equivalent for classification. Good news: since the output is a PMML model, and libraries like openscoring exist, you could fairly easily wire up a Mapper that loads a model and scores data.
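As a rough sketch of the HTTP approach, a batch process could simply GET the Serving Layer's recommend endpoint per user. The host, port, and path below are assumptions, not the documented API; check your Serving Layer's REST docs for the actual endpoints.

```scala
import scala.io.Source

// Hypothetical endpoint and port; adjust to your Serving Layer's actual API.
def recommendationsFor(userID: String): String = {
  val url = s"http://serving-layer-host:8091/recommend/$userID"
  Source.fromURL(url).mkString
}
```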
07-28-2014 10:30 AM
It sounds like you did not add a dependency on Spark SQL to your project. The artifact ID is "spark-sql_2.10".
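For example, in an SBT build (the version shown is illustrative and should match the Spark on your cluster):

```scala
// %% appends the Scala version, yielding spark-sql_2.10
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"
```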
07-11-2014 06:49 AM
This might help a lot: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

Yes, you want Spark executors to end up colocated with datanodes, or else data has to be accessed over the network a lot. It works, but ideally workers all process only local data. You should get that if YARN nodemanagers are colocated with datanodes, since when running on YARN, it is YARN that runs Spark's executors in its containers.

Things get confusing for at least two reasons. First, there are two different types of YARN deployment, although I don't think they affect how you think about placing services. Second, there is also "standalone" mode, the default in 0.9.0 and what you are currently using, in which you control where Spark workers run separately from YARN. I suppose the thing that matters is: datanode, nodemanager, and Spark worker services are present on all machines doing work.
07-11-2014 05:28 AM
You might wait for CDH 5.1.0, which will be released very soon; it deploys Spark 1.0.0 plus patches on YARN for you.

"Node" means a machine on which you want to run Spark. "Namenode" is, for example, an HDFS concept; it is not directly related to Spark. You may choose to run a Spark process on a machine that happens to host the namenode, or not. This is why Spark is not described in terms of, say, HDFS roles. You do not need to start the Spark master on the HDFS namenode, just as you didn't have to start the MR jobtracker on the namenode. On a cluster I ran, I put the master on the namenode just because it's a simple default choice. But any machine that can see HDFS and YARN would be fine; it need not even be running other Hadoop services. You can easily choose which machines are the Spark workers and which is the master in Cloudera Manager.

The Spark master is not the same thing as a client; its role is really like that of the jobtracker, and it would not be run outside the cluster. You may be thinking of a driver for your specific app.

The Apache distro is indeed a tarball, and it's up to you to deploy and run it. The role of CDH is to package, deploy, and run things for you. The packaging is not at all the same, although the contents (scripts, binaries) are of course the same. You would not try to paste the raw tarball onto CDH nodes. If you want to get adventurous, you can go to all machines, dig into /opt/cloudera/parcels/CDH/lib/spark, and replace binaries with a newer compiled version. That's a manual process, and I suppose not 100% guaranteed to work, but you can try it.
07-03-2014 03:08 AM
There is no equivalent; Spark MLlib has just the bare bones of model-building algorithms, so you would write it yourself. It would not be too much Spark code to write, though calculating all pairs is always potentially too huge to contemplate.
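To give a sense of the scale, a naive all-pairs computation over an RDD is only a few lines, but cartesian() generates on the order of n^2 pairs. The names and the cosine function below are purely illustrative, not an MLlib API:

```scala
import org.apache.spark.rdd.RDD

// Illustrative only: naive all-pairs cosine similarity.
// cartesian() produces ~n^2 pairs, which is why this blows up for large n.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
  dot / (norm(a) * norm(b))
}

def allPairs(vectors: RDD[(Long, Array[Double])]): RDD[((Long, Long), Double)] =
  vectors.cartesian(vectors)
    .filter { case ((i, _), (j, _)) => i < j }   // keep each unordered pair once
    .map { case ((i, a), (j, b)) => ((i, j), cosine(a, b)) }
```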
06-30-2014 07:38 AM
OK, I meant: what are the Maven/SBT dependencies? In any event, I think Hadoop 0.20.2 is the problem. CDH 5 is Hadoop 2.3, and the supplied Spark works with that. The classpath on your cluster shows you've also got Hadoop 0.20.2 classes in the mix somehow. I don't know where those are coming from, but that is the problem.
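One common way this happens is an old Hadoop dependency leaking into the application jar. A hedged SBT sketch of keeping the cluster-provided libraries out of the assembly (versions are illustrative; use what matches your CDH release):

```scala
// Mark cluster-provided libraries as "provided" so the app jar does not
// bundle its own (possibly old) Hadoop classes. Versions are illustrative.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.0.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.3.0" % "provided"
)
```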
06-30-2014 07:30 AM
How did you compile your jar file, and against which Spark and Hadoop dependencies? It seems like something is missing from the classpath. Try executing this first and then re-running:

export SPARK_PRINT_LAUNCH_COMMAND=1

That ought to make it print the launch command, including the classpath. The error is from the local driver, not the app on the cluster, right?
06-30-2014 07:09 AM
(I think you have a typo in "WorkCountJob", but that's not the issue yet.) Did you run the following?

source /etc/spark/conf/spark-env.sh