Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4143 | 03-09-2016 01:21 AM |
| | 3789 | 03-07-2016 01:52 AM |
| | 12228 | 02-29-2016 04:40 AM |
| | 3448 | 02-22-2016 03:08 PM |
| | 4464 | 01-19-2016 02:13 PM |
09-12-2014
06:49 AM
It will make a difference insofar as the driver program will run either out on the cluster (yarn-cluster) or locally (yarn-client). The same issue remains -- the processes need to talk to each other on certain ports. But it affects where the driver is and that affects what machine's ports need to be open. For example, if your ports are all open within your cluster, I expect that yarn-cluster works directly.
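As a sketch, here is the same app submitted in the two modes (the jar and class names are placeholders, and this assumes the Spark 1.x `yarn-client` / `yarn-cluster` master syntax):

```shell
# yarn-client: the driver runs on the machine you submit from, so the
# cluster nodes must be able to reach that machine's ports.
spark-submit --master yarn-client --class com.example.MyApp myapp.jar

# yarn-cluster: the driver runs inside the cluster as the YARN
# application master, so only intra-cluster ports need to be open.
spark-submit --master yarn-cluster --class com.example.MyApp myapp.jar
```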
09-12-2014
06:23 AM
I believe it was added in 1.1, yes. I don't have a streaming app driver handy, so maybe double-check -- you will see an obvious Streaming tab if it's there. Without guaranteeing anything, I think the next CDH will have 1.1, and at any time you can run your own Spark jobs with any version under YARN.
09-12-2014
04:42 AM
Yes, there is a special Streaming tab in the latest Spark driver UI.
09-12-2014
03:30 AM
The code here doesn't do work, really. It sets up and configures work: it expresses where data comes from, how it is transformed, and where it goes. No work is done until ssc.start(), so timing the code before that doesn't help. You can already see some timing information in the Spark Streaming UI. You can try computing timing within the functions, since that will time them at the time of execution. However, even methods like .join() called inside transform() are themselves transformations that don't do work immediately, so it would not help to time that one. Actions like foreach would make sense to time. Really, I would start by looking at Spark's built-in timing metrics.
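As a sketch of the distinction, here is a small timing helper in plain Scala (no Spark needed to compile it); the commented example assumes you have an RDD called `rdd` in scope:

```scala
// Minimal timing helper: evaluate a block, report elapsed time, return
// the block's result.
object Timing {
  def time[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$label took $elapsedMs%.1f ms")
    result
  }
}

// In a Spark app, wrap an *action*, since that is where work happens:
//   Timing.time("count") { rdd.count() }
// Wrapping a transformation like rdd.map(...) would only time the
// near-instant setup, not the actual computation.
```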
09-12-2014
12:45 AM
1 Kudo
Have a look at: https://spark.apache.org/docs/latest/configuration.html#networking I think you are interested in pinning the driver and executor ports to fixed values, rather than letting them be chosen randomly. The same goes for the UI ports, if you're interested in those.
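For example, a spark-defaults.conf fragment pinning some of those ports might look like this (property names are from the Spark 1.x networking configuration; the port numbers are arbitrary placeholders):

```
spark.driver.port        7001
spark.fileserver.port    7002
spark.broadcast.port     7003
spark.blockManager.port  7005
spark.ui.port            4040
```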
09-11-2014
08:54 AM
On its face it means what it says: the master is unable to talk to the worker. I would check your firewall rules and make sure these machines can talk to each other on these ports. Spark picks ephemeral ports, so you may have to open ranges.
09-11-2014
03:13 AM
It has nothing to do with Akka per se. This says your jobs are failing. You would have to look at the logs on the workers to understand why.
09-10-2014
08:14 AM
I think you imported just about everything except the one thing you need: the implicit conversions that unlock the functions in PairRDDFunctions, which is where join() is defined. You need: import org.apache.spark.SparkContext._ In the shell, this is imported by default.
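A minimal sketch of a standalone app where that import makes join() compile (the object and app names are illustrative; this targets the Spark 1.x API):

```scala
import org.apache.spark.{SparkConf, SparkContext}
// This import brings the implicit conversion to PairRDDFunctions into
// scope; without it, join() on an RDD of pairs does not compile.
import org.apache.spark.SparkContext._

object JoinExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JoinExample"))
    val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
    val right = sc.parallelize(Seq((1, "x"), (3, "y")))
    // Compiles only because SparkContext._ is in scope.
    left.join(right).collect().foreach(println)
    sc.stop()
  }
}
```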
09-05-2014
06:16 AM
2 Kudos
No, your data stays out on the cluster. What happens to it depends on what you want to do with it. For example, if you want to save it to HDFS, you simply call saveAsHadoopFiles(). This writes the distributed data to the distributed file system. You do not, in general, pull back data to the driver, and certainly not a whole large data set. In general it's the cluster doing the work, not the driver.
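A sketch of what that looks like in a streaming app (the host, port, and HDFS path are placeholders): each batch is written as a directory of part files directly by the executors, and nothing is collected back to the driver.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("SaveToHdfs")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("somehost", 9999)
// Workers write each batch straight to HDFS; for key-value DStreams,
// saveAsHadoopFiles() works the same way.
lines.saveAsTextFiles("hdfs:///user/me/stream/out")

ssc.start()
ssc.awaitTermination()
```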
09-05-2014
05:13 AM
2 Kudos
Ah, if you just want to see a bit of the data, try something like .take(10).foreach(println). Data is already distributed by virtue of being in HDFS. Spark will send computation to the workers. So it's all inherently distributed. The exceptions are methods whose explicit purpose is to return data to the driver, like collect(). You don't need to tell Spark to keep data in memory or not. It will manage without any intervention. However, you can call methods like .cache() to explicitly save the RDD's state into blocks in memory and break its lineage. (You can do the same and put it on disk, or in a combination of disk and memory.) This is appropriate when you are reusing an RDD many times, but otherwise not necessary for you to manage.
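Putting those pieces together as a sketch (this assumes a spark-shell session, where `sc` is predefined, and a placeholder HDFS path):

```scala
val data = sc.textFile("hdfs:///user/me/input")

// Peek at a few records: only 10 rows come back to the driver.
data.take(10).foreach(println)

// collect() pulls the *entire* RDD back to the driver; avoid it on
// large data sets.
// val everything = data.collect()

// Optional: keep the RDD's blocks in memory for reuse across actions.
import org.apache.spark.storage.StorageLevel
data.cache()                                   // memory only
// data.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk if needed
```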