Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4143 | 03-09-2016 01:21 AM |
| | 3789 | 03-07-2016 01:52 AM |
| | 12228 | 02-29-2016 04:40 AM |
| | 3448 | 02-22-2016 03:08 PM |
| | 4464 | 01-19-2016 02:13 PM |
09-12-2014
06:49 AM
It will make a difference insofar as the driver program will run either out on the cluster (yarn-cluster) or locally (yarn-client). The same issue remains -- the processes need to talk to each other on certain ports. But it affects where the driver is and that affects what machine's ports need to be open. For example, if your ports are all open within your cluster, I expect that yarn-cluster works directly.
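As a sketch, here is the same app submitted in the two modes (the jar and class names are placeholders, and this assumes the Spark 1.x `yarn-client` / `yarn-cluster` master syntax):

```shell
# yarn-client: the driver runs on the machine you submit from, so the
# cluster nodes must be able to reach that machine's ports.
spark-submit --master yarn-client --class com.example.MyApp myapp.jar

# yarn-cluster: the driver runs inside the cluster as the YARN
# application master, so only intra-cluster ports need to be open.
spark-submit --master yarn-cluster --class com.example.MyApp myapp.jar
```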
09-12-2014
06:23 AM
I believe it was added in 1.1, yes. I don't have a streaming app driver handy, so maybe double-check -- you will see an obvious Streaming tab if it's there. Without guaranteeing anything, I think the next CDH will have 1.1, and at any time you can run your own Spark jobs with any version under YARN.
09-12-2014
04:42 AM
Yes, there is a special Streaming tab in the latest Spark driver UI.
09-12-2014
03:30 AM
The code here doesn't do work, really. It sets up and configures work: it expresses where data comes from, how it is transformed, and where it goes. No work is done until ssc.start(), so timing the code before that doesn't help. You can already see some timing information in the Spark Streaming UI. You can try computing timing within the functions, since that will time them at the time of execution. However, even methods like .join() called inside transform() are themselves transformations that don't do work immediately, so it would not help to time that one. Actions like foreach would make sense to time. Really, I would start by looking at Spark's built-in timing metrics.
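As a sketch of the distinction, here is a small timing helper in plain Scala (no Spark needed to compile it); the commented example assumes you have an RDD called `rdd` in scope:

```scala
// Minimal timing helper: evaluate a block, report elapsed time, return
// the block's result.
object Timing {
  def time[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$label took $elapsedMs%.1f ms")
    result
  }
}

// In a Spark app, wrap an *action*, since that is where work happens:
//   Timing.time("count") { rdd.count() }
// Wrapping a transformation like rdd.map(...) would only time the
// near-instant setup, not the actual computation.
```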
09-12-2014
12:45 AM
1 Kudo
Have a look at: https://spark.apache.org/docs/latest/configuration.html#networking I think you are interested in pinning the driver and executor ports to fixed values, rather than letting them be chosen randomly. The same goes for the UI ports, if you're interested in those.
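For example, a spark-defaults.conf fragment pinning some of those ports might look like this (property names are from the Spark 1.x networking configuration; the port numbers are arbitrary placeholders):

```
spark.driver.port        7001
spark.fileserver.port    7002
spark.broadcast.port     7003
spark.blockManager.port  7005
spark.ui.port            4040
```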
09-11-2014
08:54 AM
On its face it means what it says: the master is unable to talk to the worker. I would check your firewall rules and make sure these machines can talk to each other on these ports. Spark picks ephemeral ports, so you may have to open ranges.
09-11-2014
03:13 AM
It has nothing to do with Akka per se. This says your jobs are failing. You would have to look at the logs on the workers to understand why.
09-10-2014
08:14 AM
I think you imported just about everything except the one thing you need: the implicit conversions that unlock the functions in PairRDDFunctions, which is where join() is defined. You need: import org.apache.spark.SparkContext._ In the shell, this is imported by default.
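A minimal sketch of a standalone app where that import makes join() compile (the object and app names are illustrative; this targets the Spark 1.x API):

```scala
import org.apache.spark.{SparkConf, SparkContext}
// This import brings the implicit conversion to PairRDDFunctions into
// scope; without it, join() on an RDD of pairs does not compile.
import org.apache.spark.SparkContext._

object JoinExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JoinExample"))
    val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
    val right = sc.parallelize(Seq((1, "x"), (3, "y")))
    // Compiles only because SparkContext._ is in scope.
    left.join(right).collect().foreach(println)
    sc.stop()
  }
}
```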
09-05-2014
06:16 AM
2 Kudos
No, your data stays out on the cluster. What happens to it depends on what you want to do with it. For example, if you want to save it to HDFS, you simply call saveAsHadoopFiles(). This writes the distributed data to the distributed file system. You do not, in general, pull back data to the driver, and certainly not a whole large data set. In general it's the cluster doing the work, not the driver.
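A sketch of what that looks like in a streaming app (the host, port, and HDFS path are placeholders): each batch is written as a directory of part files directly by the executors, and nothing is collected back to the driver.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("SaveToHdfs")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("somehost", 9999)
// Workers write each batch straight to HDFS; for key-value DStreams,
// saveAsHadoopFiles() works the same way.
lines.saveAsTextFiles("hdfs:///user/me/stream/out")

ssc.start()
ssc.awaitTermination()
```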
09-05-2014
05:13 AM
2 Kudos
Ah, if you just want to see a bit of the data, try something like .take(10).foreach(println). Data is already distributed by virtue of being in HDFS. Spark will send computation to the workers. So it's all inherently distributed. The exceptions are methods whose explicit purpose is to return data to the driver, like collect(). You don't need to tell Spark to keep data in memory or not. It will manage without any intervention. However, you can call methods like .cache() to explicitly save the RDD's state into blocks in memory and break its lineage. (You can do the same and put it on disk, or in a combination of disk and memory.) This is appropriate when you are reusing an RDD many times, but otherwise not necessary for you to manage.
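Putting those pieces together as a sketch (this assumes a spark-shell session, where `sc` is predefined, and a placeholder HDFS path):

```scala
val data = sc.textFile("hdfs:///user/me/input")

// Peek at a few records: only 10 rows come back to the driver.
data.take(10).foreach(println)

// collect() pulls the *entire* RDD back to the driver; avoid it on
// large data sets.
// val everything = data.collect()

// Optional: keep the RDD's blocks in memory for reuse across actions.
import org.apache.spark.storage.StorageLevel
data.cache()                                   // memory only
// data.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk if needed
```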