Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3454 | 01-26-2018 04:02 AM |
| | 7090 | 12-22-2017 09:18 AM |
| | 3538 | 12-05-2017 06:13 AM |
| | 3858 | 10-16-2017 07:55 AM |
| | 11233 | 10-04-2017 08:08 PM |
08-19-2015
12:45 AM
If you have a support contract, you would need to contact Cloudera Support if you believe it's a problem. I have successfully added Spark Gateway nodes after a cluster is live without issues, though, so I suspect something else is at work here.
08-18-2015
07:30 AM
It sounds like you want one process, not two, if the two phases are so tightly coupled. Also consider using a message queue like Kafka with Spark Streaming to process the output of one job in another in near-real-time. I would not over-complicate it. Tachyon is also an option, but as far as I know it's not finished or completely integrated with Spark, and I don't know if it will be.
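As a minimal Spark Streaming sketch of that idea, assuming the first job writes to a Kafka topic (the topic name, ZooKeeper address, consumer group, and batch interval below are all illustrative, not from the original question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("phase-two")
val ssc  = new StreamingContext(conf, Seconds(10))

// Consume the first job's output from a (hypothetical) topic "phase-one-output"
val lines = KafkaUtils.createStream(
    ssc, "zk-host:2181", "phase-two-group", Map("phase-one-output" -> 1))
  .map(_._2)  // keep only the message value

lines.foreachRDD { rdd =>
  // second-phase processing goes here
  rdd.take(5).foreach(println)
}

ssc.start()
ssc.awaitTermination()
```

This keeps the two phases as separate jobs while the second one reacts to new output within each batch interval, rather than waiting for the first job to finish entirely.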
08-18-2015
06:53 AM
1 Kudo
An RDD is bound to an application, so it can't be shared across apps. You simply persist the data (e.g. on HDFS) and read it from the other app as an RDD. I know people think that is slow, or slower than somehow sharing an RDD, but it isn't if you consider what's necessary to maintain fault tolerance across apps: you'd still be persisting something somewhere besides memory. And HDFS caching can make much of the reading from HDFS an in-memory operation anyway.
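A sketch of that pattern (the path and record type are illustrative placeholders):

```scala
// In application A: persist the computed RDD to HDFS
resultRDD.saveAsObjectFile("hdfs:///shared/results")

// In application B, with its own SparkContext: read it back as an RDD
val shared = sc.objectFile[MyRecord]("hdfs:///shared/results")
```

For text data, `saveAsTextFile` / `sc.textFile` work the same way; with HDFS caching enabled on the shared path, the second read is largely served from memory.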
08-06-2015
02:36 AM
I don't think it has to do with functional programming per se, but yes, it's because the function/code being executed has to be sent from the driver to the executors, and so the function object itself must be serializable. It has no relation to security.
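A sketch of how this surfaces in practice (the `Helper` class is hypothetical):

```scala
// Not Serializable: capturing an instance of this in a closure fails
class Helper { def adjust(x: Int): Int = x + 1 }

val helper = new Helper
rdd.map(x => helper.adjust(x))  // fails: java.io.NotSerializableException,
                                // since the closure (and helper) must be
                                // shipped to the executors

// Fix: make the captured object serializable
class SafeHelper extends Serializable { def adjust(x: Int): Int = x + 1 }
```

Alternatively, construct the non-serializable object inside the task (e.g. with `mapPartitions`) so it never has to cross the wire.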
08-05-2015
11:05 AM
If you call persist() on an RDD, it means the data in the RDD will be persisted, but only later, when something causes it to be computed for the first time. It is not evaluated immediately.
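For example (the input path is illustrative):

```scala
// persist() only marks the RDD for caching; no work happens yet
val cached = sc.textFile("hdfs:///data/input").persist()

val n = cached.count()  // first action: data is read AND cached now
val m = cached.count()  // second action: served from the cache
```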
07-28-2015
10:59 PM
That I don't know. There should be something in the logs at startup, and that should be available pretty soon. I would expect you can see the logs with that command. It could be some other issue with the ports and so on, but then I think you'd see errors from YARN that it can't get to the AM container or something.
07-28-2015
01:33 PM
You can background the spark-submit process like any other Linux process by putting it into the background in the shell. In your case, spark-submit then actually runs the driver on YARN, so it's baby-sitting a process that's already running asynchronously on another machine via YARN. Running is good; it means all is well. You can redirect this log output wherever you like. In yarn-cluster mode, killing the driver will cause YARN to restart it; you really want to kill the spark-submit process. I don't know why you don't see logs. Try browsing to the Spark UI of the driver to see what's happening.
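One way to background it, as a sketch (the class name, jar, and log file are illustrative placeholders):

```shell
# Background spark-submit and redirect its log output to a file.
nohup spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  myapp.jar \
  > spark-submit.log 2>&1 &

# The shell prints the background PID; kill that PID to stop the
# spark-submit watcher process itself.
```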
07-27-2015
01:41 PM
2 Kudos
I suspect it's some issue with the version of tar you have on your system? BSD vs. GNU? Just a guess. That, or maybe a corrupted file? The latest rmr2 archive decompressed fine for me on OS X. https://github.com/RevolutionAnalytics/rmr2/releases
07-27-2015
04:16 AM
The first case is: read - shuffle - persist - count. The second case is: read (from the persisted copy) - count. You are right that coalesce does not always shuffle, but it may in this case; it depends on whether you started with more or fewer partitions. You should look at the Spark UI to see whether a shuffle occurred.
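To make the shuffle behavior concrete, a small sketch (partition counts are illustrative):

```scala
// Reducing the partition count: coalesce can merge partitions
// locally, so no shuffle is needed
val narrowed = rdd.coalesce(4)

// Increasing the partition count (or passing shuffle = true)
// forces a full shuffle
val widened = rdd.coalesce(400, shuffle = true)

// repartition(400) is shorthand for coalesce(400, shuffle = true)
val same = rdd.repartition(400)
```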
07-26-2015
11:53 PM
Hm, is that surprising? You described why it is faster in your message: the second time, "result" does not have to be recomputed, since it is available on disk. It is the result of a potentially expensive shuffle operation (coalesce).