Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3454 | 01-26-2018 04:02 AM |
| | 7090 | 12-22-2017 09:18 AM |
| | 3538 | 12-05-2017 06:13 AM |
| | 3858 | 10-16-2017 07:55 AM |
| | 11233 | 10-04-2017 08:08 PM |
08-19-2015
12:45 AM
If you have a support contract, you would need to contact Cloudera Support if you believe it's a problem. I have successfully added Spark Gateway nodes after a cluster is live without issues, though, so I suspect something else is at work here.
08-18-2015
07:30 AM
It sounds like you want one process, not two, if the two phases are so tightly coupled. Also consider using a message queue like Kafka with Spark Streaming to process the output of one job in another in near-real-time. I would not over-complicate it. Tachyon is also an option, but as far as I know it's not finished or completely integrated with Spark, and I don't know if it will be.
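As a minimal Spark Streaming sketch of that idea, assuming the first job writes to a Kafka topic (the topic name, ZooKeeper address, consumer group, and batch interval below are all illustrative, not from the original question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("phase-two")
val ssc  = new StreamingContext(conf, Seconds(10))

// Consume the first job's output from a (hypothetical) topic "phase-one-output"
val lines = KafkaUtils.createStream(
    ssc, "zk-host:2181", "phase-two-group", Map("phase-one-output" -> 1))
  .map(_._2)  // keep only the message value

lines.foreachRDD { rdd =>
  // second-phase processing goes here
  rdd.take(5).foreach(println)
}

ssc.start()
ssc.awaitTermination()
```

This keeps the two phases as separate jobs while the second one reacts to new output within each batch interval, rather than waiting for the first job to finish entirely.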
08-18-2015
06:53 AM
1 Kudo
An RDD is bound to an application, so it can't be shared across apps. You simply persist the data (e.g. on HDFS) and read it from the other app as an RDD. I know people think that is slow, or slower than somehow sharing an RDD, but it isn't if you consider what's necessary to maintain fault tolerance across apps: you'd still be persisting something somewhere besides memory. And HDFS caching can make much of the reading from HDFS an in-memory operation anyway.
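A sketch of that pattern (the path and record type are illustrative placeholders):

```scala
// In application A: persist the computed RDD to HDFS
resultRDD.saveAsObjectFile("hdfs:///shared/results")

// In application B, with its own SparkContext: read it back as an RDD
val shared = sc.objectFile[MyRecord]("hdfs:///shared/results")
```

For text data, `saveAsTextFile` / `sc.textFile` work the same way; with HDFS caching enabled on the shared path, the second read is largely served from memory.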
08-06-2015
02:36 AM
I don't think it has to do with functional programming per se, but yes, it's because the function/code being executed has to be sent from the driver to the executors, and so the function object itself must be serializable. It has no relation to security.
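A sketch of how this surfaces in practice (the `Helper` class is hypothetical):

```scala
// Not Serializable: capturing an instance of this in a closure fails
class Helper { def adjust(x: Int): Int = x + 1 }

val helper = new Helper
rdd.map(x => helper.adjust(x))  // fails: java.io.NotSerializableException,
                                // since the closure (and helper) must be
                                // shipped to the executors

// Fix: make the captured object serializable
class SafeHelper extends Serializable { def adjust(x: Int): Int = x + 1 }
```

Alternatively, construct the non-serializable object inside the task (e.g. with `mapPartitions`) so it never has to cross the wire.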
08-05-2015
11:05 AM
If you call persist() on an RDD, it means the data in the RDD will be persisted, but only later, when something causes it to be computed for the first time. It is not evaluated immediately.
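For example (the input path is illustrative):

```scala
// persist() only marks the RDD for caching; no work happens yet
val cached = sc.textFile("hdfs:///data/input").persist()

val n = cached.count()  // first action: data is read AND cached now
val m = cached.count()  // second action: served from the cache
```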
07-28-2015
10:59 PM
That I don't know. There should be something in the logs at startup, and that should be available pretty soon. I would expect you can see the logs with that command. It could be some other issue with the ports and so on, but then I think you'd see errors from YARN that it can't get to the AM container or something.
07-28-2015
01:33 PM
You can background the spark-submit process like any other Linux process by putting it into the background in the shell. In your case, spark-submit then actually runs the driver on YARN, so it's baby-sitting a process that's already running asynchronously on another machine via YARN. Running is good; it means all is well. You can redirect this log output wherever you like. In yarn-cluster mode, killing the driver will cause YARN to restart it; you really want to kill the spark-submit process. I don't know why you don't see logs. Try browsing to the Spark UI of the driver to see what's happening.
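One way to background it, as a sketch (the class name, jar, and log file are illustrative placeholders):

```shell
# Background spark-submit and redirect its log output to a file.
nohup spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  myapp.jar \
  > spark-submit.log 2>&1 &

# The shell prints the background PID; kill that PID to stop the
# spark-submit watcher process itself.
```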
07-27-2015
01:41 PM
2 Kudos
I suspect it's some issue with the version of tar you have on your system? BSD vs. GNU? Just a guess. That, or maybe a corrupted file? The latest rmr2 archive decompressed fine for me on OS X. https://github.com/RevolutionAnalytics/rmr2/releases
07-27-2015
04:16 AM
The first case is: read - shuffle - persist - count. The second case is: read (from the persisted copy) - count. You are right that coalesce does not always shuffle, but it may in this case; it depends on whether you started with more or fewer partitions. You should look at the Spark UI to see whether a shuffle occurred.
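To make the shuffle behavior concrete, a small sketch (partition counts are illustrative):

```scala
// Reducing the partition count: coalesce can merge partitions
// locally, so no shuffle is needed
val narrowed = rdd.coalesce(4)

// Increasing the partition count (or passing shuffle = true)
// forces a full shuffle
val widened = rdd.coalesce(400, shuffle = true)

// repartition(400) is shorthand for coalesce(400, shuffle = true)
val same = rdd.repartition(400)
```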
07-26-2015
11:53 PM
Hm, is that surprising? You described why it is faster in your message: the second time, "result" does not have to be recomputed, since it is available on disk. It is the result of a potentially expensive shuffle operation (coalesce).