Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3030 | 01-26-2018 04:02 AM |
| | 6377 | 12-22-2017 09:18 AM |
| | 3062 | 12-05-2017 06:13 AM |
| | 3321 | 10-16-2017 07:55 AM |
| | 9497 | 10-04-2017 08:08 PM |
04-19-2017
06:24 AM
You should probably use 2.4.0-SNAPSHOT, but also use Spark 2.1.0 rather than 2.0.x. The last error is one I just managed to find and fix today when I produced some other updates. Try again with the very latest code.
04-18-2017
10:36 AM
1 Kudo
First, I'll tell you that this is Quite Complicated and confuses me too. Matching Spark and Kafka versions is tricky, exacerbated by multiple and incompatible Kafka APIs, multiplied by slight differences in which versions are shipped in which CDH package. (Yes, there is no 2.4 yet; I put it in there as a 'preview'.) I recall that I was actually not able to get the 2.3 release to pass tests with upstream components, and that's why it builds with the CDH profile enabled by default.

I wanted to move on to Kafka 0.9 to enable security support. But this is supported by Spark 2.x's kafka-0_10 integration component, and that wasn't yet available for CDH because it didn't work with the CDH Kafka 0.9. The kafka-0_8 component did work, but then that component didn't work when enabled with the standard Spark 2 distro. This is a nightmarish no-man's-land of version combos.

However, master (2.4) should be clearer since it moves on to Kafka 0.10. It does work, or at least passes tests, against Spark 2.1 and Kafka 0.10. In fact, I have a to-do to update the CDH dependencies too to get it working in master. So: if you're making your own build for non-CDH components, can you try building 2.4 SNAPSHOT from master? If that's working, I can hurry up getting the CDH part updated so we can cut a 2.4.0 release.
04-01-2017
03:16 AM
1 Kudo
Don't set heap size this way. Use --driver-memory. The error indicates that you are setting the max heap to less than the driver memory configured elsewhere, perhaps in a .conf file.
03-31-2017
02:39 PM
1 Kudo
It has nothing to do with Spark. This is the kind of error you get when HDFS is not working. It may be unable to start, still starting, or having some other problem.
03-28-2017
10:28 PM
I think the recommended way to manage this without using Anaconda is to use the Anaconda-based parcel for CDH, which will lay down a basic version of dependencies like numpy and should plumb the necessary configuration to use that.
03-18-2017
03:23 AM
This is generally due to a difference between the version of commons-lang3 you use and the one Spark uses. See https://issues.apache.org/jira/browse/ZEPPELIN-1977 for example. I believe you'll find that it's resolved in the latest Spark 2 release for CDH. http://community.cloudera.com/t5/Community-News-Release/ANNOUNCE-Spark-2-0-Release-2/m-p/51464#M161
03-13-2017
05:38 AM
Yes, the thrift server isn't shipped or supported, in part because it doesn't work with the later Hive shipped by CDH. I don't think it will build with this profile enabled.
03-03-2017
08:40 AM
It's likely a problem, then, in how you are loading the streaming data; for example, the job may be bottlenecked on reading it. There are a lot of unknowns here. You can look at the streaming UI stats to get more information on what is taking a while.
03-03-2017
07:44 AM
The major problem here is that you are making a client for every single data element. That's incredibly slow. Make one client per RDD partition. You need rdd.foreachPartition instead.
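A minimal sketch of the per-partition pattern, in Python. The original code from the thread isn't shown here, so `FakeClient` and `send_partition` are hypothetical names standing in for whatever client is really being created; only `foreachPartition` and `foreachRDD` (mentioned in the comments) are actual Spark calls.

```python
from typing import Iterable, List


class FakeClient:
    """Hypothetical stand-in for an expensive client (database, HTTP, queue)."""

    instances_created = 0  # counts constructions, to show how often we connect

    def __init__(self) -> None:
        FakeClient.instances_created += 1
        self.sent: List[str] = []
        self.closed = False

    def send(self, record: str) -> None:
        self.sent.append(record)

    def close(self) -> None:
        self.closed = True


def send_partition(records: Iterable[str]) -> None:
    """Handle one partition: create one client and reuse it for every record."""
    client = FakeClient()
    try:
        for record in records:
            client.send(record)
    finally:
        # Release the client even if sending a record fails.
        client.close()


# With Spark, this function is passed to foreachPartition, which calls it once
# per partition with an iterator over that partition's elements:
#
#     rdd.foreachPartition(send_partition)
#
# or, for a DStream:
#
#     dstream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))
```

Because the client is constructed inside the function, it is created on the executor that processes the partition rather than serialized from the driver, and the number of clients scales with the number of partitions instead of the number of elements.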
02-16-2017
06:23 AM
No, you shouldn't do this. Spark 2 has been GA for CDH for a while. Use the official Spark 2 CSD.