Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2984 | 01-26-2018 04:02 AM |
| | 6277 | 12-22-2017 09:18 AM |
| | 3018 | 12-05-2017 06:13 AM |
| | 3279 | 10-16-2017 07:55 AM |
| | 9297 | 10-04-2017 08:08 PM |
04-19-2017
06:24 AM
You should probably use 2.4.0-SNAPSHOT, but also use Spark 2.1.0 rather than 2.0.x. The last error is one I just managed to find and fix today while producing some other updates. Try again with the very latest code.
04-18-2017
10:36 AM
1 Kudo
First, I'll tell you that this is Quite Complicated and confuses me too. Matching Spark and Kafka versions is tricky, exacerbated by multiple and incompatible Kafka APIs, multiplied by slight differences in which versions ship in which CDH package. (Yes, there is no 2.4 yet; I put it in there as a 'preview'.) I recall that I was actually not able to get the 2.3 release to pass tests against upstream components, which is why it builds with the CDH profile enabled by default. I wanted to move on to Kafka 0.9 to enable security support, but that is supported by Spark 2.x's kafka-0_10 integration component, which wasn't yet usable for CDH because it didn't work with CDH's Kafka 0.9. The kafka-0_8 component did work, but then that component didn't work when enabled with the standard Spark 2 distro. This is a nightmarish no-man's-land of version combos. However, master (2.4) should be clearer, since it moves on to Kafka 0.10. It does work, or at least passes tests, against Spark 2.1 and Kafka 0.10. In fact, I have a to-do to update the CDH dependencies too to get it working in master. So: if you're making your own build for non-CDH components, can you try building the 2.4 SNAPSHOT from master? If that's working, I can hurry up getting the CDH part updated so we can cut a 2.4.0 release.
04-01-2017
03:16 AM
1 Kudo
Don't set the heap size this way; use --driver-memory. The error indicates you are actually setting a max heap smaller than the driver memory configured elsewhere, perhaps in a .conf file.
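As a rough sketch of what that looks like (the 4g value and the class/jar names are placeholders, not taken from your setup), pass the memory to spark-submit rather than setting -Xmx yourself:

```
# Placeholder app and memory size; --driver-memory sizes the driver JVM heap
# instead of a hand-set -Xmx flag
spark-submit \
  --class com.example.MyApp \
  --driver-memory 4g \
  my-app.jar
```

If spark.driver.memory is also set in spark-defaults.conf, keep it consistent with this flag rather than fighting it with a manual JVM option.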
03-31-2017
02:39 PM
1 Kudo
It has nothing to do with Spark. This is the kind of error you get when HDFS is not working. It may be unable to start, still starting, or having some other problem.
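If it helps, a quick way to confirm HDFS health from a shell on the cluster (standard hdfs commands, nothing specific to your setup):

```
# Is the NameNode up and out of safe mode, and are DataNodes reporting in?
hdfs dfsadmin -safemode get
hdfs dfsadmin -report
```

If those fail or show no live DataNodes, fix HDFS first; Spark will behave once it can reach the filesystem.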
03-28-2017
10:28 PM
I think the recommended way to manage this, without installing and maintaining Anaconda yourself on every node, is to use the Anaconda-based parcel for CDH, which lays down a baseline version of dependencies like numpy and should plumb in the necessary configuration to use it.
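In case the configuration isn't picked up automatically, here's a hedged sketch of pointing PySpark at the parcel's interpreter; the parcel path below is the usual default location, but check where the parcel is actually activated on your cluster:

```
# Assumed default parcel location; adjust to your installation
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python
pyspark
```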
03-18-2017
03:23 AM
This is generally due to a mismatch between the version of commons-lang3 your application uses and the one Spark uses. See https://issues.apache.org/jira/browse/ZEPPELIN-1977 for an example. I believe you'll find that it's resolved in the latest Spark 2 release for CDH: http://community.cloudera.com/t5/Community-News-Release/ANNOUNCE-Spark-2-0-Release-2/m-p/51464#M161
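If you need a workaround in your own build, one option is to pin commons-lang3 to whatever your Spark runtime actually ships. A sketch for an sbt build (the 3.5 version number is an assumption based on Spark 2.1; verify against the Spark jars on your cluster):

```scala
// build.sbt (sketch): force commons-lang3 to match the cluster's Spark runtime,
// avoiding NoSuchMethodError-style clashes between your jar and Spark's classpath.
dependencyOverrides += "org.apache.commons" % "commons-lang3" % "3.5"
```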
03-13-2017
05:38 AM
Yes, the thrift server isn't shipped or supported, in part because it doesn't work with the later Hive shipped by CDH. I don't think it will build with this profile enabled.
03-03-2017
08:40 AM
It's likely a problem then in how you are loading the streaming data -- for example, you're bottlenecked on reading it. There are a lot of unknowns here. You can look at the streaming UI stats to get more information on what is taking so long.
03-03-2017
07:44 AM
The major problem here is that you are making a client for every single data element. That's incredibly slow. Make one client per RDD partition. You need rdd.foreachPartition instead.
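A minimal sketch of that pattern (createClient and send are stand-ins for whatever client library you're actually using, not real API names):

```scala
// Sketch: one client per partition, reused for every record in that partition,
// instead of one client per element.
rdd.foreachPartition { partition =>
  val client = createClient()          // hypothetical factory for your client
  try {
    partition.foreach(record => client.send(record))  // hypothetical send call
  } finally {
    client.close()                     // release the connection once per partition
  }
}
```

If the client is expensive even per partition, the next step is usually a connection pool shared per executor, but foreachPartition alone is already a huge improvement over per-record creation.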
02-16-2017
06:23 AM
No, you shouldn't do this. Spark 2 has been GA for CDH for a while. Use the official Spark 2 CSD.