Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4976 | 03-09-2016 01:21 AM |
| | 4245 | 03-07-2016 01:52 AM |
| | 13347 | 02-29-2016 04:40 AM |
| | 3966 | 02-22-2016 03:08 PM |
| | 4948 | 01-19-2016 02:13 PM |
12-16-2015 02:46 AM
First, you'd have to define what you're trying to "benchmark". I don't think these distributions vary in speed; they differ mainly in the components they bundle around the same core. Comparing them on speed alone is a bit like choosing a car solely by its max RPM, even if that's something that matters to you.
10-09-2015 01:18 AM
One quick question -- are you running on Windows?
09-30-2015 11:46 AM
It's possible to use a static Executor in your code and run multi-threaded operations within each function call, though that may not be efficient. If your goal is simply full utilization of cores, first make sure you have enough executors, with enough cores, running to use your whole cluster. Then make sure your number of partitions is at least that large. At that point each operation can stay single-threaded.
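As a rough sketch (the executor and core counts here are placeholders for whatever your cluster actually provides, as is the input path):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder numbers: size these to what your cluster actually has.
val conf = new SparkConf()
  .setAppName("FullUtilization")
  .set("spark.executor.instances", "4") // 4 executors...
  .set("spark.executor.cores", "4")     // ...with 4 cores each = 16 task slots
val sc = new SparkContext(conf)

// With 16 slots, use at least 16 partitions so every core has work;
// each task can then remain single-threaded.
val data = sc.textFile("hdfs:///path/to/input")
val sized = if (data.partitions.size < 16) data.repartition(16) else data
```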
09-30-2015 09:42 AM
You can't use RDDs inside functions that execute remotely on another RDD, which may be what you're doing. Otherwise, I'm not clear on what you're executing. I suspect you're doing something that doesn't work in general in Spark, but happens to work when executing locally in one JVM.
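To illustrate what I mean (a minimal sketch; the RDD names and data are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("NestedRddExample"))
val rdd   = sc.parallelize(1 to 1000000)
val other = sc.parallelize(Seq(1, 42, 99)) // a small lookup dataset

// Does NOT work in general: 'other' is an RDD referenced inside a function
// that runs remotely on executors, but RDDs only exist on the driver.
// It may appear to work in local mode because everything shares one JVM.
// val bad = rdd.filter(x => other.collect().contains(x))

// Works: bring the small dataset to the driver and broadcast it instead.
val lookup = sc.broadcast(other.collect().toSet)
val good = rdd.filter(x => lookup.value.contains(x))
```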
09-28-2015 09:07 AM
You won't be able to read a local file with this code; you're still trying to read from the classpath. That call would also have to change in order to read a file locally.
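For example (a sketch; the file names are placeholders):

```scala
import scala.io.Source

// Reads from the classpath: this looks inside the jar / classpath entries,
// not the local filesystem.
val fromClasspath = getClass.getResourceAsStream("/config.txt")

// Reads from the local filesystem instead.
val fromLocalFile = Source.fromFile("/local/path/config.txt").mkString
```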
09-27-2015 03:09 AM
Here, you're using your own build of Spark, compiled against an older version of Hive than what's in CDH. That might mostly work, but you're seeing the problems that come from compiling against one version and running against another. I'm afraid you're on your own if you're rolling your own build, but I expect you'll get much closer if you make a build targeting the same Hive version that's in CDH.
09-25-2015 09:47 AM
The relationship of jars and classloaders may not be the same as in local mode, so this may not work as expected. Instead of depending on it, consider either distributing your file via HDFS, or using the --files option with Spark to distribute files to local disk: http://spark.apache.org/docs/latest/running-on-yarn.html
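For instance, something along these lines (the file name is a placeholder, and the spark-submit line in the comment is only indicative):

```scala
import org.apache.spark.SparkFiles
import scala.io.Source

// Assumes the job was submitted with something like:
//   spark-submit --files /local/path/config.txt ... yourApp.jar
// Spark copies the file into each node's local working directory, and
// SparkFiles.get resolves its local path there.
val localPath = SparkFiles.get("config.txt")
val contents = Source.fromFile(localPath).mkString
```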
09-22-2015 06:06 AM
1 Kudo
I remember problems like this with Snappy and HBase: somehow an older version used by HBase took precedence in the app classloader, and then it couldn't load properly because it couldn't see the shared native library in the parent classloader. This may be a manifestation of that. There are certainly cases where the conflict has no resolution: an app and Spark may use mutually incompatible versions of a dependency, and one will break the other as long as the Spark and app classloaders are connected, no matter their ordering. For this toy example, you'd just not set the classpath setting, since it isn't needed. For your app, if neither ordering works, your options are probably to harmonize library versions with Spark, or to shade your copy of the library.
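For reference, the setting in question looks like this (a sketch; both properties default to false, which is what you'd want for the toy example):

```scala
import org.apache.spark.SparkConf

// Only set these to "true" when you genuinely need your app's copy of a
// library to win over Spark's; otherwise leave them at the default.
val conf = new SparkConf()
  .setAppName("ClasspathExample")
  .set("spark.driver.userClassPathFirst", "false")
  .set("spark.executor.userClassPathFirst", "false")
```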
09-22-2015 05:49 AM
Hm, but have you modified classpath.txt? IIRC, the last time I saw this it was some strange problem between the Snappy from HBase and the one used by other things like Spark. Does it work without the userClassPathFirst arg? Just trying to narrow it down. Turning on that flag always leads into problem territory, but this is a simple example with no obvious reason it shouldn't work.
09-22-2015 04:30 AM
That's a different type of conflict. It looks like you somehow have a different version of Snappy in your app classpath. You aren't including Spark/Hadoop in your app jar, right? The Spark assembly only contains Hadoop jars if it's built that way, and in a CDH cluster that's not a good idea, since the cluster already has its own copy of the Hadoop artifacts. It's built as 'hadoop-provided', and the classpath then contains the Hadoop jars and dependencies, plus Spark's. Modifying this means modifying the distribution for all applications; it may or may not work with the rest of CDH, and may or may not work with other apps. These modifications aren't supported, though you can try whatever you want if you're OK with 'voiding the warranty', so to speak. Spark classpath issues are tricky in general, not just in CDH, since Spark uses a lot of libraries and doesn't shade most of them. Yes, you can try shading your own copies as a fall-back if the classpath-first args don't work, but you might first double-check what you're actually pulling in.
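If you do go the shading route, here's a rough sketch using sbt-assembly (this assumes an sbt build; Maven's shade plugin has an equivalent 'relocation' feature, and the package names below are placeholders for whatever dependency actually conflicts):

```scala
// build.sbt fragment; requires the sbt-assembly plugin.
// Relocates your copy of the conflicting library under a new package so it
// can't collide with the version already on the cluster classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.example.conflictinglib.**" -> "shaded.conflictinglib.@1").inAll
)
```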