Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2726 | 01-26-2018 04:02 AM |
|  | 5725 | 12-22-2017 09:18 AM |
|  | 2704 | 12-05-2017 06:13 AM |
|  | 3001 | 10-16-2017 07:55 AM |
|  | 8326 | 10-04-2017 08:08 PM |
08-07-2016
10:02 PM
The first operation turns each value into a set containing just that single value. ++ concatenates collections, so it combines the elements of both sets. Together, this builds up a set of all values for each key, which can be written more simply as groupByKey; even then the code could be more compact and efficient. See the sketch below.
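A minimal spark-shell sketch of both forms, assuming an RDD of key-value pairs (the data here is illustrative):

```scala
// Illustrative pair RDD; assumes a spark-shell session where sc is defined
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// The pattern described above: wrap each value in a one-element Set,
// then merge the sets per key with ++
val valueSets = pairs
  .mapValues(v => Set(v))
  .reduceByKey((s1, s2) => s1 ++ s2)

// Roughly equivalent, and simpler:
val grouped = pairs.groupByKey().mapValues(_.toSet)
```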
07-22-2016
12:49 AM
Here you just ran the plain Scala shell. You have to use spark-shell to use Spark.
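A quick illustration of the difference: spark-shell predefines a SparkContext that the plain Scala REPL does not have.

```scala
// Works in spark-shell, where sc (a SparkContext) is predefined;
// in the plain `scala` REPL, sc is not defined and this fails to compile
sc.parallelize(1 to 10).sum()
```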
07-05-2016
01:45 PM
Installing Anaconda doesn't make PySpark use it; you would have to tell PySpark to do so (typically by pointing the PYSPARK_PYTHON environment variable at the desired Python binary). I was referring to the Anaconda parcel for CDH, which does that setup for you, not the generic Anaconda distribution.
07-05-2016
10:00 AM
The simplest explanation is that pandas isn't installed, of course; it's not part of the Python standard library. Consider using the Anaconda parcel to lay down a Python distribution for use with PySpark that contains many commonly used packages like pandas.
06-27-2016
09:28 AM
1 Kudo
I have a guess: you need to make each of those things a separate arg tag. I don't know Oozie well myself, but something similar is needed in Maven config files. That is, it may be reading this as one argument, "-xm mapreduce", rather than two; see the sketch below.
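Something like this, though the surrounding action XML is assumed here, not taken from your workflow:

```xml
<!-- One argument per element: -->
<arg>-xm</arg>
<arg>mapreduce</arg>
<!-- rather than a single <arg>-xm mapreduce</arg> -->
```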
06-13-2016
03:46 PM
Yes, it would be. The execution of transformations and actions is the same; only the source is different.
06-13-2016
08:13 AM
You could always use a trivial case class to contain either the desired output or a complete error description, and process accordingly. Downstream processing would filter in only results with output and take the output; the error-counting action would filter out results with output and count the different errors. That's not hard -- maybe it helps as a potential way forward; there's a sketch below.
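A sketch of the idea; Record, parse, lines, and the input path are made-up stand-ins, not your code:

```scala
// Stand-in record type and parser; a real parse would throw on bad input
case class Record(fields: Array[String])
def parse(line: String): Record = Record(line.split(","))

// Trivial container: either the desired output or a complete error description
case class ParseResult(output: Option[Record], error: Option[String])

val lines = sc.textFile("input.txt")  // hypothetical input
val results = lines.map { line =>
  try {
    ParseResult(Some(parse(line)), None)
  } catch {
    case e: Exception => ParseResult(None, Some(e.getMessage))
  }
}

// Downstream processing: filter in only results with output, take the output
val good = results.filter(_.output.isDefined).map(_.output.get)

// Error counting: filter out results with output, count the different errors
val errorCounts = results
  .filter(_.error.isDefined)
  .map(r => (r.error.get, 1L))
  .reduceByKey(_ + _)
```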
06-13-2016
07:58 AM
You can just run the counting on the output of the transformation, which presumably contains a null or something similar for records that failed to parse. The only thing is that you'll want to persist that data set to avoid recomputing it. That means some extra I/O, but on the upside, it means the data is persisted for all future stages as well.
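For example, using an Option as the failed-parse marker, and reusing the made-up names from the sketch above (parseRecord is a stand-in parser returning None on failure):

```scala
import org.apache.spark.storage.StorageLevel

val parsed = lines.map(parseRecord)           // RDD[Option[Record]]
parsed.persist(StorageLevel.MEMORY_AND_DISK)  // avoid recomputing the parse

val failures = parsed.filter(_.isEmpty).count()  // the counting action
val good = parsed.flatMap(x => x)                // later stages reuse the persisted data
```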
06-13-2016
07:45 AM
I think something got lost there -- you can increment accumulators in a transformation. The point above was just that nothing happens until something later invokes an action. The only caveat is that accumulator updates made in a transformation may double-count if tasks are retried after a failure. Or, you can run an action that does the counting with accumulators (reliably), and then separately run other transformations for whatever purpose you need; it's not like one RDD can only result in one action or transformation. That's the simple, standard answer, and there are other answers still. It's most certainly possible, even easy.
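A sketch using the Spark 2.x accumulator API; tryParse is a stand-in for your parsing logic:

```scala
val badRecords = sc.longAccumulator("badRecords")

val parsed = lines.map { line =>
  val r = tryParse(line)             // hypothetical parser returning an Option
  if (r.isEmpty) badRecords.add(1)   // incremented inside a transformation
  r
}

// Nothing is counted until an action runs, and updates made in a transformation
// can double-count if a task is retried
parsed.count()
println(badRecords.value)
```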
06-07-2016
12:31 PM
This means the JVM took more memory than YARN thought it should. Usually this means you need to allocate more overhead, so that more memory is requested from YARN for the same size of JVM heap. See the spark.yarn.executor.memoryOverhead option, which defaults to 10% of the specified executor memory. Increase it.
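For example (the values here are illustrative, not a recommendation):

```scala
// 4g of JVM heap plus 1 GB of off-heap overhead requested from YARN per executor
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.yarn.executor.memoryOverhead", "1024")  // in MB
```

The same setting can be passed on the command line with --conf spark.yarn.executor.memoryOverhead=1024.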