Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2726 | 01-26-2018 04:02 AM |
|  | 5725 | 12-22-2017 09:18 AM |
|  | 2704 | 12-05-2017 06:13 AM |
|  | 3001 | 10-16-2017 07:55 AM |
|  | 8326 | 10-04-2017 08:08 PM |
08-07-2016
10:02 PM
The first operation turns each value into a set containing just that single value. ++ concatenates collections, so it combines the elements of both sets. Together, this builds up a set of all values for each key, which can be written more simply as groupByKey; even then the code could be more compact and efficient. See the sketch below.
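A minimal spark-shell sketch of both forms, assuming an RDD of key-value pairs (the data here is illustrative):

```scala
// Illustrative pair RDD; assumes a spark-shell session where sc is defined
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// The pattern described above: wrap each value in a one-element Set,
// then merge the sets per key with ++
val valueSets = pairs
  .mapValues(v => Set(v))
  .reduceByKey((s1, s2) => s1 ++ s2)

// Roughly equivalent, and simpler:
val grouped = pairs.groupByKey().mapValues(_.toSet)
```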
07-22-2016
12:49 AM
Here you just ran the plain Scala shell. You have to use spark-shell to use Spark.
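A quick illustration of the difference: spark-shell predefines a SparkContext that the plain Scala REPL does not have.

```scala
// Works in spark-shell, where sc (a SparkContext) is predefined;
// in the plain `scala` REPL, sc is not defined and this fails to compile
sc.parallelize(1 to 10).sum()
```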
07-05-2016
01:45 PM
Installing Anaconda doesn't make PySpark use it; you would have to tell PySpark to do so (typically by pointing the PYSPARK_PYTHON environment variable at the desired Python binary). I was referring to the Anaconda parcel for CDH, which does that setup for you, not the generic Anaconda distribution.
07-05-2016
10:00 AM
The simplest explanation is that pandas isn't installed, of course; it's not part of the Python standard library. Consider using the Anaconda parcel to lay down a Python distribution for use with PySpark that contains many commonly used packages like pandas.
06-27-2016
09:28 AM
1 Kudo
I have a guess: you need to make each of those things a separate arg tag. I don't know Oozie well myself, but something similar is needed in Maven config files. That is, it may be reading this as one argument, "-xm mapreduce", rather than two; see the sketch below.
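Something like this, though the surrounding action XML is assumed here, not taken from your workflow:

```xml
<!-- One argument per element: -->
<arg>-xm</arg>
<arg>mapreduce</arg>
<!-- rather than a single <arg>-xm mapreduce</arg> -->
```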
06-13-2016
03:46 PM
Yes, it would be. The execution of transformations and actions is the same; only the source is different.
06-13-2016
08:13 AM
You could always use a trivial case class to contain either the desired output or a complete error description, and process accordingly. Downstream processing would filter in only results with output and take the output; the error-counting action would filter out results with output and count the different errors. That's not hard -- maybe it helps as a potential way forward; there's a sketch below.
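A sketch of the idea; Record, parse, lines, and the input path are made-up stand-ins, not your code:

```scala
// Stand-in record type and parser; a real parse would throw on bad input
case class Record(fields: Array[String])
def parse(line: String): Record = Record(line.split(","))

// Trivial container: either the desired output or a complete error description
case class ParseResult(output: Option[Record], error: Option[String])

val lines = sc.textFile("input.txt")  // hypothetical input
val results = lines.map { line =>
  try {
    ParseResult(Some(parse(line)), None)
  } catch {
    case e: Exception => ParseResult(None, Some(e.getMessage))
  }
}

// Downstream processing: filter in only results with output, take the output
val good = results.filter(_.output.isDefined).map(_.output.get)

// Error counting: filter out results with output, count the different errors
val errorCounts = results
  .filter(_.error.isDefined)
  .map(r => (r.error.get, 1L))
  .reduceByKey(_ + _)
```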
06-13-2016
07:58 AM
You can just run the counting on the output of the transformation, which presumably contains a null or something similar for records that failed to parse. The only thing is that you'll want to persist that data set to avoid recomputing it. That means some extra I/O, but on the upside, it means the data is persisted for all future stages as well.
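For example, using an Option as the failed-parse marker, and reusing the made-up names from the sketch above (parseRecord is a stand-in parser returning None on failure):

```scala
import org.apache.spark.storage.StorageLevel

val parsed = lines.map(parseRecord)           // RDD[Option[Record]]
parsed.persist(StorageLevel.MEMORY_AND_DISK)  // avoid recomputing the parse

val failures = parsed.filter(_.isEmpty).count()  // the counting action
val good = parsed.flatMap(x => x)                // later stages reuse the persisted data
```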
06-13-2016
07:45 AM
I think something got lost there -- you can increment accumulators in a transformation. The point above was just that nothing happens until something later invokes an action. The only caveat is that accumulator updates made in a transformation may double-count if tasks are retried after a failure. Or, you can run an action that does the counting with accumulators (reliably), and then separately run other transformations for whatever purpose you need; it's not like one RDD can only result in one action or transformation. That's the simple, standard answer, and there are other answers still. It's most certainly possible, even easy.
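A sketch using the Spark 2.x accumulator API; tryParse is a stand-in for your parsing logic:

```scala
val badRecords = sc.longAccumulator("badRecords")

val parsed = lines.map { line =>
  val r = tryParse(line)             // hypothetical parser returning an Option
  if (r.isEmpty) badRecords.add(1)   // incremented inside a transformation
  r
}

// Nothing is counted until an action runs, and updates made in a transformation
// can double-count if a task is retried
parsed.count()
println(badRecords.value)
```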
06-07-2016
12:31 PM
This means the JVM took more memory than YARN thought it should. Usually this means you need to allocate more overhead, so that more memory is requested from YARN for the same size of JVM heap. See the spark.yarn.executor.memoryOverhead option, which defaults to 10% of the specified executor memory. Increase it.
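For example (the values here are illustrative, not a recommendation):

```scala
// 4g of JVM heap plus 1 GB of off-heap overhead requested from YARN per executor
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.yarn.executor.memoryOverhead", "1024")  // in MB
```

The same setting can be passed on the command line with --conf spark.yarn.executor.memoryOverhead=1024.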