
reduceByKey versus reduceByKey2

New Contributor

Assume two versions of reduceByKey:

 

from functools import reduce

def reduceByKey(f, kvs, acc):
   # Group values by key, then fold each group's value list with f, starting from acc.
   s = shuffle(kvs)
   return list(map(lambda p: (p[0], reduce(f, p[1], acc)), s))
 
which works as: 
 
reduceByKey(lambda x,y:x+y, [("k1",1),("k2",1), ("k1",2), ("k2",3)],0) 
yields
[('k1', 3), ('k2', 4)]
 
 
and 
def reduceByKey2(agg, kvs):
   # agg receives each whole (key, [values]) group at once.
   return list(map(agg, shuffle(kvs)))
 
with 
reduceByKey2(lambda kvs: (kvs[0], sum(kvs[1])), [("k1",1),("k2",1), ("k1",2), ("k2",3)])
which yields
[('k1', 3), ('k2', 4)]
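For comparison, reduceByKey itself can be written in terms of reduceByKey2, which suggests reduceByKey2 is at least as general. A minimal sketch, using a simple dict-based stand-in for the shuffle defined further down so the snippet runs on its own:

```python
from functools import reduce

# Dict-based stand-in for shuffle: produces the same sorted (key, [values]) shape.
def shuffle(kvs):
    groups = {}
    for k, v in kvs:
        groups.setdefault(k, []).append(v)
    return sorted(groups.items())

def reduceByKey2(agg, kvs):
    return list(map(agg, shuffle(kvs)))

# reduceByKey expressed via reduceByKey2: fold each group's value list with f.
def reduceByKey(f, kvs, acc):
    return reduceByKey2(lambda g: (g[0], reduce(f, g[1], acc)), kvs)

data = [("k1", 1), ("k2", 1), ("k1", 2), ("k2", 3)]
print(reduceByKey(lambda x, y: x + y, data, 0))  # [('k1', 3), ('k2', 4)]
```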
 
 
and shuffle: 
 
def merge(kvls1, kvls2):
   # Merge two key-sorted association lists; concatenate value lists on equal keys.
   if len(kvls1) == 0: return kvls2
   elif len(kvls2) == 0: return kvls1
   else:
      ((k1,vl1), tl1) = (kvls1[0], kvls1[1:])
      ((k2,vl2), tl2) = (kvls2[0], kvls2[1:])
      if k1 == k2: return [(k1,vl1+vl2)]+merge(tl1,tl2)
      elif k1 < k2: return [(k1,vl1)]+merge(tl1,kvls2)
      else: return [(k2,vl2)]+merge(kvls1, tl2)

def shuffle(kvs):
   # Wrap each pair as a singleton sorted list, then merge all the lists together.
   kvls = map(lambda kv: [(kv[0], [kv[1]])], kvs)
   return reduce(merge, kvls, [])
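As a quick sanity check, the snippet below repeats merge and shuffle from above (plus the functools import that Python 3's reduce requires) and prints the intermediate grouping that both reduceByKey versions consume:

```python
from functools import reduce

def merge(kvls1, kvls2):
    # Merge two key-sorted association lists; concatenate value lists on equal keys.
    if len(kvls1) == 0: return kvls2
    elif len(kvls2) == 0: return kvls1
    else:
        ((k1, vl1), tl1) = (kvls1[0], kvls1[1:])
        ((k2, vl2), tl2) = (kvls2[0], kvls2[1:])
        if k1 == k2: return [(k1, vl1 + vl2)] + merge(tl1, tl2)
        elif k1 < k2: return [(k1, vl1)] + merge(tl1, kvls2)
        else: return [(k2, vl2)] + merge(kvls1, tl2)

def shuffle(kvs):
    # Wrap each pair as a singleton sorted list, then merge all the lists together.
    kvls = map(lambda kv: [(kv[0], [kv[1]])], kvs)
    return reduce(merge, kvls, [])

print(shuffle([("k1", 1), ("k2", 1), ("k1", 2), ("k2", 3)]))
# [('k1', [1, 2]), ('k2', [1, 3])]
```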
 
 
I am wondering: what are the key limitations of reduceByKey2?
 
Also, what would be an operation that can be defined with reduceByKey2 but not with reduceByKey? I can only think of aggregation methods like median, but I suspect there are more.
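To illustrate the median point: median needs the whole value list at once, which reduceByKey2's aggregator gets directly, whereas reduceByKey only ever sees one value at a time through a binary fold. A minimal sketch, again with a dict-based stand-in for shuffle so it runs standalone:

```python
from statistics import median

# Dict-based stand-in for shuffle: same sorted (key, [values]) output shape.
def shuffle(kvs):
    groups = {}
    for k, v in kvs:
        groups.setdefault(k, []).append(v)
    return sorted(groups.items())

def reduceByKey2(agg, kvs):
    return list(map(agg, shuffle(kvs)))

data = [("k1", 1), ("k2", 1), ("k1", 2), ("k2", 3), ("k1", 10)]
# The aggregator sees each complete group, so it can sort / take the middle element.
print(reduceByKey2(lambda g: (g[0], median(g[1])), data))
```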
 
1 REPLY

Community Manager

@pentational Welcome to the Cloudera Community! I noticed that your post may be related to Spark based on some of the keywords used. To help you get the best possible solution, I have tagged our Spark experts @RangaReddy and @Babasaheb, who may be able to assist you further.

 

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.

 



Regards,

Vidya Sargur,
Community Manager

