Support Questions

pentational · ‎03-01-2023

Assume two versions of reduceByKey. The first is:

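from functools import reduce  # built in on Python 2; must be imported on Python 3

def reduceByKey(f, kvs, acc):
    # Group the values by key, then fold each group with f, starting from acc.
    s = shuffle(kvs)
    return list(map(lambda p: (p[0], reduce(f, p[1], acc)), s))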

which works as:

reduceByKey(lambda x,y:x+y, [("k1",1),("k2",1), ("k1",2), ("k2",3)],0)

yields

[('k1', 3), ('k2', 4)]

and

def reduceByKey2(agg, kvs):
    # agg receives a whole (key, [values]) pair and returns the result pair.
    return list(map(agg, shuffle(kvs)))

with

reduceByKey2(lambda kvs: (kvs[0], sum(kvs[1])), [("k1",1),("k2",1), ("k1",2), ("k2",3)])

which yields

[('k1', 3), ('k2', 4)]

and shuffle:

def merge(kvls1, kvls2):
    # Merge two key-sorted lists of (key, [values]) pairs,
    # concatenating the value lists of equal keys.
    if len(kvls1) == 0: return kvls2
    elif len(kvls2) == 0: return kvls1
    else:
        ((k1, vl1), tl1) = (kvls1[0], kvls1[1:])
        ((k2, vl2), tl2) = (kvls2[0], kvls2[1:])
        if k1 == k2: return [(k1, vl1 + vl2)] + merge(tl1, tl2)
        elif k1 < k2: return [(k1, vl1)] + merge(tl1, kvls2)
        else: return [(k2, vl2)] + merge(kvls1, tl2)

def shuffle(kvs):
    # Turn each (k, v) into a singleton list [(k, [v])], then merge them all.
    kvls = map(lambda kv: [(kv[0], [kv[1]])], kvs)
    return reduce(merge, kvls, [])

What are the key limitations of reduceByKey2?

Also, what would be an operation that can be defined with reduceByKey2 but not with reduceByKey? I can only think of aggregation methods like median, but I think there are more.
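
To make the median case concrete, here is a minimal sketch that assumes Python 3 and the merge/shuffle definitions from this post. The names data, medians, collected and medians_via_fold are only illustrative, and statistics.median is my choice of helper, not part of the original code. It computes a per-key median with reduceByKey2. With reduceByKey, the closest I can get is to fold the values into a list and then add a separate finishing step.

```python
from functools import reduce
from statistics import median


def merge(kvls1, kvls2):
    # Merge two key-sorted lists of (key, [values]) pairs.
    if len(kvls1) == 0:
        return kvls2
    elif len(kvls2) == 0:
        return kvls1
    (k1, vl1), tl1 = kvls1[0], kvls1[1:]
    (k2, vl2), tl2 = kvls2[0], kvls2[1:]
    if k1 == k2:
        return [(k1, vl1 + vl2)] + merge(tl1, tl2)
    elif k1 < k2:
        return [(k1, vl1)] + merge(tl1, kvls2)
    else:
        return [(k2, vl2)] + merge(kvls1, tl2)


def shuffle(kvs):
    # Group all values by key: [(k, [v, ...]), ...], sorted by key.
    return reduce(merge, ([(k, [v])] for (k, v) in kvs), [])


def reduceByKey(f, kvs, acc):
    # Fold each group one value at a time with f, starting from acc.
    return [(k, reduce(f, vs, acc)) for (k, vs) in shuffle(kvs)]


def reduceByKey2(agg, kvs):
    # Hand each complete (key, [values]) pair to agg.
    return [agg(kv) for kv in shuffle(kvs)]


# Illustrative input (not from the original post).
data = [("k1", 1), ("k2", 1), ("k1", 2), ("k2", 3), ("k1", 10)]

# reduceByKey2 sees the whole value list, so median is a one-liner.
medians = reduceByKey2(lambda kv: (kv[0], median(kv[1])), data)

# reduceByKey only sees one value at a time. The workaround below
# collects the values into a list and needs an extra step afterwards.
collected = reduceByKey(lambda acc, v: acc + [v], data, [])
medians_via_fold = [(k, median(vs)) for (k, vs) in collected]

print(medians)
print(medians_via_fold)
```

Both approaches give the same medians here. The fold version only manages it by rebuilding the full value list per key and post-processing it, which, as far as I can tell, is exactly what reduceByKey2 does up front.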

VidyaSargur · ‎03-02-2023

@pentational Welcome to the Cloudera Community! I noticed that your post may be related to Spark based on some keywords used. To help you get the best possible solution, I have tagged our Spark experts @RangaReddy and @Babasaheb who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.

Regards,

Vidya Sargur,
Community Manager
