About pentational

pentational · ‎03-01-2023

Assume two versions for ReduceByKey def reduceByKey(f, kvs, acc): s = shuffle(kvs) return map(lambda p: (p[0], reduce(f,p[1],acc)), s) which works as: reduceByKey(lambda x,y:x+y, [("k1",1),("k2",1), ("k1",2), ("k2",3)],0) yields [('k1', 3), ('k2', 4)] and def reduceByKey2(agg, kvs): return map(agg, shuffle(kvs)) with reduceByKey2(lam bda kvs:(kvs[0], sum(kvs[1])), [("k1",1),("k2",1), ("k1",2), ("k2",3)]) which yields [('k1', 3), ('k2', 4)] and shuffle: def merge(kvls1, kvls2): if len(kvls1) == 0: return kvls2 elif len(kvls2) == 0: return kvls1 else: ((k1,vl1), tl1) = (kvls1[0], kvls1[1:]) ((k2,vl2), tl2) = (kvls2[0], kvls2[1:]) if k1 == k2: return [(k1,vl1+vl2)]+merge(tl1,tl2) elif k1 < k2: return [(k1,vl1)]+merge(tl1,kvls2) else: return [(k2,vl2)]+merge(kvls1, tl2) def shuffle(kvs): kvls = map(lambda kv: [(kv[0], [kv[1]])], kvs) return reduce(merge, kvls, []) I am wondering about the key limitations with reduceByKey2? Also, what would be an operation that can be defined with reduceByKey2, but not by reduceByKey? I can only think of aggregations methods like median, but I think there are more.

Online	Offline
Last Visited	‎03-02-2023 12:25 AM

Member Since	‎03-01-2023 08:56 PM
Last Visited	‎03-02-2023 12:25 AM
Posts	1

Cloudera Community

reduceByKey versus reduceByKey2