03-25-2017 12:23 PM
I am using the Cloudera QuickStart VM 5.8.
I am practicing PySpark by joining orders and order items on order ID and aggregating the revenue for each order from the order items.
But when I join the two RDDs, the job hangs and never returns a result.
Here is my code.
oRDD = sc.textFile("/user/cloudera/sqoop_import/orders")
oiRDD = sc.textFile("/user/cloudera/sqoop_import/order_items")
oCancelled = oRDD.map(lambda x: x.split(",")).filter(lambda x: ("CANCELED" in x))
ordmap = oCancelled.map(lambda x: (int(x), 1))
oimap = oiRDD.map(lambda x: x.split(",")).map(lambda x: (int(x), float(x)))
oiagg = oimap.reduceByKey(lambda x, y: x + y)
ojoin = oiagg.join(ordmap)
for i in ojoin.take(10):
    print(i)
I tried caching the joined RDD, but it still hangs. Is there a solution for this?
03-29-2017 09:05 AM
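Without an associated error, it is hard to say what is happening. In general, PySpark's RDD API is slow, because every lambda runs in separate Python worker processes and the data has to be serialized between the JVM and Python. One way to make PySpark run faster is to use the DataFrame API (the pyspark.sql module) instead of the RDD API. That will almost certainly run faster.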