Explorer
Posts: 29
Registered: 01-20-2017

PySpark hangs on RDD join

Hi,

I am using the Cloudera QuickStart VM 5.8.

I am practicing PySpark by joining Orders and Order Items on order ID and aggregating revenue for each order from Order Items.

 

But when I join the two RDDs, the job hangs and never returns a result after the join.

 

Here is my code.

# Load the two Sqoop-exported tables
oRDD = sc.textFile("/user/cloudera/sqoop_import/orders")
oiRDD = sc.textFile("/user/cloudera/sqoop_import/order_items")
# Keep only cancelled orders, keyed by order id
oCancelled = oRDD.map(lambda x: x.split(",")).filter(lambda x: "CANCELED" in x[3])
ordmap = oCancelled.map(lambda x: (int(x[0]), 1))
# (order_id, subtotal) pairs from order_items, summed per order
oimap = oiRDD.map(lambda x: x.split(",")).map(lambda x: (int(x[1]), float(x[4])))
oiagg = oimap.reduceByKey(lambda x, y: x + y)
# Join aggregated revenue with the cancelled orders
ojoin = oiagg.join(ordmap)
for i in ojoin.take(10): print(i)

I tried caching the joined RDD, but it still hangs. Any solution for this?

Cloudera Employee
Posts: 4
Registered: 03-28-2017

Re: PySpark hangs on RDD join

Without an associated error, it is hard to say what is happening. In general, the PySpark RDD API is slow. One way to make PySpark run faster is to use the DataFrame API (the pyspark.sql module) instead of the RDD API. That will almost certainly run faster.
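
For example, the same cancelled-order revenue lookup could look roughly like this with DataFrames. This is a minimal sketch that reuses the paths and field positions from your RDD code above; the column names (order_id, order_status, subtotal) are just labels I am assuming for illustration.

from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sqlContext = SQLContext(sc)  # enables rdd.toDF(); on Spark 2.x a SparkSession works too

# Same files and field positions as the RDD code; column names are illustrative.
orders = (sc.textFile("/user/cloudera/sqoop_import/orders")
            .map(lambda x: x.split(","))
            .map(lambda x: (int(x[0]), x[3]))
            .toDF(["order_id", "order_status"]))

order_items = (sc.textFile("/user/cloudera/sqoop_import/order_items")
                 .map(lambda x: x.split(","))
                 .map(lambda x: (int(x[1]), float(x[4])))
                 .toDF(["order_id", "subtotal"]))

# Sum revenue per order, then keep only cancelled orders.
revenue = order_items.groupBy("order_id").agg(F.sum("subtotal").alias("revenue"))
cancelled = orders.filter(orders.order_status.contains("CANCELED"))

cancelled.join(revenue, "order_id").show(10)

The DataFrame version keeps the join and aggregation inside Spark's optimized execution engine instead of shipping every row through Python lambdas, which is usually the main cost of the RDD API.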
