
I have written a simple piece of logic to find association rules, similar to collaborative filtering. The logic works fine, but we are facing run-time issues when executing the job.


Reading the DataFrame using the SparkSession:

 // ss is the SparkSession
 val dframe = ss.read
   .option("inferSchema", value = true)
   .option("delimiter", ",")
   .csv("/home/balakumar/scala work files/matrimony.txt")


Create two tables from the input table:


val dfLeft = dframe.withColumnRenamed("_c1", "left_data")
val dfRight = dframe.withColumnRenamed("_c1", "right_data")

Join the two tables and filter out self-pairs:


  import org.apache.spark.sql.functions.col

  val joined = dfLeft
    .join(dfRight, dfLeft.col("_c0") === dfRight.col("_c0"))
    .filter(col("left_data") =!= col("right_data")) // =!= replaces the deprecated !==
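Logically, this self-join emits, for every key in `_c0`, all ordered pairs of distinct `_c1` values. A minimal stdlib sketch of that semantics, with no Spark and purely hypothetical sample rows:

```scala
// Pure-Scala model of the self-join + filter above (no Spark).
// Hypothetical data: _c0 = group key, _c1 = member id.
val rows = Seq(("g1", "a"), ("g1", "b"), ("g1", "c"), ("g2", "x"))

val pairs = rows.groupBy(_._1).values.flatMap { group =>
  for {
    (_, left)  <- group
    (_, right) <- group
    if left != right // same predicate as col("left_data") =!= col("right_data")
  } yield (left, right)
}.toSeq

// A key with n members yields n * (n - 1) output pairs.
```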

Write the joined output as CSV:

 val result = joined.select(col("left_data"), col("right_data").as("similar_ids"))
 result.write.csv("/home/balakumar/scala work files/output")

While running the above Spark job on a cluster with the following configuration, it is split into the following job IDs:

CLUSTER CONFIGURATION

3 NODE CLUSTER

NODE 1 - 64GB 16CORES

NODE 2 - 64GB 16CORES

NODE 3 - 64GB 16CORES

[Screenshot 107895-works-untill-51-of-254.png: Spark UI showing the job stuck at stage 51 of 254]


At Job Id 2, the job gets stuck at stage 51 of 254 and then starts spilling to disk. I am not sure why this is happening, and my work is completely blocked. Could someone help me with this?
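For what it's worth, the output of a self-join grows quadratically in the number of rows per join key, which is a common reason a join stage starts spilling to disk: one hot key can dominate the shuffle. A back-of-the-envelope sketch (the per-key row counts below are hypothetical, not from this dataset):

```scala
// Output size of a self-join with a left != right filter is n * (n - 1) per key.
// Hypothetical per-key row counts:
val counts = Map("g1" -> 3L, "g2" -> 100000L)

val joinRows = counts.map { case (k, n) => k -> n * (n - 1) }
// A single key with 100,000 rows alone contributes ~1e10 output pairs,
// which can easily exhaust executor memory and spill to disk.
```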

1 REPLY


The job is critical. Could someone help with this?
