Reading the DataFrame using the Spark session:
val dframe = ss.read.option("inferSchema", value = true).option("delimiter", ",").csv("/home/balakumar/scala work files/matrimony.txt")
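The snippet above assumes a SparkSession named ss is already in scope, together with the col function used further down; a minimal sketch of that setup (the app name is a placeholder, not taken from the original job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val ss = SparkSession.builder().appName("matrimony-self-join").getOrCreate()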
Create two DataFrames from the input by renaming the value column:
val dfLeft = dframe.withColumnRenamed("_c1", "left_data")
val dfRight = dframe.withColumnRenamed("_c1", "right_data")
Join the two on the id column and filter out rows that pair an id with itself:
val joined = dfLeft.join(dfRight, dfLeft.col("_c0") === dfRight.col("_c0")).filter(col("left_data") =!= col("right_data"))
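For intuition, this is a self-join on the key column, so an id that appears n times in the input produces n * (n - 1) pairs after the self-match filter; the pair count grows quadratically with key frequency. A small sketch on made-up data (the values below are illustrative only, not from the real input):

import ss.implicits._

val toy = Seq((1, "a"), (1, "b"), (1, "c"), (2, "d")).toDF("_c0", "_c1")
val toyLeft = toy.withColumnRenamed("_c1", "left_data")
val toyRight = toy.withColumnRenamed("_c1", "right_data")
// Key 1 appears 3 times, so it yields 3 * 2 = 6 pairs; key 2 yields none.
toyLeft.join(toyRight, toyLeft.col("_c0") === toyRight.col("_c0"))
  .filter(col("left_data") =!= col("right_data"))
  .show()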
Write the joined output as CSV:
val result = joined.select(col("left_data"), col("right_data").as("similar_ids"))
result.write.csv("/home/balakumar/scala work files/output")
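Note that write.csv produces a directory of part files rather than a single CSV file; if a header row and overwrite semantics are wanted, the standard DataFrameWriter options can be added, for example:

result.write.mode("overwrite").option("header", "true").csv("/home/balakumar/scala work files/output")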
While running the above Spark job on the cluster with the following configuration, it is broken up into the following job IDs:
CLUSTER CONFIGURATION
3-NODE CLUSTER
NODE 1 - 64 GB, 16 CORES
NODE 2 - 64 GB, 16 CORES
NODE 3 - 64 GB, 16 CORES
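For illustration only, a common rule of thumb for 64 GB / 16-core nodes is roughly 5 cores and about 19 GB per executor; the numbers below follow that rule of thumb and are assumptions, not the settings actually used for this job:

val tunedSession = SparkSession.builder()
  .appName("matrimony-self-join")
  .config("spark.executor.instances", "8")  // assumption: ~3 executors per node, one slot left for the driver
  .config("spark.executor.cores", "5")      // assumption: leave a core per node for OS/daemons
  .config("spark.executor.memory", "19g")   // assumption: node memory split across executors, minus overhead
  .getOrCreate()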

At job ID 2, the job gets stuck at stage 51 of 254 and then starts using up disk space. I am not sure why this is happening, and my work is completely stalled. Could someone help me with this?