Reading the dataframe using the Spark session:

val dframe = ss.read
  .option("inferSchema", value = true)
  .option("delimiter", ",")
  .csv("/home/balakumar/scala work files/matrimony.txt")

Create two tables from the input table:

val dfLeft = dframe.withColumnRenamed("_c1", "left_data")
val dfRight = dframe.withColumnRenamed("_c1", "right_data")

Join and filter duplicates from the table (note: =!= is the current not-equal Column operator; !== is deprecated):

val joined = dfLeft
  .join(dfRight, dfLeft.col("_c0") === dfRight.col("_c0"))
  .filter(col("left_data") =!= col("right_data"))

Write the joined output as CSV:

val result = joined.select(col("left_data"), col("right_data") as "similar_ids")
result.write.csv("/home/balakumar/scala work files/output")

While running the above Spark job on the cluster with the following configuration, it gets spread across several job IDs.

CLUSTER CONFIGURATION:
3-node cluster
NODE 1 - 64 GB, 16 cores
NODE 2 - 64 GB, 16 cores
NODE 3 - 64 GB, 16 cores

At job ID 2 the job is stuck at stage 51 of 254 and then starts spilling to disk. I am not sure why this is happening, and my work is completely blocked. Could someone help me with this?
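For what it's worth, the semantics of the self-join plus the not-equal filter can be modeled on plain Scala collections (a minimal sketch, not the actual Spark job; the sample data here is made up). It shows that filtering left != right keeps both orderings of every matching pair, which doubles the shuffle output:

```scala
// Model each CSV row as (id, value), mirroring columns _c0 and _c1.
val rows = Seq((1, "a"), (1, "b"), (2, "c"))

// Self-join on the id column, as in dfLeft.join(dfRight, ...).
val joined = for {
  (idL, left)  <- rows
  (idR, right) <- rows
  if idL == idR
} yield (left, right)

// left != right keeps BOTH (a, b) and (b, a);
// a condition like left < right would keep one copy per pair.
val pairs = joined.filter { case (l, r) => l != r }
```

If only one row per pair is needed, tightening the filter this way roughly halves the joined output.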
I faced a similar issue, with the following error:

18/12/10 11:33:12 ERROR YarnScheduler: Lost executor 2 on server1: Container marked as failed: container_e06_1544075636158_0018_01_000003 on host: server1. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.
Killed by external signal

The code was working fine with the default partition value of 19, but when the partitionBy method was introduced and the HashPartitioner count was increased to 38, it threw the above error, and the job kept running with no progress.
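For context, a HashPartitioner places a key in partition key.hashCode modulo numPartitions (made non-negative), so changing the count from 19 to 38 reassigns most keys and forces a full reshuffle. A minimal stand-in for that placement rule in plain Scala (not Spark's actual class; the sample key is illustrative):

```scala
// Simplified version of Spark's HashPartitioner placement rule:
// partition = key.hashCode % numPartitions, forced non-negative.
def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// Doubling partitions from 19 to 38 moves most keys to a new
// partition, so every task re-reads and re-shuffles its data.
val before = partitionFor("user_42", 19)
val after  = partitionFor("user_42", 38)
```

The exit code 143 itself usually means YARN killed the container (SIGTERM), often for exceeding memory limits during such a shuffle.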