We have a 300-node cluster, each node with 132 GB of memory and 20 cores. The requirement is: remove the rows from table A that also exist in table B, then merge B into A and push A to Teradata.
below is the code
val ofitemp = sqlContext.sql("select * from B")
val ofifinal = sqlContext.sql("select * from A")
// rows of A that also appear in B (the original query referenced B in the
// where clause without joining it; expressed here as an explicit join)
val selectfromfinal = sqlContext.sql("select A.a,A.b,A.c...A.x from A join B on A.x = B.y")
// drop those rows from A, then append all of B
val takefromfinal = ofifinal.except(selectfromfinal)
val tempfinal = takefromfinal.unionAll(ofitemp)
// stage the result in table C, then overwrite A from it
tempfinal.write.mode("overwrite").saveAsTable("C")
val tempTableFinal = sqlContext.table("C")
tempTableFinal.write.mode("overwrite").insertInto("A")
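For comparison, the except + unionAll pipeline above can usually be collapsed into a single left anti join, which avoids except()'s full row-by-row comparison across every column. This is only a sketch, assuming Spark 2.x's SparkSession and that A.x = B.y is the dedupe key as in the question; the output table name A_merged is a placeholder, since Spark cannot safely overwrite a table it is reading from in the same job:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: assumes Spark 2.x; adjust table names and the join key to your schema.
val spark = SparkSession.builder().getOrCreate()

val a = spark.table("A")
val b = spark.table("B")

// "left_anti" keeps only rows of A whose key has no match in B,
// comparing just the join key rather than every column as except() does.
val merged = a.join(b, a("x") === b("y"), "left_anti")
  .union(b) // union replaces the deprecated unionAll in Spark 2.x; schemas must match

// Write the result once, instead of staging through table C and rewriting A.
merged.write.mode("overwrite").saveAsTable("A_merged")
```

With only a few million rows, the shuffle partition count may also matter: the default spark.sql.shuffle.partitions of 200 may be a poor fit for a 300-node cluster and is worth tuning.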
The configuration used to run Spark is:
HIVE_MAPPER_HEAP=2048 ## MB
With A and B each having a few million records, the job takes several hours to run.
As I am very new to Spark, I cannot tell whether this is a code issue or an environment/configuration issue.
I would be grateful if you could share your expert thoughts.