Support Questions
Find answers, ask questions, and share your expertise

Spark job taking long time # Code or Environment Issue?


We have a 300-node cluster, each node with 132 GB of memory and 20 cores. The task is: remove from table A the rows that also exist in table B, then merge B into A and push A to Teradata.

Below is the code:

val ofitemp = sqlContext.sql("select * from B")
val ofifinal = sqlContext.sql("select * from A")
// rows of A that also exist in B, matched on A.x = B.y
// (the original query referenced B.y without joining B, which would fail)
val selectfromfinal = sqlContext.sql("select A.a, A.b, A.c ... A.x from A join B on A.x = B.y")
val takefromfinal = ofifinal.except(selectfromfinal)
val tempfinal = takefromfinal.unionAll(ofitemp)
tempfinal.write.mode("overwrite").saveAsTable("C")
val tempTableFinal = sqlContext.table("C")
tempTableFinal.write.mode("overwrite").insertInto("A")
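For comparison, the same "remove overlapping rows, then append B" logic can be written as a single DataFrame pipeline. This is only a sketch: it assumes a Spark 2.x SparkSession named `spark`, that the join keys are A.x and B.y, and that A and B share a compatible schema (as the union above already implies). It replaces the row-by-row `except` with a key-based anti-join and skips the intermediate table C.

// Sketch only: `spark`, and the key columns x / y, are assumptions
val a = spark.table("A")
val b = spark.table("B")

// keep only rows of A whose key does not appear in B,
// then append all of B
val merged = a.join(b, a("x") === b("y"), "left_anti").union(b)

// single write to the destination table
merged.write.mode("overwrite").saveAsTable("A_merged")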

The configuration used to run Spark is:

EXECUTOR_MEM="16G"
HIVE_MAPPER_HEAP=2048   ## MB
NUMBER_OF_EXECUTORS="25"
DRIVER_MEM="5G"
EXECUTOR_CORES="3"
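For context, a quick back-of-the-envelope check of how much of the cluster these settings actually request, using only the numbers given above (per-node usable memory is assumed to equal the full 132 GB, ignoring YARN overhead):

```python
# Cluster capacity (figures from the post)
nodes = 300
cores_per_node = 20
mem_per_node_gb = 132

# Spark submission settings (from the config above)
num_executors = 25
executor_cores = 3
executor_mem_gb = 16

# Totals actually requested by the job
total_cores_used = num_executors * executor_cores   # 75 cores
total_mem_used_gb = num_executors * executor_mem_gb # 400 GB

# Total cluster capacity
cluster_cores = nodes * cores_per_node              # 6000 cores
cluster_mem_gb = nodes * mem_per_node_gb            # 39600 GB

print(total_cores_used, cluster_cores)    # 75 6000
print(total_mem_used_gb, cluster_mem_gb)  # 400 39600
```

So the job asks for 75 of 6000 cores and 400 of 39,600 GB, a little over 1% of the cluster.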

With A and B each holding a few million records, the job takes several hours to run.

As I am very new to Spark, I cannot tell whether this is a code issue or an environment-settings issue.

I would be obliged if you could share your expert thoughts.

