Spark2 taking longer time than spark1

I am upgrading Spark from version 1.6 to 2.3 and made some configuration changes as part of the upgrade.

Since then, the job takes 3-4 times longer to complete than it did on Spark 1.

I compared both versions and identified the specific point where Spark 2 seems to take more time:

 

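import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator

// Keep only the inputs that are defined and unwrap them
val filtersDf: List[DataFrame] = inputData.filter(_.isDefined).map(_.get)
// Convert modified po dataframe to json
val postProcessModifiedPoList = cleanedObjModifiedPoLists.foldLeft(Seq[String]())((modVoJSONs, data) => {
  // count() runs a separate Spark job for each dataframe
  logger.info("Convert to json: dataframe count {}", data.count().toString)
  logger.info("Convert to json: dataframe columns {}", data.columns.mkString(","))
  // SizeEstimator measures the driver-side object graph of the Dataset, not the row data
  logger.info("Convert to json: Size of dataframe {}", SizeEstimator.estimate(data).toString)
  // collect() pulls every JSON row back to the driver
  modVoJSONs ++ data.toJSON.collect()
})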

 

The logs show the job hanging at one point for 10-15 minutes while it prints the messages below and removes executors as idle. As a result, only 3 executors are left to process the final stage, which makes that stage take longer. We are using dynamic allocation with a maximum of 30 executors and a minimum of 3.

 

ContextCleaner: Cleaned accumulator ...
Removing executor as it has been idle for 60 sec
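
The dynamic allocation setup described above can be sketched as follows. Only the minimum and maximum executor counts are stated in this post. The idle timeout is inferred from the "idle for 60 sec" log line (60s is also the Spark default), and the shuffle-service setting is an assumption about a typical YARN deployment, not something taken from the actual job. The object name is hypothetical.

```scala
import org.apache.spark.SparkConf

// Hypothetical reconstruction of the settings described in the post.
object DynamicAllocationSketch {
  val conf: SparkConf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    // Stated in the post: minimum 3, maximum 30 executors.
    .set("spark.dynamicAllocation.minExecutors", "3")
    .set("spark.dynamicAllocation.maxExecutors", "30")
    // Assumed: inferred from the "idle for 60 sec" log line; 60s is the default.
    .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
    // Assumed: the external shuffle service is normally enabled alongside
    // dynamic allocation on YARN.
    .set("spark.shuffle.service.enabled", "true")
}
```

With these values, any executor that runs no tasks for 60 seconds is released until only the minimum of 3 remains, which matches the behaviour seen in the logs while the driver is busy.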

 

Sample output

 

Convert to json: dataframe count 4
Convert to json: dataframe columns (7 columns only, none holding large data)
Convert to json: Size of dataframe 508918002

 
