
ERROR YarnScheduler: Lost executor 7 on host remote Akka client disassociated


Hi Team,

I am facing the errors below with Spark SQL:

15/12/22 06:36:17 ERROR YarnScheduler: Lost executor 5 on xxxxx0075.us2.oraclecloud.com: remote Akka client disassociated
15/12/22 06:36:17 INFO TaskSetManager: Re-queueing tasks for 5 from TaskSet 1.0
15/12/22 06:36:17 WARN TaskSetManager: Lost task 33.0 in stage 1.0 (TID 32, xxxxx0075.us2.oraclecloud.com): ExecutorLostFailure (executor 5 lost)
15/12/22 06:36:17 WARN TaskSetManager: Lost task 11.0 in stage 1.0 (TID 2, xxxxx0075.us2.oraclecloud.com): ExecutorLostFailure (executor 5 lost)
15/12/22 06:36:17 WARN TaskSetManager: Lost task 30.0 in stage 1.0 (TID 22, xxxxx0075.us2.oraclecloud.com): ExecutorLostFailure (executor 5 lost)
15/12/22 06:36:17 WARN TaskSetManager: Lost task 78.0 in stage 1.0 (TID 52, xxxxx0075.us2.oraclecloud.com): ExecutorLostFailure (executor 5 lost)
15/12/22 06:36:17 WARN TaskSetManager: Lost task 24.0 in stage 1.0 (TID 12, xxxxx0075.us2.oraclecloud.com): ExecutorLostFailure (executor 5 lost)
15/12/22 06:36:17 WARN TaskSetManager: Lost task 47.0 in stage 1.0 (TID 42, xxxxx0075.us2.oraclecloud.com): ExecutorLostFailure (executor 5 lost)
15/12/22 06:36:17 INFO DAGScheduler: Executor lost: 5 (epoch 2)
15/12/22 06:36:17 INFO BlockManagerMasterActor: Trying to remove executor 5 from BlockManagerMaster.
15/12/22 06:36:17 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor
15/12/22 06:36:20 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@xxxxx0084.us2.oraclecloud.com:39580/user/Executor#309269734] with ID 11
15/12/22 06:36:20 INFO TaskSetManager: Starting task 11.1 in stage 1.0 (TID 70, xxxxx0084.us2.oraclecloud.com, NODE_LOCAL, 1385 bytes)
15/12/22 06:36:20 INFO TaskSetManager: Starting task 28.1 in stage 1.0 (TID 71, xxxxx0084.us2.oraclecloud.com, NODE_LOCAL, 1386 bytes)
15/12/22 06:36:20 INFO TaskSetManager: Starting task 15.1 in stage 1.0 (TID 72, xxxxx0084.us2.oraclecloud.com, NODE_LOCAL, 1386 bytes)
15/12/22 06:36:20 INFO TaskSetManager: Starting task 37.1 in stage 1.0 (TID 73, xxxxx0084.us2.oraclecloud.com, NODE_LOCAL, 1386 bytes)
15/12/22 06:36:20 INFO TaskSetManager: Starting task 25.1 in stage 1.0 (TID 74, xxxxx0084.us2.oraclecloud.com, NODE_LOCAL, 1386 bytes)
15/12/22 06:36:20 INFO TaskSetManager: Starting task 82.1 in stage 1.0 (TID 75, xxxxx0084.us2.oraclecloud.com, NODE_LOCAL, 1386 bytes)
15/12/22 06:36:20 INFO BlockManagerMasterActor: Registering block manager xxxxx0084.us2.oraclecloud.com:46844 with 5.2 GB RAM, BlockManagerId(11, xxxxx0084.us2.oraclecloud.com, 46844)
15/12/22 06:36:20 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on xxxxx0084.us2.oraclecloud.com:46844 (size: 3.7 KB, free: 5.2 GB)
15/12/22 06:36:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on xxxxx0084.us2.oraclecloud.com:46844 (size: 28.7 KB, free: 5.2 GB)
15/12/22 06:36:27 INFO TaskSetManager: Starting task 104.0 in stage 1.0 (TID 76, xxxxx0080.us2.oraclecloud.com, NODE_LOCAL, 1386 bytes)
15/12/22 06:36:27 WARN TaskSetManager: Lost task 85.0 in stage 1.0 (TID 65, xxxxx0080.us2.oraclecloud.com): java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.ByteBuffer.wrap(ByteBuffer.java:373)
at parquet.io.api.Binary$ByteArraySliceBackedBinary.toStringUsingUTF8(Binary.java:91)
at org.apache.spark.sql.parquet.CatalystPrimitiveStringConverter.addBinary(ParquetConverter.scala:478)
at parquet.column.impl.ColumnReaderImpl$2$6.writeValue(ColumnReaderImpl.java:318)
at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:206)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:152)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:147)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

15/12/22 06:36:28 ERROR YarnScheduler: Lost executor 6 on xxxxx0080.us2.oraclecloud.com: remote Akka client disassociated

Spark version - 1.3.0
CDH - 5.4.0

Thanks,
Kishore

3 REPLIES

Master Collaborator

@TheKishore432 Hi, were you able to solve the issue?


Hi Fawze,

Earlier, the table I was querying had more than 1K partitions, so we split it into smaller tables and ran the same query on each of the small tables, which resolved the issue. Since Spark can't keep all of that content in memory, it was failing with the GC error.
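For anyone hitting the same thing, here is a minimal sketch of that workaround. The table, partition column, and aggregation below (events, part_date, category) are made up; substitute your own query. The idea is simply to run the same query once per partition (or per small group of partitions) and combine the small results.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Spark 1.3-style setup; a HiveContext is needed to query Hive-partitioned tables.
val sc = new SparkContext(new SparkConf().setAppName("split-partitioned-query"))
val sqlContext = new HiveContext(sc)

// Hypothetical partition values; in practice they could come from "SHOW PARTITIONS events".
val partitionDates = Seq("2015-12-01", "2015-12-02", "2015-12-03")

// Run the aggregation once per partition, so each job scans only a small slice
// of the table instead of 1K+ partitions at once.
val perPartition = partitionDates.map { d =>
  sqlContext.sql(
    s"SELECT category, COUNT(*) AS cnt FROM events WHERE part_date = '$d' GROUP BY category")
}

// Union the small per-partition results and aggregate them one final time.
perPartition.reduce(_ unionAll _).registerTempTable("per_partition_counts")
sqlContext.sql(
  "SELECT category, SUM(cnt) AS cnt FROM per_partition_counts GROUP BY category").show()

If one query per partition is too many jobs, the same idea works with coarser groups of partitions.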

 

Thanks

Kishore

Master Collaborator
Thanks,

Indeed, in my case the memory I assigned to the executors was overridden by
the memory passed in the workflow, so the executors were running with 1 GB
instead of 8 GB.

I fixed it by passing the memory in the workflow XML.
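The workflow XML itself isn't reproduced in this thread. As a hedged sketch of how to double-check which executor-memory value actually took effect, assuming Spark on YARN (the app name below is made up):

import org.apache.spark.{SparkConf, SparkContext}

// Executor memory must be fixed before the SparkContext is created; on YARN it
// cannot be changed afterwards. Values set directly on the SparkConf take
// precedence over spark-submit flags and spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("memory-check-example")
  .set("spark.executor.memory", "8g")

val sc = new SparkContext(conf)
// Print the value the application is actually running with, to catch an
// unexpected override early.
println(sc.getConf.get("spark.executor.memory"))

On the workflow side, the same setting is typically passed as --executor-memory (or spark.executor.memory) in the action's Spark options.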