Support Questions

VJROCK · ‎04-16-2019

Hi All ,

I implemented model prediction in oozie workflow and i got error "Container is running beyond memory limits" on step 3 i.e. model1.predict_proba. Table1 has 27 Million records. It run fine on jyupiter notebook but i got this error on oozie. Can someone please help.

d1 = sqlContext.sql("SELECT * FROM table1").toPandas()
xyz= d1.drop(['abc'], axis = 1)

modelprob = model1.predict_proba(xyz)[:,1]

Error : Yarn Logs

Application application_1547693435775_8741566 failed 2 times due to AM Container for appattempt_1547693435775_8741566_000002 exited with exitCode: -104

For more detailed output, check application tracking page:https://xyz

Diagnostics: Container [pid=224941,containerID=container_e167_1547693435775_8741566_02_000002] is running beyond physical memory limits. Current usage: 121.2 GB of 121 GB physical memory used; 226.9 GB of 254.1 GB virtual memory used. Killing container.

2019-04-15 22:43:36,231 [dispatcher-event-loop-10] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_5_piece0 on xyz.corp.intranet:34252 in memory (size: 5.6 KB, free: 6.2 GB)
2019-04-15 22:43:36,231 [dispatcher-event-loop-35] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_5_piece0 on xyz1.corp.intranet:38363 in memory (size: 5.6 KB, free: 6.2 GB)
2019-04-15 22:43:36,242 [Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 4
2019-04-15 22:43:36,245 [dispatcher-event-loop-51] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_2_piece0 on xyz3 in memory (size: 53.5 KB, free: 52.8 GB)
2019-04-15 22:43:36,245 [dispatcher-event-loop-51] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_2_piece0 on xyz4.corp.intranet:46309 in memory (size: 53.5 KB, free: 6.2 GB)
2019-04-15 22:43:36,248 [dispatcher-event-loop-9] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_2_piece0 on xyz5.corp.intranet:44850 in memory (size: 53.5 KB, free: 6.2 GB)
2019-04-15 22:45:48,103 [SIGTERM handler] INFO org.apache.spark.deploy.yarn.ApplicationMaster - Final app status: FAILED, exitCode: 16
2019-04-15 22:45:48,106 [SIGTERM handler] ERROR org.apache.spark.deploy.yarn.ApplicationMaster - RECEIVED SIGNAL 15: SIGTERM
2019-04-15 22:45:48,124 [Thread-5] INFO org.apache.spark.SparkContext - Invoking stop() from shutdown hook

below are sparkconf parameters

sconf = SparkConf().setAppName("xyz model").set("spark.driver.memory", "8g").set('spark.executor.memory', '12g').set("spark.yarn.am.memory", "8g").set('spark.dynamicAllocation.enabled', 'true').set('spark.dynamicAllocation.minExecutors', 20').set('spark.dynamicAllocation.maxExecutors', '60').set("spark.shuffle.service.enabled", "true").set('spark.kryoserializer.buffer.max.mb', '2047').set("spark.shuffle.blockTransferService", "nio").set("spark.driver.maxResultSize", "4g").set('spark.rpc.message.maxSize', '330').setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)

below are sprkopts parameters :

sparkopts=--executor-memory 115g --num-executors 60 --driver-memory 110g --executor-cores 16 --driver-cores 2 --conf "spark.dynamicAllocation.enabled=true" --conf "spark.kryoserializer.buffer.max=2047m" --conf "spark.driver.maxResultSize=4096m" --conf spark.yarn.executor.memoryOverhead=8000 --conf "spark.network.timeout=10000000" --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:PermSize=2048M -XX:MaxPermSize=2048M -XX:+UseG1GC" --conf "spark.broadcast.compress=true" --conf "spark.broadcast.blockSize=128m" --conf "spark.serializer.objectStreamReset=2" --conf spark.executorEnv.PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python --files ${xyz}/hive-site.xml --files ${xyz}/yarn-site.xml

EricL · ‎05-11-2019

Hi,

It looks like that you running spark in cluster mode, and your ApplicationMaster is running OOM.

In cluster mode, the Driver is running inside the AM, I can see that you have Driver of 110G and executor memory of 12GB. Have you tried to increase both of them to see if it can help? How much I do not know, but maybe slowly increase to and keep trying.

However, the driver memory of 110GB seems to be a lot, am wondering what kind of dataset is this Spark job processing? How large is the volume?

Cheers
Eric

Cloudera Community

Support Questions

Container is running beyond memory limits - RECEIVED SIGNAL 15: SIGTERM