
Running TensorFlow on YARN 3.1 with or without GPU


You can run TensorFlow on YARN with or without Docker containers. If you are not using Docker containers, you will need to install CUDA, TensorFlow and all your data science libraries yourself.


Tips from Wangda

Basically, GPU on YARN gives you isolation of GPU devices. Say a node has 4 GPUs. The first task comes in and asks for 1 GPU, and the YARN NodeManager (NM) gives that task GPU0. Then a second task comes in and asks for 2 GPUs, and the YARN NM gives it GPU1 and GPU2. So from the TensorFlow (TF) perspective, you don't need to specify which GPUs to use: TF will automatically detect and consume whatever is available to the job. In this case, task 2 cannot see any GPUs other than GPU1 and GPU2.
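The allocation scenario above can be sketched as a toy model. This is not YARN code: the class `NodeGpuAllocator` and its methods are hypothetical names made up for illustration, and the "first free device" policy is an assumption that simply reproduces the example in the tip.

```python
from typing import Dict, List


class NodeGpuAllocator:
    """Toy model of one node handing out GPUs exclusively (illustration only, not YARN)."""

    def __init__(self, gpu_count: int) -> None:
        self.free: List[int] = list(range(gpu_count))   # GPU indices nobody holds yet
        self.assigned: Dict[str, List[int]] = {}        # task id -> GPUs it was given

    def allocate(self, task_id: str, requested: int) -> List[int]:
        """Give the task the lowest-numbered free GPUs, or fail if there are too few."""
        if requested > len(self.free):
            raise RuntimeError(
                f"{task_id} asked for {requested} GPUs, only {len(self.free)} free"
            )
        granted, self.free = self.free[:requested], self.free[requested:]
        self.assigned[task_id] = granted
        return granted

    def visible_to(self, task_id: str) -> List[int]:
        """A task only ever sees the devices assigned to it."""
        return list(self.assigned.get(task_id, []))


node = NodeGpuAllocator(gpu_count=4)
task1_gpus = node.allocate("task1", 1)   # first task asks for 1 GPU
task2_gpus = node.allocate("task2", 2)   # second task asks for 2 GPUs
print(task1_gpus, task2_gpus, node.free)  # prints: [0] [1, 2] [3]
```

The point of the sketch is that the task never chooses device numbers itself; it just uses whatever the node assigned to it.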

If you wish to run Apache MXNet deep learning programs, see this article:


  • Install CUDA and NVIDIA libraries if you have NVIDIA cards.
  • Install Python 3.x
  • Install Docker
  • Install PIP
  • sudo yum groupinstall 'Development Tools' -y
  • sudo yum install cmake git pkgconfig -y
  • sudo yum install libpng-devel libjpeg-turbo-devel jasper-devel openexr-devel libtiff-devel libwebp-devel -y
  • sudo yum install libdc1394-devel libv4l-devel gstreamer-plugins-base-devel -y
  • sudo yum install gtk2-devel -y
  • sudo yum install tbb-devel eigen3-devel -y
  • pip3.6 install --upgrade pip
  • pip3.6 install tensorflow
  • pip3.6 install numpy -U
  • pip3.6 install scikit-learn -U
  • pip3.6 install opencv-python -U
  • pip3.6 install keras
  • pip3.6 install hdfs
  • git clone
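After the installs above, it can help to confirm that the Python packages are visible to the interpreter on each node. The following is a minimal sketch, not part of the original setup: the helper name `check_packages` is hypothetical, and the mapping only records that `scikit-learn` is imported as `sklearn` and `opencv-python` as `cv2`.

```python
import importlib.util
from typing import Dict

# pip package name -> module name used in an import statement
PIP_TO_MODULE = {
    "tensorflow": "tensorflow",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "opencv-python": "cv2",
    "keras": "keras",
    "hdfs": "hdfs",
}


def check_packages(mapping: Dict[str, str] = PIP_TO_MODULE) -> Dict[str, bool]:
    """Return, for each pip package, whether its module can be located (without importing it)."""
    return {
        pip_name: importlib.util.find_spec(module) is not None
        for pip_name, module in mapping.items()
    }


if __name__ == "__main__":
    for pip_name, ok in check_packages().items():
        print(f"{pip_name:15s} {'OK' if ok else 'MISSING'}")
```

Run it with the same interpreter the job will use (python3.6 in this article), since packages installed for another Python version will show up as missing.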

You can see a docker example:





Run Command for an Example Classification

yarn jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -shell_command python3.6 -shell_args "/opt/demo/DWS-DeepLearning-CrashCourse/ /opt/demo/images/photo1.jpg" -container_resources memory-mb=512,vcores=1

Without Docker

-container_resources memory-mb=3072,vcores=1,

With Docker (enable it first):

-shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \

-shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=<docker-image-name> \
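To show how the Docker and non-Docker variants differ only in the extra `-shell_env` options, here is a small sketch that assembles the distributed shell command as an argument list. It uses only the flags that appear in this article; the function name `build_dshell_command`, the script path `/opt/demo/example.py` and the image name `my-tf-image` are hypothetical placeholders, not values from the original setup.

```python
import shlex
from typing import List, Optional

# Jar path as used in the commands in this article (HDP layout).
DSHELL_JAR = "/usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar"


def build_dshell_command(shell_args: str,
                         memory_mb: int = 512,
                         vcores: int = 1,
                         docker_image: Optional[str] = None,
                         shell_command: str = "python3.6") -> List[str]:
    """Build the argv for a YARN distributed shell run, optionally inside Docker."""
    cmd = [
        "yarn", "jar", DSHELL_JAR, "-jar", DSHELL_JAR,
        "-shell_command", shell_command,
        "-shell_args", shell_args,   # passed as one argument, like the quoted string in the article
        "-container_resources", f"memory-mb={memory_mb},vcores={vcores}",
    ]
    if docker_image is not None:
        # The two environment variables the article lists for the Docker runtime.
        cmd += [
            "-shell_env", "YARN_CONTAINER_RUNTIME_TYPE=docker",
            "-shell_env", f"YARN_CONTAINER_RUNTIME_DOCKER_IMAGE={docker_image}",
        ]
    return cmd


if __name__ == "__main__":
    # Hypothetical script path and image name, for illustration only.
    argv = build_dshell_command("/opt/demo/example.py /opt/demo/images/photo1.jpg",
                                memory_mb=3072, docker_image="my-tf-image")
    print(" ".join(shlex.quote(part) for part in argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` on a machine where the `yarn` client is installed; printing it first makes it easy to compare against the commands shown here.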

Running a More Complex Training Job

This is the main example:

yarn jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -shell_command python3.6 -shell_args "/opt/demo/models/tutorials/image/cifar10_estimator/ --data-dir=hdfs://default/tmp/cifar-10-data --job-dir=hdfs://default/tmp/cifar-10-jobdir --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=0" -container_resources memory-mb=512,vcores=1




Example Output

[hdfs@princeton0 DWS-DeepLearning-CrashCourse]$ python3.6
2018-10-15 02:37:23.892791: W tensorflow/core/framework/] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
2018-10-15 02:37:24.181707: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

273 racer, race car, racing car 37.46013343334198%
274 sports car, sport car 25.35209059715271%
267 cab, hack, taxi, taxicab 11.118262261152267%
268 convertible 9.854312241077423%
271 minivan 3.2295159995555878%



Output Written to HDFS

hdfs dfs -ls /tfyarn
Found 1 items
-rw-r--r--   3 root hdfs        457 2018-10-15 02:35 /tfyarn/tf_uuid_img_20181015023542.json

hdfs dfs -cat /tfyarn/tf_uuid_img_20181015023542.json
{"node_id273": "273", "humanstr273": "racer, race car, racing car", "score273": "37.46013343334198", "node_id274": "274", "humanstr274": "sports car, sport car", "score274": "25.35209059715271", "node_id267": "267", "humanstr267": "cab, hack, taxi, taxicab", "score267": "11.118262261152267", "node_id268": "268", "humanstr268": "convertible", "score268": "9.854312241077423", "node_id271": "271", "humanstr271": "minivan", "score271": "3.2295159995555878"}
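The JSON written to HDFS is flat: each prediction is spread over three keys (`node_id<N>`, `humanstr<N>`, `score<N>`) and all values are strings. A minimal sketch of regrouping it into ranked records follows; the function name `ranked_predictions` is hypothetical, and the sample is the first two entries of the output above.

```python
import json
from typing import Dict, List


def ranked_predictions(flat: Dict[str, str]) -> List[dict]:
    """Group node_id<N>/humanstr<N>/score<N> keys into records, highest score first."""
    rows = []
    for key, node_id in flat.items():
        if key.startswith("node_id"):
            rows.append({
                "node_id": int(node_id),
                "label": flat[f"humanstr{node_id}"],
                "score": float(flat[f"score{node_id}"]),  # scores are stored as strings
            })
    return sorted(rows, key=lambda row: row["score"], reverse=True)


# First two entries of the JSON shown above.
sample = json.loads(
    '{"node_id273": "273", "humanstr273": "racer, race car, racing car", '
    '"score273": "37.46013343334198", '
    '"node_id274": "274", "humanstr274": "sports car, sport car", '
    '"score274": "25.35209059715271"}'
)

for row in ranked_predictions(sample):
    print(f'{row["node_id"]} {row["label"]} {row["score"]:.2f}%')
```

To use it on the real file, the output of `hdfs dfs -cat` can be piped into a script that calls `json.load(sys.stdin)` instead of using the inline sample.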


Full Source Code




Coming Soon


Last update: 03-11-2020 03:58 PM