Running Apache MXNet Deep Learning on YARN 3.1 - HDP 3.0


With Hadoop 3.1 / HDP 3.0, we can easily run distributed classification, training, and other deep learning jobs on YARN. I am using Apache MXNet with Python, but you could also use TensorFlow or PyTorch.

If you need GPU resources, you can specify them as such:

Unfortunately, my cluster does not have an NVIDIA GPU.



Running App on YARN

[root@princeton0 ApacheDeepLearning101]# ./
18/09/21 15:31:22 INFO distributedshell.Client: Initializing Client
18/09/21 15:31:22 INFO distributedshell.Client: Running Client
18/09/21 15:31:22 INFO client.RMProxy: Connecting to ResourceManager at
18/09/21 15:31:23 INFO client.AHSProxy: Connecting to Application History server at
18/09/21 15:31:23 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=1
18/09/21 15:31:23 INFO distributedshell.Client: Got Cluster node info from ASM
18/09/21 15:31:23 INFO distributedshell.Client: Got node report from ASM for,,, nodeRackName=/default-rack, nodeNumContainers=4
18/09/21 15:31:23 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.4, queueMaxCapacity=1.0, queueApplicationCount=8, queueChildQueueCount=0
18/09/21 15:31:23 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS
18/09/21 15:31:23 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=ADMINISTER_QUEUE
18/09/21 15:31:23 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=SUBMIT_APPLICATIONS
18/09/21 15:31:23 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=ADMINISTER_QUEUE
18/09/21 15:31:23 INFO distributedshell.Client: Max mem capability of resources in this cluster 15360
18/09/21 15:31:23 INFO distributedshell.Client: Max virtual cores capability of resources in this cluster 12
18/09/21 15:31:23 WARN distributedshell.Client: AM Memory not specified, use 100 mb as AM memory
18/09/21 15:31:23 WARN distributedshell.Client: AM vcore not specified, use 1 mb as AM vcores
18/09/21 15:31:23 WARN distributedshell.Client: AM Resource capability=<memory:100, vCores:1>
18/09/21 15:31:23 INFO distributedshell.Client: Copy App Master jar from local filesystem and add to local environment
18/09/21 15:31:24 INFO distributedshell.Client: Set the environment for the application master
18/09/21 15:31:24 INFO distributedshell.Client: Setting up app master command
18/09/21 15:31:24 INFO distributedshell.Client: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx100m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_type GUARANTEED --container_memory 512 --container_vcores 1 --num_containers 1 --priority 0 1><LOG_DIR>/AppMaster.stdout 2><LOG_DIR>/AppMaster.stderr
18/09/21 15:31:24 INFO distributedshell.Client: Submitting application to ASM
18/09/21 15:31:24 INFO impl.YarnClientImpl: Submitted application application_1536697796040_0022
18/09/21 15:31:25 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=AM container is launched, waiting for AM container to Register with RM, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=, appUser=root
18/09/21 15:31:26 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=AM container is launched, waiting for AM container to Register with RM, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=, appUser=root
18/09/21 15:31:27 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=AM container is launched, waiting for AM container to Register with RM, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=, appUser=root
18/09/21 15:31:28 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=, appMasterHost=princeton0/, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=, appUser=root
18/09/21 15:31:29 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=, appMasterHost=princeton0/, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=, appUser=root
18/09/21 15:31:30 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=, appMasterHost=princeton0/, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=, appUser=root
18/09/21 15:31:31 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=, appMasterHost=princeton0/, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=, appUser=root
18/09/21 15:31:32 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=, appMasterHost=princeton0/, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=, appUser=root
18/09/21 15:31:33 INFO distributedshell.Client: Got application report from ASM for, appId=22, clientToAMToken=null, appDiagnostics=, appMasterHost=princeton0/, appQueue=default, appMasterRpcPort=-1, appStartTime=1537543884622, yarnAppState=FINISHED, distributedFinalState=SUCCEEDED, appTrackingUrl=, appUser=root
18/09/21 15:31:33 INFO distributedshell.Client: Application has completed successfully. Breaking monitoring loop
18/09/21 15:31:33 INFO distributedshell.Client: Application completed successfully






yarn jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -shell_command python3.6 -shell_args "/opt/demo/ApacheDeepLearning101/ /opt/demo/images/201813161108103.jpg" -container_resources memory-mb=512,vcores=1
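If you would rather assemble that submission from Python, here is a minimal sketch. The helper name `build_dshell_command` and the script path in the usage line are hypothetical placeholders. The GPU part assumes the standard `yarn.io/gpu` resource name from the Hadoop GPU-on-YARN documentation and a cluster with GPU scheduling enabled, which I could not test on my cluster.

```python
import shlex

# Jar path taken from the command above.
DSHELL_JAR = "/usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar"


def build_dshell_command(shell_command, shell_args, memory_mb=512, vcores=1, gpus=0):
    """Return the argv list for a YARN distributed shell submission (hypothetical helper)."""
    resources = "memory-mb=%d,vcores=%d" % (memory_mb, vcores)
    if gpus > 0:
        # Assumption: GPU scheduling is enabled and the resource is named yarn.io/gpu.
        resources += ",yarn.io/gpu=%d" % gpus
    return [
        "yarn", "jar", DSHELL_JAR,
        "-jar", DSHELL_JAR,
        "-shell_command", shell_command,
        "-shell_args", shell_args,
        "-container_resources", resources,
    ]


# Placeholder script path; substitute your own MXNet script.
cmd = build_dshell_command(
    "python3.6",
    "/path/to/your_script.py /opt/demo/images/201813161108103.jpg",
    gpus=1,
)
# Print a copy-pasteable shell line; pass cmd to subprocess.run to actually submit.
print(" ".join(shlex.quote(part) for part in cmd))
```

With `gpus=0` the resource string matches the CPU-only command shown above.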

For versions before HDP 3.0, see my older script that uses the DMLC YARN runner. With YARN 3.1 we no longer need it, and we don't need Spark either.

Python MXNet Script:

Since the job runs distributed, let's write the results to HDFS. We can use the Python HDFS library, which works on both Python 2.7 and 3.x, so let's install it with pip.

pip install hdfs

In our code:

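from json import dumps
from hdfs import InsecureClient

# WebHDFS client for a cluster without Kerberos, acting as user root
client = InsecureClient('', user='root')

# uniqueid and row are built earlier in the script
client.write('/mxnetyarn/' + uniqueid + '.json', dumps(row))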

We write our row as JSON to HDFS.
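As a rough sketch of how `uniqueid` and `row` might be put together before that write: the helper names `build_row` and `write_row` are hypothetical, the `mxnet_uuid_img_` plus timestamp pattern is inferred from the file names in this article, and the `(label, probability)` input format is an assumption about what the classifier returns.

```python
from datetime import datetime
from json import dumps


def build_row(predictions, imagefilename, runtime, now=None):
    """Build the flat result record from (label, probability) pairs, best first."""
    now = now or datetime.now()
    # Assumed naming scheme: fixed prefix plus a second-resolution timestamp.
    uniqueid = "mxnet_uuid_img_" + now.strftime("%Y%m%d%H%M%S")
    row = {"uuid": uniqueid}
    for rank, (label, prob) in enumerate(predictions[:5], start=1):
        row["top%dpct" % rank] = str(prob * 100)  # stored as a string percentage
        row["top%d" % rank] = label
    row["imagefilename"] = imagefilename
    row["runtime"] = str(runtime)
    return uniqueid, row


def write_row(client, uniqueid, row, base_dir="/mxnetyarn"):
    """Write the row as one JSON file; client is expected to be an hdfs InsecureClient."""
    path = base_dir + "/" + uniqueid + ".json"
    client.write(path, dumps(row), encoding="utf-8")
    return path
```

Keeping the record building separate from the write means you can check the JSON locally before submitting the job to YARN.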

When the job completes in YARN, we get a new JSON file written to HDFS.

hdfs dfs -ls /mxnetyarn 

Found 2 items 

-rw-r--r--   3 root hdfs        424 2018-09-21 17:50 /mxnetyarn/mxnet_uuid_img_20180921175007.json
-rw-r--r--   3 root hdfs        424 2018-09-21 17:55 /mxnetyarn/mxnet_uuid_img_20180921175552.json

hdfs dfs -cat /mxnetyarn/mxnet_uuid_img_20180921175552.json

{"uuid": "mxnet_uuid_img_20180921175552", "top1pct": "49.799999594688416", "top1": "n03063599 coffee mug", "top2pct": "21.50000035762787", "top2": "n07930864 cup", "top3pct": "12.399999797344208", "top3": "n07920052 espresso", "top4pct": "7.500000298023224", "top4": "n07584110 consomme", "top5pct": "5.200000107288361", "top5": "n04263257 soup bowl", "imagefilename": "/opt/demo/images/201813161108103.jpg", "runtime": "0"}
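To consume a record like this downstream, a small sketch along these lines could turn it back into ranked predictions. The function name `top_predictions` is hypothetical, and the sample is a shortened copy of the record above.

```python
from json import loads


def top_predictions(record_text):
    """Return [(label, percent)] for top1, top2, ... from one JSON record."""
    row = loads(record_text)
    results = []
    rank = 1
    # Walk top1, top2, ... until the next rank is missing from the record.
    while "top%d" % rank in row:
        results.append((row["top%d" % rank], float(row["top%dpct" % rank])))
        rank += 1
    return results


# Shortened copy of the record shown above.
sample = (
    '{"uuid": "mxnet_uuid_img_20180921175552", '
    '"top1pct": "49.799999594688416", "top1": "n03063599 coffee mug", '
    '"top2pct": "21.50000035762787", "top2": "n07930864 cup"}'
)

for label, pct in top_predictions(sample):
    print("%-25s %5.1f%%" % (label, pct))
```

With the hdfs client, the record text could come from `client.read(path, encoding='utf-8')` used as a context manager instead of the inline sample.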

HDP Assemblies



Coming soon: Submarine, a really cool new way to run deep learning jobs on YARN.

See this awesome presentation from Strata NYC 2018 by Wangda Tan (Hortonworks):

See the quick start for setting Docker and GPU options:
