What options does one have for running TensorFlow in distributed mode with GPUs under YARN on a Hadoop cluster?
To exploit GPUs, one needs the CUDA and cuDNN libraries installed. As for distributed mode, I understand that TF has its own way of doing this, but I don't believe there is a way of doing it directly under YARN today. Is this correct?
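To make "distributed mode" concrete, here is a minimal sketch of the cluster description each TF process needs: the list of worker and parameter-server addresses, plus that process's own role. It uses the `TF_CONFIG` JSON layout that TensorFlow's higher-level distributed APIs read from the environment. The host names, the port, and the `make_tf_config` helper are hypothetical and only for illustration; this uses plain Python and does not start any TensorFlow servers.

```python
import json


def make_tf_config(worker_hosts, ps_hosts, task_type, task_index):
    """Build a TF_CONFIG-style dict: the whole cluster plus this process's role.

    make_tf_config is a hypothetical helper, not part of TensorFlow.
    """
    cluster = {"worker": list(worker_hosts)}
    if ps_hosts:
        cluster["ps"] = list(ps_hosts)

    # Reject roles or indices that do not exist in the cluster description.
    if task_type not in cluster:
        raise ValueError("unknown task type: %r" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index %d out of range for %r" % (task_index, task_type))

    return {"cluster": cluster, "task": {"type": task_type, "index": task_index}}


# Hypothetical GPU nodes; replace with real host:port pairs.
workers = ["gpu-node1.example.com:2222", "gpu-node2.example.com:2222"]
ps = ["gpu-node3.example.com:2222"]

# Configuration for the second worker (index 1).
config = make_tf_config(workers, ps, "worker", 1)
tf_config_json = json.dumps(config, sort_keys=True)
print(tf_config_json)
```

Each process would export its own JSON string as `TF_CONFIG` before starting. Something still has to launch one such process per host with the right role and index, and that scheduling step is the part YARN does not appear to handle directly on Hadoop 2.x.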
One option for running TF in distributed mode is to run it on Spark under YARN, but there seem to be multiple ways to achieve this integration:
- TensorFrames: Experimental on Spark 2.1.
- TensorFlowOnSpark by Yahoo
- DeepLearning4J by Skymind
What's the proven approach?
As discussed in the blog, this relies on some YARN JIRAs that are targeted for HDP 3.0.
Thanks. Yes, I'm aware of the Data Lake 3.0 roadmap, but that solution is based on Docker container support, which implies a radically different way of managing a bare-metal Hadoop cluster. I'm looking for ways of achieving this on a Hadoop 2.x-based cluster today.
TensorFlowOnSpark is run by Yahoo and is working well. There is no official support for it yet.
The DL4J guys are great, and DL4J can run your Keras models. It is not TensorFlow, but you can get professional, proven support from their team. Having talked with them many times, I can say they are amazingly talented in deep learning and AI.