Created 05-01-2017 06:25 PM
What options does one have for running TensorFlow in distributed mode under YARN on a Hadoop cluster while leveraging GPUs?
To exploit GPUs, one needs the CUDA and cuDNN libraries installed. For distributed mode, I understand that TF has its own native way of doing this, but I don't believe there is a way to run it directly under YARN today. Is this correct?
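For context, TF's native distributed mode expects every process to be told the full cluster layout plus its own role, which is exactly the bookkeeping a resource manager would otherwise do. Below is a minimal sketch of that per-task configuration, using only the standard library. The `cluster` dict has the shape that `tf.train.ClusterSpec` accepts, and the JSON follows the `TF_CONFIG` convention read by TensorFlow's higher-level APIs. The helper name `build_tf_config`, the host names, and the port are assumptions made up for illustration.

```python
import json


def build_tf_config(worker_hosts, ps_hosts, job_name, task_index):
    """Return a TF_CONFIG-style JSON string for one task of the cluster.

    Hypothetical helper: it only assembles the configuration and does
    not start any TensorFlow process.
    """
    cluster = {"worker": list(worker_hosts), "ps": list(ps_hosts)}
    if job_name not in cluster:
        raise ValueError("job_name must be 'worker' or 'ps'")
    if not 0 <= task_index < len(cluster[job_name]):
        raise IndexError("task_index is out of range for job " + job_name)
    task = {"type": job_name, "index": task_index}
    return json.dumps({"cluster": cluster, "task": task}, sort_keys=True)


# Assumed host names and port, for illustration only.
workers = ["gpu-node1:2222", "gpu-node2:2222"]
parameter_servers = ["node3:2222"]

# Configuration for the second worker (index 1).
print(build_tf_config(workers, parameter_servers, "worker", 1))
```

Each process would be launched with its own copy of this string, for example in the `TF_CONFIG` environment variable. Without an integration layer, nothing under YARN allocates the hosts or hands out these roles, so this has to be done by hand or by a wrapper.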
One option for running TF in distributed mode is to run it on Spark under YARN, but there seem to be multiple ways to achieve this integration:
- TensorFrames: Experimental on Spark 2.1.
- TensorFlowOnSpark by Yahoo
- DeepLearning4J by Skymind
...
What's the proven approach?
Created 05-01-2017 07:32 PM
Hi @zhoussen, please see Hortonworks's recent blog on TF assemblies running on YARN.
As discussed in the blog, this relies on some YARN JIRAs that are targeted for HDP 3.0.
Created 05-01-2017 08:07 PM
Hi @slachterman,
Thanks. Yes, I'm aware of the Data Lake 3.0 roadmap. However, that solution is based on Docker container support, which implies a radically different way of managing a bare-metal Hadoop cluster. I'm looking for ways of achieving this on a Hadoop 2.x based cluster today.
Created 06-03-2017 11:55 PM
Hi @zhoussen
I am also trying to find an approach for TensorFlow. Did you figure it out?
Created 06-06-2017 01:18 PM
The three options I listed above are the ones I found discussed most often, but I haven't gotten a definitive answer.
Created 08-07-2017 12:36 PM
TensorFlowOnSpark is maintained by Yahoo and works well, but there is no official support for it yet.
The DL4J guys are great and can run your Keras models. It is not TensorFlow, but you can get proven professional support from their team. Having talked with them many times, I can say they are amazingly talented in deep learning and AI.