08-08-2017 11:47 PM
We use Cloudera to deploy a Zeppelin-Spark-YARN-HDFS cluster. Right now there is only one instance of Zeppelin and Spark, so running any Spark notebook affects every user.
For instance, if we stop the Spark context in one user's notebook, it affects all other users' notebooks. I've seen that Zeppelin has an option to isolate interpreters, but is there a way to provide each user with their own 'cluster' on demand? Maybe using Docker and building an image with Zeppelin and Spark for each user, and limiting their resources to the ones provided by that user's cluster? I'm quite lost as to how to implement it, or whether it's even possible, but my ideal scenario would be something like what Databricks does. There you can have your own cluster, and all resources are isolated from other users.
Thanks in advance!
08-08-2017 11:57 PM
What you've described is basically what the Data Science Workbench does using Docker, and why it does it. It further isolates the whole user environment. Sharing one instance of Zeppelin / Spark means N users are logged in as 1 user, which wouldn't fly in a secured environment.
You don't need to make a private cluster. The point is that users can share the existing secured Spark / Hadoop cluster, which can already partition resources among users without splitting them into separate clusters.
Cloudera does not support Zeppelin but obviously supports the Workbench and that's the recommended tool.
08-09-2017 04:08 AM
Thanks! It looks like basically what I need, but I've seen that it requires a Cloudera Enterprise license, which sadly isn't an option for me right now. Is there any other approach that doesn't involve the Enterprise version?
06-22-2018 01:25 AM - edited 06-22-2018 01:27 AM
I have set up Apache Zeppelin 0.7.3 with Cloudera CDH 5.15.x so that each user is isolated. Each user runs their code in their own YARN queue (based on their username), which has its own limits. They do not impact each other at all.
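As a rough illustration of the per-user queue idea, the sketch below builds Spark properties that send a user's job to a queue named after them. The `root.users.<username>` naming scheme, the helper names, and the default caps are assumptions for illustration only, not the settings of the setup described above; `spark.yarn.queue`, `spark.executor.memory`, and `spark.dynamicAllocation.maxExecutors` are standard Spark properties. The actual limits would still have to be defined on the YARN side for each queue.

```python
import re


def spark_conf_for_user(username, max_executors=4, executor_memory="2g"):
    """Build Spark properties that route a user's job to their own YARN queue.

    Hypothetical helper: the queue naming scheme and the defaults are
    assumptions, not values taken from any real cluster.
    """
    # Reject names that would not map cleanly to a queue name in this sketch.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_-]*", username):
        raise ValueError(f"unexpected username: {username!r}")
    return {
        # Assumed convention: one queue per user under root.users.
        "spark.yarn.queue": f"root.users.{username}",
        # Upper bound on executors when dynamic allocation is in use.
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.executor.memory": executor_memory,
    }


def to_spark_submit_args(conf):
    """Turn a property dict into a flat list of --conf key=value arguments."""
    return [arg for key, value in sorted(conf.items())
            for arg in ("--conf", f"{key}={value}")]
```

Calling `to_spark_submit_args(spark_conf_for_user("alice"))` gives a list of `--conf key=value` pairs that could be appended to a `spark-submit` command. The same properties could instead be entered in the Spark interpreter settings in Zeppelin.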
I think what you are looking for is quite feasible with Zeppelin. It works with either Livy or a plain Spark context; I have tested both on my CDH cluster, and they have worked for dozens of data scientists at our lab.
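For the Livy route, per-user isolation comes down to the session request carrying the user's name and queue. The sketch below only builds the JSON body for Livy's `POST /sessions` call and does not send it. `kind`, `proxyUser`, and `queue` are fields of that request body; the function name and the `root.users.<username>` default are assumptions for illustration. With authentication enabled, Zeppelin's Livy interpreter sets `proxyUser` to the logged-in user itself, so this is just to show the shape of the request.

```python
import json


def livy_session_payload(username, queue=None, kind="pyspark"):
    """Build the request body for creating a Livy session as a given user.

    Hypothetical helper: the default queue name is an assumed convention.
    """
    return {
        "kind": kind,            # session type, e.g. "spark" or "pyspark"
        "proxyUser": username,   # user to impersonate when running the session
        "queue": queue or f"root.users.{username}",  # YARN queue to submit to
    }


def livy_session_json(username, **kwargs):
    """Serialize the payload, ready to be sent as the POST body."""
    return json.dumps(livy_session_payload(username, **kwargs), sort_keys=True)
```

Impersonation through `proxyUser` only works if it is enabled on the Livy server and the Livy service user is allowed to act as a proxy user in Hadoop.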
You may also want to take a look at DSW. It can now be deployed through Cloudera Manager, which is much easier, and it supports more operating systems. (I'm not sure whether it works on Cloudera Express.)