08-08-2017 11:47 PM
We use Cloudera to deploy a Zeppelin-Spark-YARN-HDFS cluster. Right now there is only one instance of Zeppelin and Spark, so the execution of every Spark notebook affects every user.
For instance, if we stop the Spark context in one user's notebook, it affects all other users' notebooks. I've seen that Zeppelin has an option to isolate interpreters, but is there a way to give each user their own 'cluster' on demand? Maybe using Docker, building an image with Zeppelin and Spark for each user, and limiting its resources to the ones allotted to that user's cluster? I'm quite lost as to how to implement this, or whether it's even possible, but my ideal scenario would be something like what Databricks does: there you can have your own cluster, and all resources are isolated from other users.
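To make the Docker idea concrete, here's a rough sketch of what I have in mind: one Zeppelin+Spark image per user, with Docker's own resource limits capping what each user can consume. The image name, user, and limit values are made up, and the script only prints the command it would run (a dry run), so nothing is actually started:

```shell
#!/bin/sh
# Per-user container sketch: one Zeppelin+Spark image per user,
# with Docker resource limits (--cpus, --memory) capping usage.
# IMAGE, USER_NAME, and the limits are hypothetical placeholders;
# this is a dry run that only prints the command it would execute.
USER_NAME="alice"
CPUS="2"
MEMORY="4g"
IMAGE="zeppelin-spark:latest"   # hypothetical per-user image

CMD="docker run -d --name zeppelin-${USER_NAME} \
  --cpus=${CPUS} --memory=${MEMORY} \
  -p 8080 ${IMAGE}"

echo "$CMD"
```

Something like this per user would at least bound CPU and memory, though it still leaves HDFS/YARN access and authentication unsolved.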
Thanks in advance!
08-08-2017 11:57 PM
What you've described is basically what the Data Science Workbench does using Docker, and why it does it. It further isolates the whole user environment. Sharing one instance of Zeppelin / Spark means N users are logged in as 1 user, which wouldn't fly in a secured environment.
You don't need to build a private cluster per user. The point is that everyone can share the secured Spark / Hadoop cluster, which can already partition resources between users (for example with YARN scheduler queues) without physically separating them.
Cloudera does not support Zeppelin, but it obviously supports the Workbench, and that's the recommended tool.
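As an illustration of that partitioning on a shared cluster: with YARN's capacity scheduler, each user or team gets its own queue, and each Spark job is pinned to a queue at submit time, so one user can't starve the others. The queue name and resource sizes below are hypothetical, and the script just prints the submit command rather than running it:

```shell
#!/bin/sh
# Resource partitioning on a shared YARN cluster: each user/team
# submits to its own capacity-scheduler queue via --queue, which
# caps the resources that user's jobs can take.
# Queue name, sizes, and job file are hypothetical; this only
# prints the command for illustration.
QUEUE="team_analytics"

SUBMIT="spark-submit \
  --master yarn \
  --queue ${QUEUE} \
  --executor-memory 2g \
  --num-executors 4 \
  my_notebook_job.py"

echo "$SUBMIT"
```

The queues themselves are defined cluster-side (in the capacity scheduler configuration), so users stay on the one shared, secured cluster while their resource shares are enforced centrally.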
08-09-2017 04:08 AM
Thanks!! It looks like basically what I need, but I've seen that it runs under a Cloudera Enterprise license, which sadly isn't an option for me right now. Is there any other approach that doesn't involve the Enterprise version?