07-21-2017 02:19 AM
We are working on a CDH 5.11 cluster (ADS/LDAP, Kerberos, and Sentry enabled).
Now we are evaluating a notebook solution.
Workbench (sadly) does not support the same SQL + Spark + Impala + Hive features, so we need to look at alternatives.
Hue seems to have stopped improving its notebook feature, so that is out.
Jupyter seems to have everything, but running one instance per user is a bit ... I don't like this approach.
Zeppelin looks nice, but I'm not sure how well it works with CDH because the support seems to be poor.
I managed to connect Zeppelin <-> ADS, but now I'm facing the Kerberos part.
Is it possible for each Zeppelin step to use and manage the Kerberos tokens of the logged-in user, or do I have to provide one "technical principal" that Zeppelin uses?
Apart from that question: Hortonworks fully supports Zeppelin.
What is Cloudera's answer? Which notebook should we (blue elephant guys) use?
Workbench is a good one, but it does not cover everything Zeppelin does and seems to target somewhat different clients/people.
Thanks and BR
07-21-2017 02:34 AM
HUE is the supported and recommended tool for SQL (Impala, Hive).
The HUE notebook is not supported.
The workbench is the supported and recommended tool for Spark, Python, R, and Scala. Kerberos and security work.
Zeppelin and Jupyter are not supported, and it's safe to say there are no plans to support them.
What features are you looking for? HUE + workbench should cover everything you mention. I don't know of a difference with Zeppelin in this respect.
What's a blue elephant guy?
07-21-2017 05:32 AM
Thank you for your answer!
I know that I can cover SQL/Impala through Hue and R + Python + Spark with the workbench, but I don't like the approach.
Most of the time we use SQL (Impala/Hive) for quick data analysis, and then we move to Python/Spark to go deeper or to develop and test our ETL/ELT parts.
When everything is fine, we put the tested lines into our dev IDE (IntelliJ + sbt ... testing) and deploy it.
So Zeppelin is more of an extension of our IDE dev process, in one tool.
I know that the dev approach "should" be different when using the workbench, but that's how our process is.
Zeppelin/Jupyter solves it all in one for us.
("Blue elephant guys" was meant as a joke for "Cloudera users", compared to the green elephant used by Hortonworks.)
07-21-2017 05:50 AM
You can execute SQL statements in PySpark. It uses the same metastore and the same data that you are accessing from Hive or Impala.
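A minimal sketch of what that could look like, assuming Spark 2.x (for `SparkSession`) and a session that can reach the cluster's Hive metastore. The table name `default.web_logs` and the helper names `preview_query` and `preview_table` are made up for illustration, not taken from any real cluster or API:

```python
import re

# Hypothetical table name, used only for illustration.
EXAMPLE_TABLE = "default.web_logs"

# Accepts "table" or "database.table" made of plain identifier characters.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def preview_query(table, limit=10):
    """Build a simple preview query for a metastore table."""
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError("unexpected table name: %r" % (table,))
    return "SELECT * FROM {} LIMIT {}".format(table, int(limit))


def preview_table(table=EXAMPLE_TABLE, limit=10):
    """Run the preview query through Spark SQL and return a DataFrame."""
    # Imported here so the query helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("sql-from-pyspark")
        .enableHiveSupport()  # resolve table names through the Hive metastore
        .getOrCreate()
    )
    return spark.sql(preview_query(table, limit))
```

In an interactive session, `preview_table().show()` would print the first rows, and the returned DataFrame can be passed straight on to further Python/Spark code.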
I think one of the premises of the workbench is: edit code, not notebooks. Because that makes it much more realistic to create code that's then used in 'production'. The translation step is an obstacle.
I personally think you should use your IDE to do non-interactive software development, and use the workbench for the interactive parts, all within one project. This was my take on it, for Scala: https://github.com/srowen/cdsw-simple-serving
I think Jupyter is harder to fit into this vision because it's operating in terms of notebooks, not code at heart. Zeppelin, less so.
So I think we're actually aligned and think the workbench is trying to do what you want. But that's the answer we provide. You can use Zeppelin but you're on your own, and if it's a little tricky, well yeah that's part of the point.