Created on 05-25-2016 11:55 PM - edited 08-17-2019 12:23 PM
In March 2016 we delivered the second technical preview of Apache Zeppelin on HDP 2.4. Since then, we and the Zeppelin community have continued to add new features. We now give you the final technical preview of Zeppelin, based on a snapshot of Apache Zeppelin 0.6.0. The main features in this Zeppelin technical preview are:
Prerequisites:
This document provides instructions for:
Note: although both Ambari-managed and manual install instructions are provided, you only need to follow one of them to set up Zeppelin in your cluster.
HDP Cluster Requirement
This technical preview can be installed on any HDP 2.4.2 cluster, whether it is a multi-node cluster or a single-node HDP Sandbox. The following instructions assume that Spark (version 1.6) is already installed on the HDP cluster.
Ambari Managed Zeppelin Install
Step 1: Download the Zeppelin Ambari Stack Definition
On the node running Ambari server, run the following:
sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git /var/lib/ambari-server/resources/stacks/HDP/2.4/services/ZEPPELIN
Step 2: Re-start Ambari Server
sudo service ambari-server restart
Step 3: Add Zeppelin Service with Ambari
Make sure to install the Zeppelin service on a node where the Spark client is installed.
Once Ambari comes back up and the services turn green, you can click on 'Add Service' from the 'Actions' dropdown menu in the bottom left of the Ambari dashboard:
Make a note of the node selected to run the Zeppelin service; call this ZEPPELIN_HOST.
In the bottom left: Actions -> Add Service -> check the Zeppelin service -> Next -> Next -> Next -> Deploy.
Accept all the default values and click the Deploy button.
Step 4: Launch Zeppelin
Once Zeppelin is deployed, launch http://ZEPPELIN_HOST:9995 in your browser.
Try out the included Zeppelin tutorial. A few more Zeppelin notebooks are available at the Hortonworks Zeppelin Gallery. Please try them out.
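Before opening the browser, it can help to confirm from a shell that the server is answering. The sketch below assumes curl is installed and that Zeppelin exposes its /api/version REST endpoint on port 9995 (the port used above). The function names and the example host are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the example host are made up for illustration.

# Build the URL of Zeppelin's version endpoint.
# $1 = Zeppelin host, $2 = port (defaults to 9995, the port used in this article).
zeppelin_version_url() {
  local host="$1" port="${2:-9995}"
  printf 'http://%s:%s/api/version\n' "$host" "$port"
}

# Succeeds (exit status 0) if the endpoint answers without an HTTP error
# within 10 seconds.
zeppelin_is_up() {
  curl --silent --fail --max-time 10 --output /dev/null \
    "$(zeppelin_version_url "$@")"
}

# Usage, replacing the host with your ZEPPELIN_HOST:
#   zeppelin_is_up zeppelin-host.example.com && echo "Zeppelin is responding"
```

If the check fails right after deployment, give the service a little time to finish starting before retrying.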
The rest of the steps described in this document are optional; they are needed only for additional functionality around security.
Optional: Enable Zeppelin for Security
This section shows the configuration that allows Zeppelin to authenticate end users. Zeppelin uses Livy to execute jobs with Spark on YARN as the end user.
These are the high-level steps to enable Zeppelin security:
Note: when Zeppelin authenticates end users and Livy propagates the end-user identity to Hadoop, the end user needs to exist on all nodes. In production you can leverage sssd or pam for this, but for now manually add user1 to all hosts in your cluster.
For example, to run as the sample user “user1”, run the following as root (or equivalent) on all your worker nodes:
useradd user1 -g hadoop
As the HDFS admin, create an HDFS home directory for user1:
su hdfs
hdfs dfs -mkdir /user/user1
hdfs dfs -chown user1 /user/user1
Note: if you configure Zeppelin to run as another user, you need to add that user to the OS and create an HDFS home directory for that user.
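To confirm that a user is ready before running notes as that user, a check along these lines can be run on each node. It is a sketch: check_user is a hypothetical helper, and it assumes the hdfs client is on the PATH and that the HDFS home follows the /user/<name> layout used above.

```shell
#!/usr/bin/env bash
# Sketch only: check_user is a hypothetical helper, not part of HDP.

# Report whether the OS account and the HDFS home directory exist for a user.
# $1 = user name, e.g. user1
check_user() {
  local user="$1"
  # The OS account must exist on the node where this runs.
  if ! id "$user" >/dev/null 2>&1; then
    echo "missing OS user: $user"
    return 1
  fi
  # 'hdfs dfs -test -d' exits with 0 only if the path is an existing directory.
  if ! hdfs dfs -test -d "/user/$user"; then
    echo "missing HDFS home: /user/$user"
    return 1
  fi
  echo "ok: $user"
}

# Usage:
#   check_user user1
```

The OS check is local to the node, so it needs to be repeated on every host; the HDFS check only needs to pass once.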
Edit Zeppelin’s shiro config
On the node where Zeppelin server is installed, manually edit /usr/hdp/current/zeppelin-server/lib/conf/shiro.ini and make sure the [urls] section contains the following:
[urls]
/api/version = anon
#/** = anon
/** = authcBasic
You can use users defined in shiro.ini for authentication. For example, enable the [users] section below to authenticate as user1/password2.
[users]
admin = password1
user1 = password2
user2 = password3
Alternatively, to use LDAP as the identity store, configure the section below for your LDAP server.
[main]
#ldapRealm = org.apache.shiro.realm.ldap.JndiLdapRealm
#ldapRealm.userDnTemplate = cn={0},cn=engg,ou=testdomain,dc=testdomain,dc=com
#ldapRealm.contextFactory.url = ldap://ldaphost:389
#ldapRealm.contextFactory.authenticationMechanism = SIMPLE
You can use Ambari to restart the Zeppelin server. Ignore the error that Ambari reports during the Zeppelin restart; Zeppelin starts fine.
Access Zeppelin-Tutorial and log in as user1/password2 (or any user defined in your LDAP).
Note: Logout functionality is not available in this technical preview but is being added.
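One way to check from the command line that the [urls] change took effect is to compare the HTTP status returned with and without credentials. This is a sketch under assumptions: it relies on curl, on Zeppelin's /api/notebook REST endpoint, and on the usual behavior of HTTP Basic authentication (401 when credentials are missing or wrong). explain_status and notebook_status are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Translate an HTTP status code into a short diagnosis.
explain_status() {
  case "$1" in
    200) echo "request accepted" ;;
    401) echo "credentials required or rejected" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Print the HTTP status for the notebook list endpoint.
# $1 = Zeppelin host, $2 = optional "user:password" for HTTP Basic auth.
notebook_status() {
  local host="$1" creds="${2:-}"
  if [ -n "$creds" ]; then
    curl --silent --output /dev/null --write-out '%{http_code}' \
      --user "$creds" "http://${host}:9995/api/notebook"
  else
    curl --silent --output /dev/null --write-out '%{http_code}' \
      "http://${host}:9995/api/notebook"
  fi
}

# Usage, with the sample account from the [users] section:
#   explain_status "$(notebook_status zeppelin-host.example.com)"
#   explain_status "$(notebook_status zeppelin-host.example.com user1:password2)"
```

If the call without credentials reports that the request was accepted, anonymous access is probably still allowed, which suggests the edited shiro.ini is not the file Zeppelin loaded or that the server was not restarted.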
sudo yum install livy
Configure Livy Server
Create /etc/livy/conf/livy-env.sh with the following values. Ensure the path to Java is accurate for that node.
export SPARK_HOME=/usr/hdp/current/spark-client
export JAVA_HOME=/usr/jdk64/jdk1.8.0_60
export PATH=/usr/jdk64/jdk1.8.0_60/bin:$PATH
export HADOOP_CONF_DIR=/etc/hadoop/conf
export LIVY_SERVER_JAVA_OPTS="-Xmx2g"
Create /etc/livy/conf/livy-defaults.conf with the following content.
livy.impersonation.enabled = true
On the node where Livy is installed, create the ‘livy’ user; the Livy process will run as this user.
useradd livy -g hadoop
Create livy’s logs directory and grant user ‘livy’ permissions to write to it.
mkdir /usr/hdp/current/livy-server/logs
chmod 777 /usr/hdp/current/livy-server/logs
On the Livy node, edit /etc/spark/conf/spark-defaults.conf to add the following:
spark.master yarn-client
Step 5: Grant user livy the ability to proxy users in Hadoop’s core-site.xml
Use Ambari to add the following properties to core-site.xml (/etc/hadoop/conf/core-site.xml), then restart HDFS with Ambari. See screenshot.
<property>
  <name>hadoop.proxyuser.livy.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>*</value>
</property>
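After the restart, the values can be read back on a node with hdfs getconf. The sketch below assumes the hdfs client is on the PATH; the helper names are hypothetical. It reads the configuration files on the node where it runs, so it confirms what was written to core-site.xml rather than what a running NameNode has loaded.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Print the two proxy-user property names for a given service user.
proxyuser_keys() {
  local user="$1"
  printf 'hadoop.proxyuser.%s.groups\n' "$user"
  printf 'hadoop.proxyuser.%s.hosts\n' "$user"
}

# Print "key = value" for each property, as seen by the local Hadoop client.
show_proxyuser_conf() {
  local key
  for key in $(proxyuser_keys "$1"); do
    printf '%s = %s\n' "$key" "$(hdfs getconf -confKey "$key")"
  done
}

# Usage:
#   show_proxyuser_conf livy
```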
Step 6: Start Livy server
Launch Livy server as user ‘livy’
cd /usr/hdp/current/livy-server
su livy
./bin/livy-server start
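Once started, Livy can be probed over its REST interface. The sketch assumes curl is installed and that Livy listens on port 8998, which is Livy's usual default but is not stated in this article; adjust it if your configuration differs. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; 8998 is an assumed default port.

# Build the URL of Livy's session list endpoint.
# $1 = Livy host (defaults to localhost), $2 = port (defaults to 8998).
livy_sessions_url() {
  local host="${1:-localhost}" port="${2:-8998}"
  printf 'http://%s:%s/sessions\n' "$host" "$port"
}

# Print the JSON list of sessions; fails if Livy does not answer in 10 seconds.
livy_sessions() {
  curl --silent --fail --max-time 10 "$(livy_sessions_url "$@")"
}

# Usage, on the Livy node:
#   livy_sessions
```

A freshly started server would normally report no sessions; the point of the call is only to see that something answers on the expected host and port.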
Step 7: Configure Zeppelin to use Livy
In Zeppelin, notebooks are run against the configured Interpreters. Go to your notebook and click on interpreter bindings.
On the next page, select the interpreters you want to use. Clicking an interpreter toggles its selection; unselected interpreters appear in white. You can reorder the interpreters available to your notebook by dragging and dropping them.
For example, the screenshot below shows the Livy Spark interpreter selected ahead of Spark; it is launched with %lspark.
Step 8: Confirm Livy Interpreter setting
Note the Livy interpreter setting below. If you have Livy installed on another node, replace localhost in the Livy URL with the Livy host.
If you make any changes to the Livy interpreter setting, make sure to restart the Livy interpreter.
Step 9: Run Notebooks with Livy Interpreter.
Livy supports Spark, SparkSQL, PySpark, and SparkR. To run notes with Livy, make sure to use the corresponding magic string at the top of your note.
For example, use %lspark to run Scala code via Livy, or %lspark.sql to run against SparkSQL via Livy.
To use SQLContext with Livy, do not create a SQLContext explicitly, since one is created by default. That is, remove the following lines from your SparkSQL note:
//val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//import sqlContext.implicits._
With HDP 2.4.2, Zeppelin provides access control on each notebook. Click the lock icon on the notebook to configure access to that notebook.
In the popup that opens, add the users who should have access. Refer to the screenshot below.
Note: with identity propagation enabled through Livy, data access is controlled by the data source being accessed. For example, when you access HDFS as user1, data access is controlled by HDFS permissions.
Often you will want to use one or more external libraries in a notebook. For example, to create a notebook that explores Magellan, you need to include the Magellan library and its dependencies in your environment.
There are several ways in Zeppelin to include an external dependency.
Using the %dep interpreter. Note: this will only work for libraries that are published to Maven.
%dep
z.load("group:artifact:version")
%spark
import...
Here is an example that imports the dependencies for Magellan:
%dep
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.esri.geometry:esri-geometry-api:1.2.1")
z.load("harsha2010:magellan:1.0.3-s_2.10")
For more information, see https://zeppelin.incubator.apache.org/docs/interpreter/spark.html#dependencyloading.
When you have a jar on the node where Zeppelin is running, the following approach can be useful: add the spark.files property to SPARK_HOME/conf/spark-defaults.conf; for example:
spark.files /path/to/my.jar
When you have a jar on the node where Zeppelin is running, this approach can also be useful: add the SPARK_SUBMIT_OPTIONS environment variable to the ZEPPELIN_HOME/conf/zeppelin-env.sh file; for example:
export SPARK_SUBMIT_OPTIONS="--packages group:artifact:version"
To stop the Zeppelin server, use Ambari. To stop Livy:
cd /usr/hdp/current/livy-server
su livy
./bin/livy-server stop
Created on 05-26-2016 03:09 PM
Does Zeppelin impersonate users in this scenario? How does impersonation work in this case (Zeppelin -> Livy -> Spark)?
Btw thanks for this great article @vshukla
Created on 05-26-2016 06:55 PM
Yes, Zeppelin sends authenticated end-users via Livy downstream to Spark on YARN.
Livy adds --proxy-user <username> to the spark-submit command it launches.
Created on 05-27-2016 12:37 PM
brilliant article @vshukla.
...and some questions 😉 :
You mention a 'manual install' as well; what would it look like?
What if the cluster is kerberized? Which configs will change or need to be added?
Will the above also work on HDP 2.3.4 with Spark 1.5.2? I am currently fighting to get Zeppelin working on a kerberized HDP 2.3.4...
Thanks in advance, Gerd
Created on 08-08-2016 12:58 PM
Thanks for the great article.
I am running through the instructions, but when I try to add the Zeppelin service in Ambari (step 3) I can't choose any other host but one of my slaves. How can I change it to make Zeppelin be installed on my client host group (where spark-client is installed)?
Created on 08-09-2016 06:55 PM
@Yaron Idan You can also add spark client to a slave node and then add Zeppelin to that node.
Created on 09-09-2016 05:33 PM
This was very helpful, but Livy did not pick up /etc/livy/conf/livy-defaults.conf! I changed the name to /etc/livy/conf/livy.conf and impersonation worked.
Created on 10-16-2016 02:34 PM
Hi:
I have edited /usr/hdp/current/zeppelin-server/lib/conf/shiro.ini:
[urls]
/api/version = anon
#/** = anon
/** = authcBasic
[users]
admin = admin
hdfs = hdfs
and restarted Zeppelin, but the login doesn't appear, just the anonymous user.
Do I need anything else?