
Motivation: As Zeppelin is used for data ingestion, discovery, analytics, and visualization, usage of this multi-purpose notebook is growing rapidly, pushing users to look for ways to scale out. This article aims to give a fair idea of how Zeppelin can be scaled out with the options currently available in 2.6.x.

Disclaimer: This article is based on my personal experience and knowledge. Don't treat it as a standard guideline; understand the concepts and adapt them to your environment's best practices and use case.

Increasing the Zeppelin and interpreter memory and tuning the factors that affect performance will help to a certain extent, but as notebook usage grows it is wise to scale out.

In this article we discuss the considerations and best practices for scaling out. Note: Please refer to "Key factors that affects Zeppelin's Performance" before deciding to scale out; it also covers best practices for maintaining a Zeppelin environment.

Benefits of multiple instances:

  1. Load sharing and performance
  2. High availability: Zeppelin currently doesn't have HA, but by adding more instances behind a load balancer (not part of HDP) we can achieve it.
  3. You can create custom configuration groups with different configurations for each Zeppelin instance, or for a set of instances, by using Manage Config Groups in Ambari (e.g. a completely different Shiro authentication setup). Refer: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.2.2/bk_ambari-operations/content/using_host_con...

Adding multiple Zeppelin instances is not exposed in the Ambari UI because of a few limitations. The ones I am aware of: there is no HA support without an external tool, and a notebook created in one instance will not be visible in the others until you manually reload the notebooks or restart the instance.

[Image: 86452-reload.png]

Two things need to be considered, according to your use case, when setting up multiple instances:

1) Storage

2) Accessibility

Storage:

  • Shared: From HDP 2.6.3 onwards, HDFS is available as storage, so all Zeppelin instances can share the same storage and access the notebooks and configuration hassle-free. If you are using an older HDP version, refer to the "Multi - Tenancy & HA" section of the article Zeppelin Best Practices.
  • Dedicated: For complete isolation, dedicated storage can be given to each instance (discussed in detail below).

Accessibility:

Dedicated Instance: Each set of users is given access to one particular instance.

 Benefits: No external tool is required, it is easy to implement, and it can be considered partial HA: since the storage is shared, business-critical operations can continue even with one node down.

 Disadvantages: Ineffective use of resources. (Ex. when 10 users are on the first instance, the second and third might be used by only one or two users, so the load will be high on one instance while resources on the others sit idle.)

 Dedicated Instance with dedicated storage and configuration: Each set of users is given access to one particular instance, and each instance gets its own HDFS or local storage space using config groups.

 Benefits: Complete isolation of configuration for different sets of users.

 Disadvantages: Ineffective use of resources.
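
As a rough sketch of what dedicated storage per config group could look like, the helper below prints zeppelin-site overrides that point one group of instances at its own HDFS notebook directory. The function name, the group name and the /user/zeppelin/<group>/notebook path layout are assumptions made for illustration; zeppelin.notebook.storage and zeppelin.notebook.dir are Zeppelin properties, but please verify the storage class against the documentation for your HDP version before applying anything in Ambari.

```shell
# Print the zeppelin-site overrides for one Ambari config group so that
# its instances keep notebooks in a separate HDFS directory.
# Assumption: the directory layout below is hypothetical, not an HDP default.
storage_overrides() {
  group="$1"
  # HDFS-backed notebook repository class (check it against your version).
  printf 'zeppelin.notebook.storage=%s\n' \
    'org.apache.zeppelin.notebook.repo.FileSystemNotebookRepo'
  # One notebook directory per config group.
  printf 'zeppelin.notebook.dir=%s\n' "/user/zeppelin/${group}/notebook"
}
```

For example, storage_overrides analysts prints the two properties you would then enter for a config group named "analysts" through Manage Config Groups.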

Using a load balancer: Putting a load balancer with round-robin balancing in front of the instances is one of the simplest solutions.

 Benefits: Effective use of resources; can be considered full HA.

 Disadvantages: Needs an external tool and proper maintenance.
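
To make the round-robin idea concrete, here is a minimal sketch of how requests would be spread across instances. It only illustrates the distribution logic; in practice an external load balancer does this for you. The instance URLs and the function name are hypothetical.

```shell
# Hypothetical instance list; replace with your own Zeppelin URLs.
ZEPPELIN_INSTANCES="http://zeppelin1.example.com:9995 http://zeppelin2.example.com:9995 http://zeppelin3.example.com:9995"

# Print the instance that request number $1 (0-based) is sent to
# under round-robin: requests cycle through the list in order.
pick_instance() {
  request_no="$1"
  # Count the configured instances.
  count=0
  for unused in $ZEPPELIN_INSTANCES; do
    count=$((count + 1))
  done
  # Position in the list that this request maps to.
  target=$((request_no % count))
  i=0
  for instance in $ZEPPELIN_INSTANCES; do
    if [ "$i" -eq "$target" ]; then
      printf '%s\n' "$instance"
      return 0
    fi
    i=$((i + 1))
  done
}
```

With three instances, requests 0, 1 and 2 go to the first, second and third instance, and request 3 wraps around to the first again, so consecutive requests from the same user can land on different instances.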

 Below is the process to install multiple Zeppelin instances:

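1. Run the following command to register the ZEPPELIN_MASTER component on the new host:

curl -u "$AMBARI_USER:$AMBARI_PASSWORD" -H 'X-Requested-By: ambari' -i -X POST -d '{"host_components" : [{"HostRoles":{"component_name":"ZEPPELIN_MASTER"}}] }' "http://$AMBARI_HOSTNAME:8080/api/v1/clusters/$CLUSTER_NAME/hosts?Hosts/host_name=$NEW_HOST"

Note: Replace $AMBARI_USER, $AMBARI_PASSWORD, $AMBARI_HOSTNAME, $CLUSTER_NAME and $NEW_HOST with the values for your environment. The URL is quoted so that the shell does not treat the "?" as a wildcard.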

2. In Ambari navigate to the host you specified in the above command and click 'Re-Install' for the Zeppelin Server component.
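
If you need to add Zeppelin to several hosts, step 1 can be wrapped in a small script. The sketch below reuses the same endpoint and payload as the curl command in step 1; the function names and the DRY_RUN switch are hypothetical helpers, not part of Ambari. By default it only prints the request so you can review it before sending anything.

```shell
# Build the Ambari endpoint used to add host components to a host.
ambari_host_url() {
  printf 'http://%s:8080/api/v1/clusters/%s/hosts?Hosts/host_name=%s\n' \
    "$1" "$2" "$3"
}

# JSON body that registers the ZEPPELIN_MASTER component.
zeppelin_payload() {
  printf '%s\n' '{"host_components":[{"HostRoles":{"component_name":"ZEPPELIN_MASTER"}}]}'
}

# Register ZEPPELIN_MASTER on the host given as $1.
# With DRY_RUN=1 (the default in this sketch) the request is only printed.
add_zeppelin_instance() {
  new_host="$1"
  url="$(ambari_host_url "$AMBARI_HOSTNAME" "$CLUSTER_NAME" "$new_host")"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "POST $url"
    zeppelin_payload
  else
    curl -u "$AMBARI_USER:$AMBARI_PASSWORD" -H 'X-Requested-By: ambari' \
      -i -X POST -d "$(zeppelin_payload)" "$url"
  fi
}
```

After exporting the Ambari variables, a loop such as for h in host2 host3; do add_zeppelin_instance "$h"; done prints one request per host; set DRY_RUN=0 to actually send them, then continue with step 2 for each host.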

Comments

Hi @fpaul, as you know, in our case we have a bit of experience using multiple Zeppelin instances :), and we have some concerns about it.

If you install multiple instances behind a load balancer, you will run into sync problems with notebooks and configurations.

We tried this setup with HDFS as storage for the notebooks and configs, to have a single place to store them and share them between the instances, but unfortunately this doesn't work.

As far as we understand, this is because Zeppelin makes some kind of local copy and then publishes it (let's say, puts it) to HDFS.

What we found is that sometimes, if you access Zeppelin, close the browser and open the UI again, you will see an outdated version of the notebook. Publishing takes some time, so the last access can be published before the first one; let's say that there is no version control.

Another issue we found is that sometimes a user opens multiple browser tabs and each one can land on a different Zeppelin instance, and again you end up with an inconsistent version of the notebook.

The best approach we found was to deploy multiple Zeppelin instances and balance users manually, that is, allow some groups on one instance, other groups on another instance, and so on.

Cheers!


@GerardReverte thanks for the inputs. Yes, as of now it's best to go for "Dedicated Instance with dedicated storage and configuration" or "Dedicated Instance", as described in the article, for extensive or heavy production use. Thanks for sharing your experience, much appreciated.