Created on 10-07-2019 11:12 PM - edited 10-07-2019 11:15 PM
Motivation: As Zeppelin is used for data ingestion, discovery, analytics, and visualization, usage of this multi-purpose notebook is increasing tremendously, pushing users to find a way to scale out. The intention of this article is to give a fair idea of how Zeppelin can be scaled out with the options currently available in HDP 2.6.x.
Disclaimer: This article is based on my personal experience and knowledge. Don't take it as a standard guideline; understand the concept and adapt it to your environment's best practices and use case.
Beefing up the Zeppelin/interpreter memory and tweaking the factors that affect performance will help to a certain extent, but when notebook usage keeps increasing it is wise to scale out.
In this article we are going to discuss various considerations and best practices for scaling out. Note: Please refer to "Key factors that affects Zeppelin's Performance" before deciding to scale out. It also covers best practices for maintaining a Zeppelin environment.
Benefits of multiple instances:
Load sharing and performance
Zeppelin currently doesn't have HA; by adding more instances behind a load balancer (not part of HDP), we can achieve it.
Adding multiple instances of Zeppelin is not exposed in the UI because of a few limitations. (The limitations I am aware of: there is no HA support without an external tool, and a notebook created in one instance of Zeppelin will not be visible in the others until you manually reload the notebooks or restart the instance.)
Two things need to be considered, according to your use case, when setting up multiple instances:
1) Storage
Shared: From HDP 2.6.3 onwards, HDFS is available as notebook storage, so all the Zeppelin instances can share the same storage and access the notebooks and configuration hassle-free. If you are using an older HDP version, refer to the "Multi - Tenancy & HA" section of the article Zeppelin Best Practices.
Dedicated: For complete isolation, dedicated storage can be given to each instance (discussed in detail below).
2) Accessibility
Dedicated instance: We can give each set of users access to one particular instance.
Benefits: No external tool is required, it is easy to implement, and it can be considered partial HA. Because the storage is shared, business-critical operations can continue even with one node down.
Disadvantages: Ineffective use of resources. (For example, when 10 users are working on the first instance, the second and third might be used by only one or two users, so the load will be high on one instance while resources on the others sit idle.)
Dedicated instance with dedicated storage and configuration: We can give each set of users access to one particular instance, and give each instance its own HDFS or local storage space using config groups.
Benefits: Complete isolation of configuration for different sets of users.
Disadvantages: Ineffective use of resources.
Using a load balancer: Putting a load balancer with round-robin load balancing in front of the instances is one of the simplest solutions.
Benefits: Effective use of resources; can be considered full HA.
Disadvantages: Needs an external tool and proper maintenance.
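To make the resource trade-off between dedicated instances and round-robin balancing concrete, here is a minimal Python sketch. It is only an illustration, not how any particular load balancer is implemented: the instance names, team names, and user names are hypothetical, and the user counts simply mirror the "10 users on the first instance, 1 or 2 on the others" example above.

```python
from collections import Counter
from itertools import cycle

# Hypothetical instance names, used only for this illustration.
INSTANCES = ["zeppelin-1", "zeppelin-2", "zeppelin-3"]


def dedicated_load(team_of_user, instance_of_team):
    """Count users per instance when each team is pinned to one instance."""
    load = Counter({name: 0 for name in INSTANCES})
    for _user, team in team_of_user.items():
        load[instance_of_team[team]] += 1
    return dict(load)


def round_robin_load(users):
    """Count users per instance when instances are handed out in turn."""
    load = Counter({name: 0 for name in INSTANCES})
    for _user, instance in zip(users, cycle(INSTANCES)):
        load[instance] += 1
    return dict(load)


# 10 users in team A, 1 in team B, 2 in team C (assumed numbers).
team_of_user = {f"a{i}": "A" for i in range(10)}
team_of_user.update({"b0": "B", "c0": "C", "c1": "C"})
instance_of_team = {"A": "zeppelin-1", "B": "zeppelin-2", "C": "zeppelin-3"}

dedicated = dedicated_load(team_of_user, instance_of_team)
balanced = round_robin_load(list(team_of_user))

print("dedicated:  ", dedicated)
print("round-robin:", balanced)
```

With dedicated instances the first instance carries 10 of the 13 users, while round-robin spreads the same users 5/4/4 across the three instances.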
Below is the process to install multiple Zeppelin instances.