Created 01-24-2017 06:17 AM
Accessing Zeppelin Notebook with 15 users at the same time makes it painfully slow to execute their queries.
So how can we make Zeppelin more scalable, so it can handle various queries on nearly 50 GB of data from multiple users (say 30-50) at the same time without slowing the response time?
I understand Zeppelin uses memory for its execution, but we have a 6-node cluster with 27 GB RAM, 8 cores, and 512 GB of disk on each node. For reference, Zeppelin is configured on one of the master nodes.
Looking for some viable suggestions.
Many thanks in advance.
Created 01-24-2017 09:12 AM
It depends on whether these users share the same SparkContext. Zeppelin only supports yarn-client mode for the Spark interpreter, which means the driver runs on the same host as the Zeppelin server. If you run the Spark interpreter in shared mode, all users share the same SparkContext, and you can increase the executor size and executor count to match the query load. But if you run it in isolated mode (each user owns its own SparkContext, launching a new Spark app), that would be almost impossible for your cluster, because there would be 30-50 driver processes on your Zeppelin server, which would eat up its resources (see the sketch below).
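To make the isolated-mode concern concrete, here is a rough back-of-the-envelope calculation; the 1 GB figure is just Spark's default `spark.driver.memory`, so substitute your actual setting:

```
# Isolated mode: each user gets a dedicated driver JVM on the Zeppelin host.
#   spark.driver.memory = 1g        (Spark default; each JVM also needs some overhead)
#   30-50 users  ->  roughly 30-50+ GB of driver memory on one host
#   Zeppelin host has 27 GB RAM total  ->  does not fit
```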
Created 01-24-2017 12:54 PM
Thank you for the help @jzhang
So, what I believe is that Zeppelin is configured to use yarn-client for the Spark interpreter, and the Spark home is set to the /usr/hdp/current/spark directory on the same host (master). The Spark interpreter is already in shared mode, although spark.executor.memory is set to 512 MB (the default). So what do you think is the actual problem here?
And what if users are executing Hive queries using the JDBC interpreter, does it make any difference, or will it be okay if the SparkContext and executors are correctly set? I am sharing Zeppelin's current configuration for reference (attachments: zeppelin-1.png, zeppelin-2.png, zeppelin-3.png).
Created 01-24-2017 01:02 PM
> So what do you think is the actual problem here?
It is hard to say what the problem is; it depends on your cluster size and data size. In my experience, 512 MB per executor is usually too small. You might need to increase it to 4 GB, with 4 cores for each executor. That means you can run at most 4 tasks per executor, each consuming 1 GB of memory. You may also need to set the executor number; IIRC, the default is 2, but if you want the context to be shared by many users, then increase it (see the sketch below).
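As a concrete illustration, these are the Spark interpreter properties being described (set on the spark interpreter in Zeppelin's Interpreter page); the instance count of 8 is only an example to adjust to your cluster's free YARN capacity:

```
spark.executor.memory      4g
spark.executor.cores       4
spark.executor.instances   8
```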
> does it make any difference or will be okay if sparkContext and executors are correctly set?
The question is a little confusing. Do you use the Hive thrift server or the Spark thrift server? If you are using the Hive thrift server, then it is not related to Spark.
Created 01-24-2017 01:37 PM
Thanks for the first answer. It makes sense to increase the configured memory size from 512 MB to 4 GB; I will give it a try and monitor the response time.
Sorry for not being clear on the 2nd question. I meant to ask: do we need to tune the JDBC interpreter configuration to make Hive queries fast when users execute them in the Zeppelin notebook?
If I'm not mistaken, the cluster is set up with Hive using the HiveServer2 thrift server. So, should we replace this with the Spark thrift server in order to utilize the SparkContext configuration? (Sorry for bothering you with such basic questions; I am new to this, so I am keen to learn about it.)
Created 01-24-2017 01:42 PM
HiveServer2 uses Tez as its execution engine, while the Spark thrift server uses Spark as the underlying execution engine.
There's nothing to configure on the Zeppelin side for performance tuning; you just need to make changes on HiveServer2 or the Spark thrift server. If you do switch to the Spark thrift server, Zeppelin's JDBC interpreter only needs to be pointed at it, as sketched below.
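For example, pointing Zeppelin's JDBC interpreter at a Spark thrift server is mainly a connection-string change. The hostname below is a placeholder, and 10015 is the usual HDP default port for the Spark thrift server; check your own installation:

```
# JDBC interpreter properties (Interpreter > jdbc)
default.driver   org.apache.hive.jdbc.HiveDriver
default.url      jdbc:hive2://<spark-thrift-server-host>:10015
default.user     hive
```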
Created 02-05-2018 06:34 PM
Have you considered the Spark Livy interpreter? With that, you won't be running multiple Spark drivers on the Zeppelin node, which frees up memory and CPU there, and it also supports user impersonation.
https://zeppelin.apache.org/docs/0.7.2/interpreter/livy.html
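A minimal sketch of the Livy interpreter settings, assuming a Livy server is already running on its default port 8998; the hostname is a placeholder and the executor sizing values are only examples:

```
# Livy interpreter properties (Interpreter > livy)
zeppelin.livy.url            http://<livy-server-host>:8998
livy.spark.executor.memory   4g
livy.spark.executor.cores    4
# User impersonation also needs livy.impersonation.enabled=true in livy.conf
```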
Created 09-20-2018 10:46 PM
The Livy interpreter still does not support the Zeppelin Context (e.g. doing `z.show`).