Recently I started managing a Hadoop cluster that uses Kerberos for authentication. In the current setup there is only one client, a virtual machine with limited memory. All users connect to this client VM and, after authenticating with Kerberos, submit Spark jobs to the cluster from there.
The problem is that this is clearly a bottleneck.
My question is: how do you configure the client machines?
I would appreciate any input that helps me better understand Hadoop and the HDP architecture.
Thanks in advance and regards,
Why is this a bottleneck? As long as the users run their Spark jobs in YARN, there should not be a big impact on the client node. So what is actually happening on the client node that is heavy? The general idea is to move as much of that as you can onto the cluster.
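For example, submitting in YARN cluster mode keeps the Spark driver (and its memory footprint) on the cluster instead of on the client VM. The keytab path, principal, and resource sizes below are just placeholders for illustration:

```shell
# Authenticate against the KDC first (keytab and principal are examples)
kinit -kt /etc/security/keytabs/alice.keytab alice@EXAMPLE.COM

# Run the Spark driver inside YARN, not on the edge node
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2g \
  --executor-memory 4g \
  --num-executors 4 \
  my_job.py
```

With `--deploy-mode client` (the default, and the only mode for interactive shells like `spark-shell`), the driver JVM stays on the edge node, which is where the memory pressure comes from.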
There are some tasks you may have to do on an edge node (compressing and uploading files, for example), but in that case you can add multiple edge/client nodes.
There is also the possibility of connecting to the cluster with UIs like Hue/Zeppelin/Ambari Views to run commands (Zeppelin, for example, can run Spark jobs).
Could you go into a bit more detail about where you actually see problems, what heavy tasks you run on the client, and what kinds of tasks apart from Spark you would be running? That might help the community help you.
@mkumar thanks for the link, I will check.
@bleonhardi, one of the bottlenecks is that the users like to use apps such as Jupyter notebooks. Each notebook consumes around 4 GB of memory, and since the notebooks run on the client, that is the first bottleneck.
I would like to know the best options and best practices regarding clients.
Right now I have only one virtual machine as a client and I cannot deploy more, so I am wondering what the alternatives are.
Not sure what you expect to hear. In the end you can run a Zeppelin or Jupyter notebook server in the backend as well. New distributions of HDP come with a tech preview of Zeppelin as an Ambari service.
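One common pattern for Jupyter specifically (assuming Livy is running on the cluster; the hostname and port below are placeholders) is to point the notebooks at Livy via sparkmagic, so the Spark session lives in YARN instead of inside the notebook kernel on the client:

```shell
# Install the sparkmagic kernels on the (thin) client
pip install sparkmagic

# Point sparkmagic at the cluster's Livy endpoint (placeholder URL)
mkdir -p ~/.sparkmagic
cat > ~/.sparkmagic/config.json <<'EOF'
{
  "kernel_python_credentials": {
    "url": "http://livy-host.example.com:8998",
    "auth": "Kerberos"
  }
}
EOF
```

That way the client only holds the notebook UI and a thin HTTP connection, and the 4 GB per user lands on the cluster where YARN can manage it.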
But I am really not sure what kind of "best practices" you want to hear. In the end, if end users have to run expensive client apps, you either need more client servers or you need to move the apps to a server; this doesn't really have anything to do with Hadoop itself.
Thanks for your feedback. I just wanted to hear how others are managing their client machines. To connect to a Hadoop cluster you need a client, and I wanted to know how others handle that (virtual machines or physical machines, how many, one client per user or one client for everyone, users' own computers or dedicated hardware, ...).