Best practices for various environments



From your experience, what are the best practices for the following environments (development, testing, pre-production, production, data lab) in terms of:

  • High availability of master nodes
  • High availability of edge nodes (Knox, clients, etc.)
  • Security (Kerberos, Knox, Ranger, etc.)
  • Master and slave node mixing
  • % of global data to store
  • Etc.



Re: Best practices for various environments

Hi @David Lays ... wow you don't want to know much in a single question! :o)

I'll try to give you an overview of each one. If you want to go deeper on any of them, I suggest a separate question per topic area.

So first of all, the below applies to all environments, regardless of dev/test/pre-prod/prod.

  • HA of Masters
    • Do it. Full NameNode (NN) HA is production ready, and I hardly ever see clusters without it nowadays.
  • HA of edge nodes
    • Knox is stateless. To make it HA, you just spin up multiple instances of it and put them behind a session-aware load balancer.
    • Wherever possible, all clients should go through Knox or Ambari Views. Ambari Views can be made HA in the same way as Knox (multiple instances behind a load balancer).
  • Security
    • Kerberos - Do it. Without Kerberos it doesn't matter what else you layer on top: your cluster is insecure. Think of a large safe with a big secure door on the front but no side walls. That is the kind of security you have without Kerberos.
    • Ranger - Do it. Policies can be group based, and groups can be inherited from AD or LDAP.
    • Knox - Strongly recommended. It also saves you from having to update users each time your internal cluster services move around. They just keep talking to Knox, and Knox does the internal mappings.
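
To illustrate what "session aware" means for the Knox HA point above, here is a minimal Python sketch of sticky routing: the same session ID always maps to the same Knox instance. In practice this lives in your load balancer's configuration rather than in your own code, and the function name, host names and session ID below are hypothetical.

```python
import hashlib


def pick_knox_instance(session_id, instances):
    """Map a session ID to one Knox instance so repeat requests stick to it."""
    if not instances:
        raise ValueError("no Knox instances available")
    # A stable hash of the session ID gives the same index on every request.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


# Hypothetical hosts, for illustration only.
KNOX_INSTANCES = ["knox1.example.com:8443", "knox2.example.com:8443"]

first = pick_knox_instance("session-abc123", KNOX_INSTANCES)
second = pick_knox_instance("session-abc123", KNOX_INSTANCES)
print(first == second)  # the same session lands on the same instance
```

A real load balancer would also take unhealthy instances out of the pool; this sketch only shows the stickiness.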

Now you have questions about master and slave node mixing and % of global data to store between the environments.

What I would say here is that there is a very strong emerging pattern in a lot of organisations that guides the decisions you make here.

First, you still need Dev, Test, Pre-Prod, Prod, etc., but those are for testing your infrastructure.

That is, whenever you upgrade to a new version of HDP, add a new generation or vendor of hardware, or update a third-party component such as SAS, you run that through your Dev/Test/etc. clusters.


When it comes to your user base, that's a very different conversation.

The datalake is a very real concept nowadays. Datalakes are truly multi-tenant, and people can store a wide range of data and safely control access to it. What we're seeing is that data scientists, developers and many other categories of users, including those who would usually have been on a separate scaled-down silo, are actually using resources on the production datalake.

Their resource queues are managed so they can't impact production jobs or users, and in some cases they may only be able to access anonymised data rather than data containing the full PII (personally identifiable information). But they can also test and develop their programs and hypotheses against a scale of data that just isn't possible in a "data lab".
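
As a rough sketch of what "managed resource queues" can look like, the Python below builds YARN Capacity Scheduler properties for a few top-level queues and checks that their capacities add up to 100%. The queue names and percentages are made-up assumptions, not recommendations, and the helper function is hypothetical; only the `yarn.scheduler.capacity.*` property names come from the Capacity Scheduler itself.

```python
def capacity_scheduler_properties(queues):
    """Build capacity-scheduler.xml style properties for top-level queues.

    `queues` maps a queue name to (capacity %, maximum-capacity %).
    """
    total = sum(capacity for capacity, _ in queues.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")

    prefix = "yarn.scheduler.capacity.root"
    props = {f"{prefix}.queues": ",".join(queues)}
    for name, (capacity, max_capacity) in queues.items():
        if max_capacity < capacity:
            raise ValueError(f"maximum-capacity below capacity for {name}")
        # capacity is the guaranteed share; maximum-capacity caps how far
        # the queue can grow into idle resources.
        props[f"{prefix}.{name}.capacity"] = str(capacity)
        props[f"{prefix}.{name}.maximum-capacity"] = str(max_capacity)
    return props


# Hypothetical split: production keeps most of the cluster, and the
# other user groups are capped so they cannot starve it.
EXAMPLE_QUEUES = {
    "production": (70, 100),
    "datascience": (20, 40),
    "development": (10, 20),
}

for key, value in capacity_scheduler_properties(EXAMPLE_QUEUES).items():
    print(f"{key}={value}")
```

The `maximum-capacity` cap on the non-production queues is what stops a runaway experiment from taking over the cluster.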

One thing that you don't mention is DR (disaster recovery). We often see DR clusters also being used as working areas for developers and data science users. In the event of a DR situation, a separate set of capacity scheduler queues is deployed so that production workloads take precedence until the DR conditions are resolved.

Hope that helps. It's a complex situation, but this should set you on the right path.

Good luck!