From your experience, what are the best practices for the following environments (development, testing, pre-production, production, data lab) in terms of:
Hi @David Lays ... wow you don't want to know much in a single question! :o)
I'll try to give you an overview of each one. If you want to go deeper on any of them, I suggest a separate question per topic area.
So first of all, the below applies to all environments, regardless of dev/test/pre-prod/prod.
You also asked about master and slave node mixing, and about the % of global data to store in each environment.
What I would say here is that there is a very strong emerging pattern in a lot of organisations that guides the decisions you make here.
First, you still need Dev, Test, Pre-Prod, Prod etc., but those are for testing your infrastructure.
I.e. whenever you upgrade to a new version of HDP, add a new generation or vendor of hardware, or update a third-party component such as SAS, you run that change through your Dev/Test/Pre-Prod clusters first.
When it comes to your user base, that's a very different conversation.
The datalake is a very real concept nowadays. Datalakes are truly multitenant, and people can store and safely control access to a wide range of data. As a result, what we're seeing is that data scientists, developers and many other categories of users, including those who would usually have been on a separate scaled-down silo, are actually using resources on the production datalake.
Their resource queues are managed so they can't impact production jobs or users, and in some cases they can only access anonymised data rather than data containing the full PII (personally identifiable information). But they can also test and develop their programs and hypotheses against a scale of data that just isn't possible in a "data lab".
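To make the queue idea a bit more concrete, here is a minimal sketch in Python that generates YARN Capacity Scheduler properties for queues sitting directly under root. The property names are the standard `yarn.scheduler.capacity.*` ones, but the queue names (`production`, `datascience`, `dev`) and all the percentages are purely my assumptions for illustration, not a recommendation for your cluster.

```python
def capacity_scheduler_properties(queues):
    """Build Capacity Scheduler properties for queues directly under root.

    queues maps a queue name to (capacity, maximum_capacity), both in percent.
    """
    # Sibling queue capacities have to add up to 100% of the parent.
    total = sum(capacity for capacity, _ in queues.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")

    props = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for name, (capacity, maximum) in queues.items():
        # maximum-capacity is the hard ceiling a queue may grow to when idle
        # resources are available elsewhere, so it cannot be below capacity.
        if not capacity <= maximum <= 100:
            raise ValueError(
                f"{name}: maximum-capacity must be between capacity and 100"
            )
        prefix = f"yarn.scheduler.capacity.root.{name}"
        props[f"{prefix}.capacity"] = str(capacity)
        props[f"{prefix}.maximum-capacity"] = str(maximum)
    return props


# Hypothetical split: production may use the whole cluster when it is idle,
# while the data science and dev queues are capped well below that.
SHARED_LAKE = {
    "production": (70, 100),
    "datascience": (20, 30),
    "dev": (10, 15),
}

if __name__ == "__main__":
    for key, value in capacity_scheduler_properties(SHARED_LAKE).items():
        print(f"{key}={value}")
```

The low `maximum-capacity` on the non-production queues is what stops an experimental job from crowding out production work, even when it is running against the full production data set.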
One thing that you don't mention is DR (disaster recovery). We often see DR assets also being used as a working area for developers and data science users. In the event of a DR situation, a separate set of capacity scheduler queues is deployed so that production workloads take precedence until the DR conditions are resolved.
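As a rough sketch of that DR switch, the Python below keeps two queue layouts for the DR cluster and picks one depending on whether DR has been declared. The layout names, queue names and percentages are all hypothetical, just to show the shape of the idea.

```python
# Hypothetical queue capacities (percent of the DR cluster) for each mode.
QUEUE_LAYOUTS = {
    # Day to day: developers and data scientists get most of the cluster.
    "normal": {"production": 10, "dev": 50, "datascience": 40},
    # DR declared: production workloads take precedence.
    "dr": {"production": 80, "dev": 10, "datascience": 10},
}


def layout_for(dr_declared):
    """Return the queue capacities to deploy for the current situation."""
    layout = QUEUE_LAYOUTS["dr" if dr_declared else "normal"]
    # Sibling queue capacities must add up to 100%.
    if sum(layout.values()) != 100:
        raise ValueError("queue capacities must sum to 100")
    return layout


if __name__ == "__main__":
    for dr_declared in (False, True):
        print(dr_declared, layout_for(dr_declared))
```

Both layouts define the same queues on purpose, so switching between them only changes capacities. On a real cluster the chosen layout would still need to be written into the scheduler configuration and reloaded, for example with `yarn rmadmin -refreshQueues`.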
Hope that helps. It's a complex situation, but this should set you on the right path.