I'm wondering how you all deal with environment segmentation in a big-data world in a cloud context (yes, I'm relatively new to this).
Do you have, say, Dev, QA, and Prod clusters, each in a separate subnet or even VNet, and then edge nodes for each environment in a DMZ subnet?
Or maybe one cluster, with dev, qa, and prod folders on HDFS and separate YARN queues, backed by Ranger?
While the multi-tenant features of HDP (e.g. YARN capacity scheduler, Ranger policies, HDFS quotas, etc.) could be used to combine Dev/QA/Prod environments into a single cluster, it is generally not recommended.
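For illustration, a single-cluster setup like that might carve out per-environment queues in `capacity-scheduler.xml` along these lines (queue names and capacity percentages here are hypothetical, not a recommendation):

```xml
<!-- capacity-scheduler.xml: hypothetical dev/qa/prod queues under root -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,qa,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value> <!-- guaranteed share for production jobs -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.qa.capacity</name>
  <value>20</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>10</value>
</property>
<property>
  <!-- cap dev so it cannot starve prod even when the cluster is idle -->
  <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
  <value>25</value>
</property>
```

You would pair this with Ranger policies restricting each team to its own HDFS paths and queues, plus HDFS space quotas (e.g. `hdfs dfsadmin -setSpaceQuota` on `/dev`). But as noted above, none of this addresses the version-upgrade problem.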
Managing a single cluster instead of three seems easier on the surface, but it is really not worth it. First of all, where are developers going to test against new versions if you only have one cluster? Combining Dev and QA may be an option, but that is more of an organizational decision.
A configuration I like is Prod, DR/Ad-hoc, and Dev/QA. Most companies require a DR environment in sync with production. By making that DR environment read-only, you can run exploratory analytics and/or data science workloads using resources that would otherwise sit idle. Additionally, pulling the lower priority and unpredictable workloads out of production reduces the risk of missing SLAs.
Of course, all of this is use case dependent, and your mileage may vary. The best thing about "big data" technologies is how customizable and broadly applicable they are, and the worst thing is how customizable and broadly applicable they are :)