Everything is in the title. A lot of people talk about dev, validation and prod environments. HDP offers a great deal of features for security, multi-tenancy and operations, which help when using one physical cluster. Why do I need several clusters for my environments?
Any concrete examples of risks or problems that can happen?
Hi @Tim David. This is as much a philosophical argument as it is a logical one. I know I won't be the only one to drop an answer, but I'll get it started. ;-) Also, the answer is not meant to be preachy. There will be many people reading it, so this answer is meant for an audience with more of a beginner IT/Hadoop background.
1. Upgrades. At a minimum, you'll need a Dev environment to test upgrades to your cluster. This could even be in the cloud. You'll confirm that the upgrade works, that it does not cause problems with existing code, and can confirm the steps/scripts required to perform the upgrade successfully in production.
2. Tuning. YARN, Hive and Tez in particular have a baffling array of "levers and switches" that you can change to affect performance: container sizes, pre-warming, statistics, file types (ORC). Changing those settings in production can improve your own processing but hurt other people's. Dev/QA environments let you play with local and global settings prior to rolling them out in production.
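As a rough sketch of what "local" settings can look like, the snippet below renders a set of session-level Hive `SET` statements that could be tried in a dev session before anything is changed globally. The property names are standard Hive settings, but the values and the helper name `render_session_settings` are purely illustrative assumptions; the right numbers depend on your cluster and workload.

```python
# Illustrative values only (assumptions); tune them for your own cluster.
DEV_SESSION_OVERRIDES = {
    "hive.tez.container.size": "4096",   # Tez container memory, in MB
    "hive.prewarm.enabled": "true",      # pre-warm Tez containers
    "hive.prewarm.numcontainers": "4",   # how many containers to pre-warm
    "hive.stats.autogather": "true",     # gather statistics during inserts
    "hive.default.fileformat": "ORC",    # default file format for new tables
}


def render_session_settings(overrides):
    """Return one Hive SET statement per property, sorted by property name."""
    return [f"SET {key}={value};" for key, value in sorted(overrides.items())]


if __name__ == "__main__":
    # Paste the output at the top of a test script run against the dev cluster.
    for statement in render_session_settings(DEV_SESSION_OVERRIDES):
        print(statement)
```

Session-level overrides like these affect only the session that sets them, which makes them a low-risk way to experiment before promoting a value to the cluster-wide configuration.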
3. Isolation of data/Security. Having one environment means that people are testing against production data... or co-mingling dev data with production data. There are often datasets that contain data the development team should never see. Think PII (personally identifiable information) or PHI (protected health information). Having a dev environment means that masked/minimized data is in the dev cluster, where developers can test away to their heart's content without the security team worrying about failing a security audit (which can have terrible regulatory/financial consequences).
Also, it is easier to pass an audit when you KNOW that the developers have zero access to production. If you only have one environment, you have all kinds of people granted access to that one production server, and you have to prove that person X can only see data in section Y.
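To make the "masked data in dev" idea concrete, here is a minimal sketch of masking sensitive fields before a record is copied to a dev cluster. The column names, the key, and the helpers `mask_value` and `mask_record` are all hypothetical, and keyed hashing is only one possible technique; whether it satisfies a given regulation is a question for your security team.

```python
import hashlib
import hmac

# Hypothetical column names and key; a real key belongs in a secret store.
PII_COLUMNS = {"name", "email", "ssn"}
MASKING_KEY = b"replace-with-a-secret-key"


def mask_value(value, key=MASKING_KEY):
    """Replace a string with a keyed SHA-256 digest, truncated to 16 hex chars."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_record(record, pii_columns=PII_COLUMNS):
    """Return a copy of the record with PII columns masked and others untouched."""
    return {
        column: mask_value(value) if column in pii_columns else value
        for column, value in record.items()
    }


if __name__ == "__main__":
    row = {"name": "Jane Doe", "email": "jane@example.com", "state": "CA"}
    print(mask_record(row))
```

Because the same input always maps to the same masked value, joins between masked tables still work in dev, while the original values stay in production.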
4. Human Nature - "We judge others by their actions, but judge ourselves by our intentions." Change management often feels like an unnecessary roadblock or speed bump, but it's there for a reason. Small changes in one place can cause a ripple effect in others that cannot be seen until after unit tests are complete and system tests take place. Having dev/QA/prod environments in place and a change management process in effect greatly minimizes the effects of these unintended problems.
Good answer @bpreachuk. Agreed and I would be tempted to go down the segregated clusters path with the constraints I work in. But given a chance, I want to move away from this thinking/culture.
Today's technology supports IaaC (infrastructure as code). If an organisation builds a true devops culture, #1 and #2 can easily be addressed by spinning up a cluster at the click of a button. I see them as small, short-lived projects with a defined start and a defined end. Also, dedicated environments may not be needed for something that is infrequent.
#3 can be addressed using various combinations of the security features available out there, and I see it as no different from two teams/users having different access privileges to the data in the same environment, e.g. production.
#4 can be addressed partly by a change in culture and partly by technology, segregating the environments within the cluster at the storage layer and the processing layer.
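As one possible illustration of segregation at the processing layer, the sketch below generates YARN Capacity Scheduler properties that give each environment its own queue. The queue names (`prod`, `qa`, `dev`), the percentages, and the helper `capacity_scheduler_properties` are assumptions made for the example; the property keys follow the Capacity Scheduler naming scheme.

```python
def capacity_scheduler_properties(queues):
    """Build Capacity Scheduler properties for top-level queues under root.

    queues maps a queue name to (guaranteed capacity %, maximum capacity %).
    """
    total = sum(capacity for capacity, _ in queues.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")

    props = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for name, (capacity, max_capacity) in queues.items():
        if max_capacity < capacity:
            raise ValueError(f"maximum below guaranteed capacity for {name}")
        prefix = f"yarn.scheduler.capacity.root.{name}"
        props[f"{prefix}.capacity"] = str(capacity)
        props[f"{prefix}.maximum-capacity"] = str(max_capacity)
    return props


if __name__ == "__main__":
    # Hypothetical split: prod is guaranteed most of the cluster and may use
    # all of it, while dev and qa are capped so they cannot starve prod.
    layout = {"prod": (70, 100), "qa": (20, 30), "dev": (10, 20)}
    for key, value in capacity_scheduler_properties(layout).items():
        print(f"{key}={value}")
```

Capping the maximum capacity of the non-production queues is the design choice that matters here: it bounds how much a runaway dev job can take from production workloads on the shared cluster.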
I have been in situations where different parts of an organisation (sometimes different programmes within the same sub-organisation) do not agree on things, e.g. one wanting to upgrade for the new features available in the newer release and another not willing to pay for regression testing of their workloads, resulting in a proliferation of the software in production environments. I have also seen examples of HDP clusters proliferating in production, resulting in data ponds instead of one big data lake on a single HDP cluster. Sometimes I wonder if we are replicating the traditional problems in big data set-ups.
In short, if you have a strong devops team with strong expertise and an appetite for IaaC and CI/CD, and depending on your organisation's culture, I think a single environment is definitely possible. I believe if we aim high, we will achieve high.
Let me know your views.
Hi @Kiran Kumar. You make very good points. Virtual/Cloud environments make Points #1 and #2 easy to implement.
I think what is important is that organizations plan out and address all of the pain points that a single environment entails. To put it even more simply: make a conscious decision rather than just driving blindly. Understand the pros and cons, and make a wise decision involving all teams/stakeholders.
Process is also key. Having a change management culture is crucial.
If you yourself were to create a single environment as you describe in your answer, it would accomplish many if not all of the separation/requirements that multiple environments provide. Specifically your Point #4 answer - even on the same cluster you would be effectively creating multiple environments.
Absolutely. My thinking (although very hard to implement due to various constraints):
1. Exploit the newer technology as much as you can.
2. Empower and encourage the administrators, developers, testers and release managers to automate infra, test, deploy etc. Encourage the teams to embrace devops culture.
3. Empower them to do intelligent stuff with good quality rather than repetitive stuff with poor quality.
4. Spend money on training people rather than spending it on the tin.
5. Cut down the lead times for doing things, making the organisation truly agile.
6. Cut down the total cost of ownership for the business and for the projects.
We are now living in a "disruptive era", so let our actions also be disruptive. These things will help organisations in the long run, as businesses and projects do not cut corners while delivering projects. As an architect, I work hard at protecting the architecture view when the double-edged sword of delivery cost and delivery risk is at my neck.