I have an HDP v2.5.0/HDF v2.0.1 cluster that I need to secure for production. For political reasons I don't want to kerberize the cluster, although I know that is not a good idea for a production environment. Anyway, here are the steps I have investigated so far.
- Integrate all of the possible applications with AD.
- Map users/groups from AD using SSSD.
- Install Ranger and Knox and integrate them with AD.
- Harden security with firewall rules (block at the firewall every application that cannot authenticate users).
My first question is: did I miss any required step? Also, do you have a step-by-step guide for the user/group mapping part?
Hi @Ali Nazemian, this is a broad question. The foundations of security are Authentication, Authorization, and Audit (AAA) as well as Confidentiality, Integrity, and Availability (CIA).
Strong authentication using Kerberos, fine-grained authorization using Ranger, perimeter security with network architecture and the use of gateway services like Knox are all important aspects. One thing you didn't mention was protecting data at-rest with encryption, which is something worth considering for sensitive data sets.
Yes, SSSD (or a solution like Centrify) is important to map AD users to local Linux users in order to launch YARN containers in a secure cluster.
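As a rough way to check that the mapping works on a given node, the sketch below asks the operating system (and therefore SSSD, if it is wired into NSS) whether a user resolves and which groups it belongs to. This is only an illustration: the function name `resolve_user` and the sample user name in the comment are my own, and it assumes a Linux node with Python 3.

```python
import grp
import os
import pwd


def resolve_user(username):
    """Return uid, primary gid and group names for a user, or None if unknown.

    The lookups go through the system's NSS configuration, so on a node
    where SSSD is set up correctly an AD user should resolve here as well.
    """
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        return None  # the OS (and SSSD) does not know this user

    group_names = []
    for gid in os.getgrouplist(username, entry.pw_gid):
        try:
            group_names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            group_names.append(str(gid))  # gid has no resolvable name

    return {"uid": entry.pw_uid, "gid": entry.pw_gid, "groups": sorted(group_names)}


# Hypothetical usage on a cluster node:
# print(resolve_user("some_ad_user"))
```

If this returns `None` for an AD user on any node, YARN will not be able to launch containers as that user on that node. Running `hdfs groups <username>` should show the same groups from Hadoop's point of view.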
Let's start with the first point about "integrating the applications with AD" without kerberizing the cluster, as I don't follow that point. Strong authentication using Kerberos is the foundation for all security functionality; without kerberizing the cluster, users can lie about their identities to the services, and therefore authorization policies are irrelevant.
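To make the "users can lie about their identities" point concrete, here is a minimal Python sketch of how a WebHDFS request names its caller on a cluster without Kerberos. Under simple authentication, WebHDFS takes the caller's identity from the `user.name` query parameter. The host `nn.example.com`, the path `/data/finance`, and the helper `webhdfs_url` are made up for illustration, and port 50070 is an assumption about where your NameNode serves HTTP.

```python
from urllib.parse import urlencode


def webhdfs_url(namenode, path, op, user, port=50070):
    """Build a WebHDFS REST URL for a cluster using simple (non-Kerberos) auth.

    The identity is just the `user.name` query parameter chosen by the caller.
    """
    query = urlencode({"op": op, "user.name": user})
    return f"http://{namenode}:{port}/webhdfs/v1{path}?{query}"


# Same client, two different claimed identities. Nothing in the request
# proves which one (if either) is true.
as_alice = webhdfs_url("nn.example.com", "/data/finance", "LISTSTATUS", "alice")
as_hdfs = webhdfs_url("nn.example.com", "/data/finance", "LISTSTATUS", "hdfs")

print(as_alice)
print(as_hdfs)
```

Any Ranger policy that says "only `hdfs` may read this path" is evaluated against whatever name the client put in the URL, which is why authorization without strong authentication does not protect much.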
I think we need to drill into the nature of the "political decision" you mention. Perhaps an architecture like the following would be more palatable organizationally: use a local KDC like MIT-KDC or FreeIPA to kerberize the cluster services, and then implement a one-way trust between AD and the local KDC so that users in AD can authenticate as appropriate.
Perimeter security is really a separate concern from authentication. It's about defense in depth.
If you don't want human users to interface directly with Hadoop services, you could create an application proxy tier and enforce authentication there (although it's uncommon to disallow direct access for all users). That still doesn't eliminate the need to kerberize the Hadoop services. For one thing, what's stopping an application from impersonating another app and escalating its privileges when connecting to a Hadoop service?
Regarding your question about AD privileges to create, modify, and delete users, that's actually not a requirement. It's only required if you want to use the Ambari wizard to kerberize the cluster. Your AD administrator can create the principals manually if using the wizard violates your organizational policies. That said, those privileges are restricted to delegated authority within a particular OU anyhow.
My problem is with the requirement of having "Create, delete, and manage user accounts" privileges on AD in order to kerberize the cluster. Do you know why we need such access for this purpose? It doesn't make sense to me.
A fresh cluster is configured in an insecure state by default. Essentially, anyone with access to the nodes can submit jobs and gain access to the data. Older security models assumed firewalls were sufficient to protect systems, but as every breach teaches us, that's not at all the case. @slachterman is absolutely correct in his assessment, so I won't duplicate what he has already stated. Perhaps I can add some color about the Kerberization process itself to make that a little clearer.
When a new cluster is built, each service is configured to start under the credentials of a local user on each node. Without centralized identity management, you may also be dealing with local user accounts on each node. Open source solutions such as SSSD can help centralize identity into Active Directory, but they were never designed to support large-scale production deployments or complex Active Directory environments. SSSD also lacks the granularity to limit enumeration only to authorized users and does not address a host of other essential security needs. This is why @slachterman mentioned Centrify as a premium capability in this realm. It provides all of these features and more to make the cluster production-ready in a greatly simplified way.
When a cluster is Kerberized, all of those local accounts for each service are abandoned and a new service principal account is created in Active Directory. It follows MIT Kerberos standards, so you will have one account per principal in the format service/host@REALM (e.g. HTTP/hdp1n1.acme.com@ACME.COM). In a larger cluster, there may be dozens of these accounts, so you'll definitely want Ambari to create them for you and distribute the associated keytabs to each node. You will need LDAPS connectivity to a domain controller to do so, so be prepared to do a little certificate management as well. Centrify handles this as part of their LDAP Proxy configuration, which will not only help you with setting up secure communications but can also be used later for your application integrations (Ambari, Ranger, etc.). Once configured, all cluster services will be centrally authenticated and only users with valid service tickets will be able to access cluster resources.
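To give a feel for the service/host@REALM format and for how quickly the number of accounts grows, here is a small Python sketch. It is not what Ambari does internally; the helper names and the service and host lists are assumptions chosen for illustration.

```python
from typing import List, NamedTuple


class Principal(NamedTuple):
    service: str
    host: str
    realm: str


def parse_principal(text: str) -> Principal:
    """Split a service principal of the form service/host@REALM into its parts."""
    name, at, realm = text.rpartition("@")
    if not at or not name or not realm:
        raise ValueError(f"missing realm in principal: {text!r}")
    service, slash, host = name.partition("/")
    if not slash or not service or not host:
        raise ValueError(f"expected service/host before '@': {text!r}")
    return Principal(service, host, realm)


def expected_principals(services: List[str], hosts: List[str], realm: str) -> List[str]:
    """List one principal per (host, service) pair, the worst case for account count."""
    return [f"{service}/{host}@{realm}" for host in hosts for service in services]


# Illustrative values only: three services on each of three hosts.
services = ["nn", "dn", "HTTP"]
hosts = ["hdp1n1.acme.com", "hdp1n2.acme.com", "hdp1n3.acme.com"]
principals = expected_principals(services, hosts, "ACME.COM")

print(len(principals))
print(parse_principal(principals[0]))
```

Even this toy layout yields nine accounts, each needing a keytab on the right node, which is the main reason to let Ambari (or your AD administrator with a script) handle the creation rather than doing it by hand.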
Again, you can create completely separate architectures such as LDAP/MIT Kerberos or dedicated Active Directory forests, or use tools like SSSD, but all of those options just make the setup much more complex, harder to manage and deploy, and less secure as a result. If you have a corporate directory already, and users in that directory will need access to the Hadoop resources, it only makes sense to use the same directory and Kerberos realm.