Created on 05-15-2017 05:14 PM - edited 08-17-2019 12:59 PM
The documentation of the individual projects describes in detail how these components work on their own and which services they rely on. Since the projects are open source, you can also check the source code for more information. This article therefore aims to summarize rather than explain each process in detail.
In this article I first go through some basic component descriptions to give an idea of which services are in use. Then I explain the “security flow” from a user perspective (authentication -> impersonation (optional) -> authorization -> audit) and provide a short example using Knox.
When reading the article, keep the following figure in mind:
Knox serves as a gateway and proxy for Hadoop services and their UIs, so that they can be reached from behind a firewall without having to open too many ports in the firewall.
For the newest HDP release (2.6.0) use these Knox Docs
Wire Encryption Concepts
To complete the picture: it is very important not only to secure access to services, but also to encrypt the data transferred between them.
Keystores and Truststores
To enable a secure connection (SSL) between a server and a client, an encryption key first needs to be created. The server uses it to encrypt communication. The key is stored securely in a keystore; for Java services, the JKS format can be used. For a client to trust the server, one can export the key's certificate from the keystore and import it into a truststore, which is basically a keystore containing the certificates of trusted services. To enable two-way SSL, the same needs to be done on the client side: after creating a key in a keystore the client can access, import its certificate into a truststore of the server. Commands to perform these actions are:
Generate key in “/path/to/keystore.jks” setting its alias to “myKeyAlias” and its password to “myKeyPassword”. If the keystore file “/path/to/keystore.jks” does not exist, this command will also create it.
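The three actions described above (generate a key, export its certificate, import it into a truststore) map onto three keytool invocations. The following Python sketch only assembles the argument lists and does not execute anything. The helper name, the certificate file name and the truststore path are assumptions made for illustration, and keytool will prompt for any password or certificate detail that is not given on the command line.

```python
import shlex


def keytool_commands(keystore, alias, key_password, cert_file, truststore):
    """Return the three keytool invocations as argument lists (nothing is executed)."""
    # 1. Generate a key pair; keytool creates the keystore file if it is missing.
    generate = ["keytool", "-genkeypair", "-alias", alias, "-keyalg", "RSA",
                "-keypass", key_password, "-keystore", keystore]
    # 2. Export the certificate (the public part) of that key to a file.
    export = ["keytool", "-exportcert", "-alias", alias,
              "-file", cert_file, "-keystore", keystore]
    # 3. Import the certificate into the truststore of the other side.
    import_ = ["keytool", "-importcert", "-alias", alias,
               "-file", cert_file, "-keystore", truststore]
    return [generate, export, import_]


commands = keytool_commands(
    keystore="/path/to/keystore.jks",
    alias="myKeyAlias",
    key_password="myKeyPassword",
    cert_file="/path/to/myKeyAlias.cer",    # hypothetical file name
    truststore="/path/to/truststore.jks",   # hypothetical path
)
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run one of the commands, it could be passed to subprocess.run(cmd, check=True) on a machine where a JDK is installed. For two-way SSL the same three steps would be repeated with the client keystore and the server truststore.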
Authentication Server (AS)
Responsible for issuing Ticket Granting Tickets (TGT)
Ticket Granting Server (TGS)
Responsible for issuing service tickets
Key Distribution Center (KDC)
Talks with clients using the KRB5 protocol
AS + TGS
Contains user and group information and talks with its clients using the LDAP protocol.
Only a properly authenticated user (which can also be a service using another service) can communicate successfully with a kerberized Hadoop service. Without the required authentication, which in this case means proving the identity of both the user and the service, any communication will fail. In a kerberized environment, user authentication is provided via a ticket granting ticket (TGT).
Note: SIMPLE authentication, which is set up by default instead of KERBEROS, allows any user to act as any other user, including the superuser. Strong authentication using Kerberos is therefore highly encouraged.
Technical Authentication Flow
User requests TGT from AS. This is done automatically upon login or using the kinit command.
User receives TGT from AS.
User sends request to a kerberized service.
User gets a service ticket from the Ticket Granting Server. This is done automatically in the background when the user sends a request to the service.
User sends the request to the service using the service ticket.
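The flow above can be sketched as a toy model in Python. This is not real Kerberos: an HMAC stands in for ticket encryption, the password is compared directly (real Kerberos never sends it over the wire), and timestamps, session keys and ticket expiry are left out. All class names, keys and passwords are made up for illustration.

```python
import hashlib
import hmac


def sign(key, message):
    """Stand-in for Kerberos encryption: an HMAC only the key holder can produce."""
    return hmac.new(key, message.encode(), hashlib.sha256).hexdigest()


class ToyKdc:
    """AS and TGS combined in one object. Purely illustrative."""

    def __init__(self, users, tgs_key, service_keys):
        self.users = users                # user name -> password
        self.tgs_key = tgs_key            # secret known only to the KDC
        self.service_keys = service_keys  # service name -> key shared with that service

    def request_tgt(self, user, password):
        # AS: check the credentials and issue a TGT (first two steps of the flow).
        if self.users.get(user) != password:
            raise PermissionError("authentication failed")
        return (user, sign(self.tgs_key, user))

    def request_service_ticket(self, tgt, service):
        # TGS: validate the TGT and issue a ticket for one specific service.
        user, proof = tgt
        if not hmac.compare_digest(proof, sign(self.tgs_key, user)):
            raise PermissionError("invalid TGT")
        return (user, service, sign(self.service_keys[service], user + "@" + service))


class ToyService:
    def __init__(self, name, key):
        self.name = name
        self.key = key

    def handle(self, ticket, request):
        # The service verifies the ticket with its own key, without asking the KDC.
        user, service, proof = ticket
        expected = sign(self.key, user + "@" + self.name)
        if service != self.name or not hmac.compare_digest(proof, expected):
            raise PermissionError("invalid service ticket")
        return f"{request} accepted for {user}"


hdfs_key = b"hdfs-secret"
kdc = ToyKdc({"eric": "erics-password"}, b"tgs-secret", {"hdfs": hdfs_key})
hdfs = ToyService("hdfs", hdfs_key)

tgt = kdc.request_tgt("eric", "erics-password")    # what kinit does
ticket = kdc.request_service_ticket(tgt, "hdfs")   # happens in the background
print(hdfs.handle(ticket, "list /user/eric"))
```

The point of the model is the separation of roles: the service never sees the password, and it can validate a ticket on its own because it shares a key with the KDC.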
Authentication Flow from a User Perspective
Most of the above processes are hidden from the user. The only thing the user needs to do before issuing a request to the service is to obtain a TGT: by logging in on a machine, programmatically, or manually using the kinit command.
Impersonation is the second step, after a user has successfully authenticated at a service. The user must be authenticated, but can then choose to perform the request to the service as another user. If everyone could do this by default, it would raise another security concern and make the authentication process futile. This behaviour is therefore forbidden by default for everyone and must be granted to individual users. It is used by proxy services like Apache Ambari, Apache Zeppelin or Apache Knox. Ambari, Zeppelin and Knox authenticate at the service as the “ambari”, “zeppelin” and “knox” users, respectively, using their TGTs, but can choose to act on behalf of the person who is logged in to Ambari, Zeppelin or Knox in the browser. This is why it is very important to secure these services.
To allow, for example, Ambari to perform operations as another user, set the following properties in core-site.xml: hadoop.proxyuser.ambari.groups, a list of groups whose members may be impersonated, and hadoop.proxyuser.ambari.hosts, a list of hosts from which Ambari may submit impersonated requests. Either property can also be set to the wildcard *.
Authorization defines the permissions of individual users. Once it is clear which user will perform the request, i.e., the actually authenticated user or the impersonated one, the service checks its local Apache Ranger policies to see whether the request is allowed for this user. This is the last checkpoint in the process: a user who passes this step is allowed to perform the requested action.
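How the two proxy-user properties interact can be illustrated with a small Python model. It is a simplified sketch, not Hadoop's implementation: it only handles comma-separated names and the wildcard, and it assumes that both the host check and the group check must pass. The host and group names are invented.

```python
def _as_set(value):
    """Split a comma-separated config value into a set of trimmed entries."""
    return {item.strip() for item in value.split(",") if item.strip()}


def is_impersonation_allowed(conf, proxy_user, target_groups, client_host):
    """Simplified proxy-user check: the host AND the group condition must hold."""
    prefix = "hadoop.proxyuser." + proxy_user
    allowed_hosts = _as_set(conf.get(prefix + ".hosts", ""))
    allowed_groups = _as_set(conf.get(prefix + ".groups", ""))
    # The request must come from a host the proxy user is allowed to use.
    host_ok = "*" in allowed_hosts or client_host in allowed_hosts
    # The impersonated user must belong to at least one allowed group.
    group_ok = "*" in allowed_groups or bool(allowed_groups & set(target_groups))
    return host_ok and group_ok


core_site = {
    "hadoop.proxyuser.ambari.hosts": "ambari-host.example.com",  # hypothetical host
    "hadoop.proxyuser.ambari.groups": "analysts,admins",         # hypothetical groups
}

print(is_impersonation_allowed(core_site, "ambari", ["analysts"], "ambari-host.example.com"))
print(is_impersonation_allowed(core_site, "ambari", ["analysts"], "other-host.example.com"))
print(is_impersonation_allowed(core_site, "zeppelin", ["analysts"], "ambari-host.example.com"))
```

Only the first call returns True. The third call fails because nothing is configured for the “zeppelin” user, which mirrors the rule that impersonation is forbidden unless it is granted explicitly.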
Every time the authorization step is invoked, i.e., policies are checked to determine whether a user's action is authorized, an audit event is logged containing the time, user, service, action, data set and success of the event. No event is logged in Ranger if a user without authentication tries to access data, or if a user tries to impersonate another user without having the appropriate permissions to do so.
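The coupling between the policy check and the audit log can be sketched in a few lines of Python. This is a toy model and not Ranger's API: the class names, the policy format (user, action, path) and the path matching rule are assumptions chosen to keep the example short. The audit record carries the fields listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class AuditEvent:
    time: datetime
    user: str
    service: str
    action: str
    resource: str   # the data set that was accessed
    allowed: bool   # success of the event


class ToyAuthorizer:
    """Checks requests against simple policies and logs every decision."""

    def __init__(self, service, policies):
        self.service = service
        self.policies = policies   # list of (user, action, path) tuples
        self.audit_log = []

    def is_allowed(self, user, action, resource):
        allowed = any(
            user == p_user
            and action == p_action
            and (resource == p_path or resource.startswith(p_path.rstrip("/") + "/"))
            for p_user, p_action, p_path in self.policies
        )
        # Every check produces an audit event, whether it was allowed or denied.
        self.audit_log.append(
            AuditEvent(datetime.now(timezone.utc), user, self.service,
                       action, resource, allowed)
        )
        return allowed


authorizer = ToyAuthorizer("hdfs", [("eric", "WRITE", "/user/eric")])
print(authorizer.is_allowed("eric", "WRITE", "/user/eric/data.csv"))
print(authorizer.is_allowed("eric", "WRITE", "/user/bob/data.csv"))
print(len(authorizer.audit_log))
```

Requests that never reach is_allowed, for example because authentication or impersonation already failed, leave no entry in audit_log, which matches the behaviour described above.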
Example Security Flow Using Apache Knox
Looking at the figure above, you can follow what’s going on in the background when a user, Eric, wants to push a file into the HDFS service on path “/user/eric/” from outside the Hadoop cluster firewall.
User Eric sends the HDFS request, including the file and the command to put that file into the desired directory, while authenticating successfully via the LDAP provider at the Apache Knox gateway using his username/password combination. Eric does not need to obtain a Kerberos ticket. In fact, since he is outside the cluster, he probably does not have access to the KDC through the firewall to obtain one anyway.
The Knox Ranger plugin checks whether Eric is allowed to use Knox. If he is not, the process ends here. This event is logged in Ranger audits.
Knox has a valid TGT (and refreshes it before it becomes invalid), obtains a service ticket with it and authenticates at the HDFS namenode as user “knox”.
Knox asks the service to perform the action as Eric, which is configured to be allowed.
The Ranger HDFS plugin checks whether Eric has the permission to “WRITE” to “/user/eric”. If he does not, the process ends here. This event is logged in Ranger audits.
File is pushed to HDFS.
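From Eric's side, the first step of this flow is an HTTPS request to Knox carrying his LDAP credentials. The Python sketch below builds such a request for the WebHDFS CREATE operation without sending it. The gateway host, the “gateway” path segment, the “default” topology name, the file name and the password are assumptions; the actual values depend on the Knox configuration.

```python
import base64
import urllib.parse
import urllib.request


def build_knox_webhdfs_create(gateway, topology, hdfs_path, user, password):
    """Build (but do not send) the initial WebHDFS CREATE request routed through Knox."""
    url = "{}/gateway/{}/webhdfs/v1{}?op=CREATE".format(
        gateway.rstrip("/"), topology, urllib.parse.quote(hdfs_path)
    )
    request = urllib.request.Request(url, method="PUT")
    # Knox authenticates Eric with username/password (HTTP Basic), not Kerberos.
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    return request


request = build_knox_webhdfs_create(
    gateway="https://knox.example.com:8443",   # hypothetical gateway address
    topology="default",                        # assumed topology name
    hdfs_path="/user/eric/data.csv",           # hypothetical target file
    user="eric",
    password="erics-ldap-password",            # placeholder
)
print(request.get_method(), request.full_url)
```

WebHDFS answers a CREATE request with a redirect to the location the file content should be sent to, so a real upload needs a second PUT with the data. Everything after Knox (the Kerberos authentication as “knox”, the impersonation of Eric and the Ranger checks) happens inside the cluster and is invisible to the client.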
I hope this article helps you get a better understanding of the security concepts within the Hadoop ecosystem. I published the original article on my blog.