We can secure data in Hadoop using several methods. Each method has its own advantages, and we can combine more than one method for better results.
1. Kerberos
Kerberos is a network authentication protocol.
Advantage: Authenticates users at the point of entry.
Limitation: Kerberos prevents unauthorized users from accessing the environment, but after login it does not provide fine-grained authorization at the table, column, folder, or file level.
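As a rough illustration of the entry-level check described above, the sketch below obtains a Kerberos ticket and then runs an HDFS command. The principal `alice@EXAMPLE.COM` and the path `/user` are placeholders, and the `run` helper is only a dry-run wrapper added for this example: it prints each command and executes it only when `RUN=1` is set on a real, Kerberos-enabled cluster.

```shell
# Dry-run helper (illustrative): prints each command, executes it only when RUN=1.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Hypothetical principal; substitute your own user and realm.
PRINCIPAL="${PRINCIPAL:-alice@EXAMPLE.COM}"

kerberos_login_and_list() {
  run kinit "$PRINCIPAL"     # obtain a ticket-granting ticket (prompts for the password)
  run klist                  # show the cached ticket and its expiry time
  run hadoop fs -ls /user    # the HDFS client authenticates with the cached ticket
}

kerberos_login_and_list
```

Without a valid ticket, the last command is rejected on a secured cluster. With one, Kerberos has done its job, and what the user may read or write is decided by the other mechanisms in this post.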
2. Apache Sentry
Apache Sentry is a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster.
Advantage: Application-level authorization for Hive, Impala, Solr, etc. It can control access at the database, table, and column level for a particular user/group.
Limitation 1: It cannot control access to the HDFS folders that underlie applications like Hive, Impala, etc.
Ex: The Hive table prod.table1 is stored in /user/hive/warehouse/prod.db/table1. A Sentry role set up in Hue can control only table/column access in Hue, but a user may still manage to access the folder directly in HDFS.
Limitation 2: HDFS folders that are not related to Hive, Impala, etc. are not controlled.
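The Sentry setup described above can also be done with SQL statements instead of Hue. The sketch below assumes HiveServer2 is reachable at a placeholder URL and that Sentry is already enabled for Hive. The role `analyst`, the group `analysts`, and the column `col1` are made-up names. The `run` helper is a dry-run wrapper added for this example: it prints each command (without the original quoting) and executes it only when `RUN=1`.

```shell
# Dry-run helper (illustrative): prints each command, executes it only when RUN=1.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Hypothetical HiveServer2 URL, connected to the "prod" database.
# A Kerberos-enabled cluster usually also needs a ;principal=... suffix.
HS2_URL="${HS2_URL:-jdbc:hive2://hs2.example.com:10000/prod}"

sentry_grant_column() {
  role="$1"; group="$2"; table="$3"; column="$4"
  run beeline -u "$HS2_URL" -e "CREATE ROLE $role"
  run beeline -u "$HS2_URL" -e "GRANT ROLE $role TO GROUP $group"
  # Column-level privilege: members of the group may SELECT only this column.
  run beeline -u "$HS2_URL" -e "GRANT SELECT($column) ON TABLE $table TO ROLE $role"
}

sentry_grant_column analyst analysts table1 col1

# Limitation 1 in practice: this path is checked against HDFS permissions,
# not against the Sentry role created above.
run hadoop fs -ls /user/hive/warehouse/prod.db/table1
```

The final command shows why the warehouse folder still needs its own HDFS permissions: the grants only apply to requests that go through Hive or Impala.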
3. Access Control List (ACL)
An access control list (ACL) is a list of access control entries (ACEs).
Each ACE in an ACL identifies a trustee and specifies the access rights allowed, denied, or audited for that trustee.
Advantage: Folder-level access can be granted to individual users with hadoop fs -setfacl.
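A minimal sketch of folder-level ACLs with `hadoop fs -setfacl` follows. It assumes ACLs are enabled on the NameNode (`dfs.namenode.acls.enabled`). The user `alice` and the folder `/data/secure` are placeholders. The `run` helper is a dry-run wrapper added for this example: it prints each command and executes it only when `RUN=1`.

```shell
# Dry-run helper (illustrative): prints each command, executes it only when RUN=1.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Give one user read and list access to a folder, then display the ACL.
grant_read_acl() {
  user="$1"; path="$2"
  run hadoop fs -setfacl -m "user:${user}:r-x" "$path"   # -m adds or modifies an entry
  run hadoop fs -getfacl "$path"                         # print the folder's ACL entries
}

# Remove that user's entry again.
revoke_acl() {
  user="$1"; path="$2"
  run hadoop fs -setfacl -x "user:${user}" "$path"       # -x removes the named entry
}

grant_read_acl alice /data/secure
revoke_acl alice /data/secure
```

Adding `-R` to `-setfacl` applies the change recursively to everything under the folder.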
4. HDFS Data At Rest Encryption (EDEK)
HDFS Encryption implements transparent, end-to-end encryption of data read from and written to HDFS
Advantage: Encrypting the data provides an additional level of security. In general, data encryption is required by a number of government, financial, and regulatory entities.
Ex: A user without access to the encryption key cannot read the file's contents in plain text:
hadoop fs -cat /data/File.txt
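For context, here is one way an encryption zone covering `/data` might be set up. This is a sketch under assumptions: a Hadoop KMS is configured as the key provider, the commands are run by a user allowed to create keys and zones, and the key name `mykey` is a placeholder. The `run` helper is a dry-run wrapper added for this example: it prints each command and executes it only when `RUN=1`.

```shell
# Dry-run helper (illustrative): prints each command, executes it only when RUN=1.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Create a key and turn an empty directory into an encryption zone.
create_encryption_zone() {
  key="$1"; zone="$2"
  run hadoop key create "$key"                                # create the zone key in the key provider
  run hadoop fs -mkdir -p "$zone"                             # the directory must be empty
  run hdfs crypto -createZone -keyName "$key" -path "$zone"   # mark it as an encryption zone
  run hdfs crypto -listZones                                  # confirm the zone exists
}

create_encryption_zone mykey /data

# A user authorized for the key reads plain text; decryption happens in the client.
run hadoop fs -cat /data/File.txt
# The HDFS superuser can inspect the stored bytes, which are encrypted.
run hadoop fs -cat /.reserved/raw/data/File.txt
```

Files written into the zone afterwards are encrypted transparently, so existing applications do not need to change how they read or write.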
This is a well written blog. I had a few points to add to it:
Kumar, pretty good informative points.
One question regarding ACLs: if Sentry is enabled, do we need to disable ACLs? In other words, if Sentry is enabled for Hive, are ACLs still required or not? I read in the Cloudera knowledge base, in the information on enabling Sentry, that Cloudera recommends not enabling ACLs when Sentry is enabled.
Is there any security mechanism for fine-grained access control on Spark SQL queries? How can I restrict users to accessing only certain columns? I know there is RecordService, which is in beta. Are there any other solutions that folks have used?
Apache Sentry will help you restrict user access to a db/table/column for Hive, Impala, Solr, etc.
You can set this access for a group/user using a role.
So access to those dbs/tables/columns via Spark code will also be authorized by Sentry.
After access to a particular db/table was granted via Sentry to one user/group, I logged in to HDFS as a different user and tried to access the restricted db/table, and that user could not access it. So in my personal opinion, it is not necessary to apply ACLs on top of an already restricted db/table, and we can go with Cloudera's recommendation.
But consider the use case where you have an important file/folder in HDFS (not a table) that you want to restrict from other users. In that case you can use ACLs.