I’m studying HDP in order to deploy it on a fresh OpenStack install.
Is it possible to compartmentalize data within an HDP cluster? In other words, can confidential data be isolated from common data?
- Please read the security guide:
- You can set up Encryption Zones, which encrypt the data stored in these pre-defined zones.
- Ranger enables RBAC, ABAC, and column-level access control. It also enables data masking (e.g. PII, financial, or other sensitive data). A common scenario is that the sensitive data is a subset of the fields in a Hive table; those fields can be masked, or access to them blocked. Atlas lets you define tags (e.g. a PII tag), which can be passed automatically to Ranger to define and enforce policies.
- HDFS is the underlying filesystem. You will need to understand which components (Spark, Hive, HBase, Solr, etc.) you will be using, where the sensitive data lives, and how to manage it.
- HWX has a number of partners with sophisticated capabilities, should there be a requirement beyond what HDP provides out of the box (though check the documentation first).
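To make the Encryption Zones point above a bit more concrete, here is a minimal sketch that only builds the usual command sequence (create a key, create an empty directory, turn it into a zone) without running anything. The key name `confidential_key` and the path `/data/confidential` are made up for illustration, and it assumes a KMS (e.g. Ranger KMS) is already configured for the cluster.

```python
def encryption_zone_commands(key_name, zone_path):
    """Return the CLI commands (as argument lists) that set up one encryption zone.

    key_name and zone_path are hypothetical values chosen by the caller.
    Nothing is executed here; the lists can be passed to subprocess.run
    on a node with the Hadoop client installed.
    """
    return [
        # Create the encryption key in the configured KMS.
        ["hadoop", "key", "create", key_name],
        # The zone directory must exist and be empty before it becomes a zone.
        ["hdfs", "dfs", "-mkdir", "-p", zone_path],
        # Mark the directory as an encryption zone backed by that key.
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", zone_path],
    ]


if __name__ == "__main__":
    for cmd in encryption_zone_commands("confidential_key", "/data/confidential"):
        print(" ".join(cmd))
```

Creating a zone normally requires HDFS superuser rights, so in practice these would be run by an admin rather than by the project users.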
Thank you, Graham, for your answer.
It's very interesting and exhaustive.
I will take the time to read the whole documentation.
One last question: is it possible to choose the DataNodes on which data will be stored? For instance, Project A on DataNodes EAST1 and EAST2, and Project B on DataNodes WEST1 and WEST2?
Isolation is usually provided at the logical level by Ranger. If you have multiple tenants (Project A and Project B), trying to manually manage data locality will get very difficult very quickly. Keep in mind also that the default replication factor is 3 (each block resides on three nodes in the cluster).
The same is true for workloads (i.e. YARN queue management): it is easier to manage them logically and assign each tenant a percentage of resources than to try to carve up the cluster physically (though node labels can offer some flexibility).
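As a rough illustration of what logical isolation per project could look like, here is a sketch that builds a path-based Ranger HDFS policy as a plain dictionary. The field names follow the Ranger policy JSON model as I understand it, so please verify them against your Ranger version; the service name `cluster_hadoop`, the paths, and the group names are all hypothetical.

```python
import json


def project_policy(service, project, path, group):
    """Build a Ranger-style HDFS policy granting one group access to one path.

    All argument values are illustrative; the structure is an assumption
    based on the Ranger public v2 policy model and should be checked
    against the Ranger documentation for your release.
    """
    return {
        "service": service,              # name of the HDFS service defined in Ranger
        "name": f"{project}-data",       # policy name, one policy per project
        "isEnabled": True,
        "resources": {
            "path": {"values": [path], "isRecursive": True},
        },
        "policyItems": [
            {
                "groups": [group],
                "accesses": [
                    {"type": t, "isAllowed": True}
                    for t in ("read", "write", "execute")
                ],
            }
        ],
    }


if __name__ == "__main__":
    for name, path in (("project_a", "/projects/a"), ("project_b", "/projects/b")):
        policy = project_policy("cluster_hadoop", name, path, f"{name}_users")
        print(json.dumps(policy, indent=2))
```

The resulting JSON would typically be created through the Ranger admin UI or its REST API. Each project then only sees its own directory tree, regardless of which DataNodes hold the blocks.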
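For the percentage-of-resources approach, a minimal sketch of the corresponding Capacity Scheduler settings is below. It only generates the property names and values for `capacity-scheduler.xml`; the queue names `project_a` and `project_b` and the 60/40 split are assumptions for illustration.

```python
def queue_properties(shares):
    """Map {queue_name: percent} to Capacity Scheduler properties under root.

    Sibling queue capacities must add up to 100, so anything else is rejected.
    """
    if sum(shares.values()) != 100:
        raise ValueError("queue capacities must sum to 100")
    prefix = "yarn.scheduler.capacity.root"
    # List the child queues of root, in the order given.
    props = {f"{prefix}.queues": ",".join(shares)}
    for queue, pct in shares.items():
        # Guaranteed share of cluster resources for this tenant's queue.
        props[f"{prefix}.{queue}.capacity"] = str(pct)
    return props


if __name__ == "__main__":
    for key, value in queue_properties({"project_a": 60, "project_b": 40}).items():
        print(f"{key}={value}")
```

In an Ambari-managed cluster these values would normally be set through the YARN Queue Manager view rather than by editing the file by hand.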
There are ways to achieve data locality (though non-trivial), and future versions may make this easier.
It might be worth thinking through the requirements and understanding the workloads, user interaction, etc., and then working out whether locking down data to nodes makes sense. Locking down data locality runs somewhat counter to Hadoop's core design (smooth elastic scaling, etc.).