
Data isolation in Hortonworks?

Hey guys,

I'm studying the HDP solution in order to deploy it on a fresh OpenStack install.

Is it possible to compartmentalize data within an HDP cluster? In other words, can confidential data be isolated from common data?

Thanks.


Re: Data isolation in Hortonworks?

Expert Contributor

@faraon clément

Some comments:

- Please read the security guide:

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_security/content/ch_hdp-security-guide-ov...

- You can set up HDFS Encryption Zones, which transparently encrypt all data stored in these pre-defined directories.

- Ranger enables RBAC, ABAC, and column-level access control. It also enables data masking (e.g. PII, financial, or otherwise sensitive data). You often have scenarios where the sensitive data is a subset of fields in a Hive table; these fields can be masked, or access to them blocked. Atlas lets you define tags (e.g. a PII tag) which can be passed automatically to Ranger to define and enforce policies.

- HDFS is the underlying filesystem. You will need to understand which components (Spark, Hive, HBase, Solr, etc.) you will be using, where the sensitive data lives, and how to manage it.

- HWX has a number of partners with sophisticated capabilities, should you require something beyond what HDP provides out of the box (though check the documentation first).
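As a rough sketch of the encryption-zone setup mentioned above (key name and path are hypothetical, and a running Hadoop KMS is assumed):

```shell
# Create an encryption key in the Hadoop KMS (key name is hypothetical)
hadoop key create projectA-key

# Create the (empty) directory that will become the encryption zone
hdfs dfs -mkdir -p /data/projectA

# Turn the directory into an encryption zone backed by that key
hdfs crypto -createZone -keyName projectA-key -path /data/projectA

# Verify the zone was created
hdfs crypto -listZones
```

Any file subsequently written under `/data/projectA` is encrypted transparently; users without access to the key cannot read the data even with raw HDFS access.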


Re: Data isolation in Hortonworks?

Thank you, Graham, for your answer.

It's very interesting and thorough.

I will take some time to read the whole documentation.

Last question: is it possible to choose the DataNodes on which data will be stored? For instance, Project A on DataNodes EAST1 and EAST2, and Project B on DataNodes WEST1 and WEST2?


Re: Data isolation in Hortonworks?

Expert Contributor

@faraon clément

Isolation is usually provided at the logical level by Ranger. If you have multiple tenants (Project A and Project B), trying to manually manage data locality will get very difficult very quickly. Keep in mind there is also a default replication factor of 3 (each block resides on three nodes in the cluster).
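To see why manual placement is hard to manage, you can inspect where HDFS actually put the blocks for a path (the path below is hypothetical):

```shell
# Show which DataNodes hold each block of a file or directory
hdfs fsck /data/projectA -files -blocks -locations

# Change the replication factor for a path (cluster default is 3)
hdfs dfs -setrep -w 2 /data/projectA/small-dataset
```

The `fsck` output typically shows each block replicated across several nodes chosen by the NameNode's placement policy, not by the user.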

The same is true for workloads (i.e. YARN queue management): it is easier to manage logically and assign tenants a percentage of resources than to try to carve up the cluster physically (though node labels can offer some flexibility).
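A minimal sketch of the YARN node-label approach mentioned above (label and host names are hypothetical; per-queue capacities are then assigned in `capacity-scheduler.xml`):

```shell
# Define cluster-level node labels (names are hypothetical)
yarn rmadmin -addToClusterNodeLabels "east,west"

# Attach labels to specific NodeManager hosts
yarn rmadmin -replaceLabelsOnNode "east1-host=east east2-host=east"
yarn rmadmin -replaceLabelsOnNode "west1-host=west west2-host=west"

# Verify the labels are registered
yarn cluster --list-node-labels

# Tenant shares are then configured logically, e.g. in capacity-scheduler.xml:
#   yarn.scheduler.capacity.root.queues = projectA,projectB
#   yarn.scheduler.capacity.root.projectA.capacity = 40
#   yarn.scheduler.capacity.root.projectB.capacity = 60
```

Note this constrains where *compute* runs, not where HDFS stores blocks, which is part of why physical data isolation per tenant is non-trivial.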

There are ways to achieve data locality (though non-trivial), and future versions may make this easier.

It might be worth thinking through the requirements, understanding the workloads, user interaction, etc., and then working out whether locking data down to specific nodes makes sense. Pinning data to nodes runs somewhat counter to Hadoop's core design (smooth elastic scaling, etc.).
