This article contains Questions & Answers on CDP Security and Governance (SDX).


Does SDX provide governance for all the data in the cluster that’s in the cloud?

Yes. Any cluster that has been built with CDP will have governance applied to it, regardless of whether it’s deployed to a public or private cloud.


How do we enable end to end governance on CDP?  Is it different for public cloud and private cloud?

End-to-end governance is enabled by default in both CDP public cloud and private cloud. Under the hood, Apache Ranger and Apache Atlas are wired out of the box into all the processing engines in CDP (e.g. Apache Hive, Impala, NiFi, Spark et al.).


How do you set up SDX?

SDX is installed when you provision an environment, with wire and at-rest encryption preconfigured. Technical metadata management is also set up automatically. Business metadata and data policies must be implemented according to the customer’s context and requirements.


How do you get an SDX license? How much does it cost?

At present, SDX is part of CDP and not licensed separately.


Is the data catalog only available in CDP public cloud?

Today, yes. The Data Catalog capabilities are available in CDP public cloud and can manage data from both Azure and AWS. Data Catalog is on the roadmap for the private cloud. The core lineage and governance capabilities are available in CDP Private Base using Apache Atlas and Ranger directly.


It appears SDX is great for security and data classification. What about data quality?

Customers can leverage both Cloudera’s extensive partner ecosystem and the native CDP capabilities to tailor data quality capabilities to their specific requirements. A recent addition to the Data Catalog on this front is the Open Data Profiling capability.

Any pre-built rules/policies that come with SDX?

We provide common pre-built data profilers that look for common GDPR and PII values and pre-defined tags with Data Catalog. Data governance teams can either leverage these standard capabilities or customize their own to implement their specific data access and governance policies.


Is tagging done automatically?

The profiler has multiple tagging profiles that can be applied automatically. Customers can also modify the existing profilers or create their own, and associate tags based on the profiling criteria. Data stewards can also trigger the execution of specific profilers against specific files.
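To make the idea of a profiler concrete, here is a minimal sketch of pattern-based tagging. It is not the Data Catalog implementation: the tag names, the regular expressions, and the 80% match threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical tag names and patterns, for illustration only;
# they are not the rules shipped with Data Catalog.
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "credit_card": re.compile(r"^\d{4}([ -]?\d{4}){3}$"),
}


def profile_column(values, threshold=0.8):
    """Return the tags whose pattern matches at least `threshold` of the non-empty values."""
    sample = [v for v in values if v]
    if not sample:
        return set()
    tags = set()
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in sample if pattern.match(v))
        if hits / len(sample) >= threshold:
            tags.add(tag)
    return tags


def profile_table(columns):
    """Map each column name to the set of tags suggested for it."""
    return {name: profile_column(values) for name, values in columns.items()}


if __name__ == "__main__":
    table = {
        "contact": ["a@example.com", "b.c@example.org", ""],
        "card": ["4111 1111 1111 1111", "4111-1111-1111-1111"],
        "note": ["hello", "world"],
    }
    print(profile_table(table))
```

A custom profiler in this sketch is just another entry in PATTERNS; the threshold keeps a few stray matches from tagging a whole column.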


Will tagging be migrated from CDH to CDP?

Users upgrading clusters from CDH to CDP Private Base will have Navigator’s properties and managed metadata tags automatically migrated to the equivalent Atlas properties and business metadata tags.


The policies in SDX are great.  What are some other SDX capabilities that I should be aware of?

Audit reports (API driven), asset search, end-to-end lineage, profilers; and soon: ML model registry, ML feature registry, schema registry, NiFi registry – all meta stores are consolidated in one place with SDX.


Where are data privacy policies stored? Are they in a central location? Can you audit them?

All data access and data privacy policies are stored in Ranger, which is part of each CDP deployment. Data Catalog sits on top of Ranger and Atlas and pulls data from both of these repositories. In the public cloud, all the audit logs are stored on either the customer’s AWS S3 object store or Azure’s ADLS.


Can you migrate Sentry policies to Ranger, i.e. from CDH to CDP?

Yes. For migrations, you can use CDP Replication Manager. During in-place upgrades, the Sentry policies are also converted automatically into their Ranger equivalents.


How is the SDX integrated with Kubernetes? Does it use Knox?

All the services within CDP run either directly on virtual machines or within Kubernetes. Currently, Data Hub clusters and the SDX data lake cluster use traditional virtual machines without Kubernetes. However, the cloud-optimized experiences (Cloudera Data Warehouse and Cloudera Machine Learning) are Kubernetes applications that integrate with and use the traditional CDH/HDP distribution components. All of these services are hosted inside your VPC, and access into the VPC is protected and proxied out of the box by Apache Knox.


How are Ranger and Atlas incorporated into CDP?

Ranger and Atlas are open source projects and are incorporated into all of CDP’s services automatically, i.e. they are integrated with all of the CDP components out of the box: Apache Solr collections, HBase tables, Hive/Impala tables, Kafka topics, and NiFi flows.


Can you explain the difference between Atlas and the Data Catalog?

The Cloudera Data Catalog provides a single pane of glass for administration and use of data assets spanning all analytics and deployments. The Data Catalog surfaces and federates the information from the various Atlas instances running in each of the various CDP environments.  

For example, if you have an environment in AWS in Virginia and an Azure environment in the EU, each would have its own Atlas instance, and the Cloudera Data Catalog would talk to both to present information in a single interface. Atlas is effectively the backend, collecting lineage information and helping enforce policy, while the Data Catalog is the front end that data users use to navigate, search for, and steward the data.


How do the managed classifications help with securing data?

The combination of tags/classifications and tag-based data protection policies enables you to restrict access to the data in the tagged assets. When you tag or add managed classifications to assets such as tables or columns in CDP, the associated tag-based data protection policies enforce access and can even mask data. What's more, tags propagate automatically to derived assets, along with the enforcement of tag-based policies.
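As a rough illustration of what propagation to derived assets means, the sketch below walks a lineage graph and pushes tags downstream. It is a simplified model, not how Atlas implements propagation; the asset names and the dictionary-based lineage structure are hypothetical.

```python
from collections import deque


def propagate_tags(lineage, tags):
    """Push each asset's tags to everything derived from it.

    lineage: dict mapping an asset to the assets derived directly from it.
    tags: dict mapping an asset to the set of tags applied to it directly.
    Returns a new dict with the effective tags per asset.
    """
    effective = {asset: set(t) for asset, t in tags.items()}
    queue = deque(tags)
    while queue:
        asset = queue.popleft()
        for child in lineage.get(asset, []):
            child_tags = effective.setdefault(child, set())
            missing = effective[asset] - child_tags
            if missing:
                child_tags |= missing
                queue.append(child)  # revisit so new tags flow further downstream
    return effective


if __name__ == "__main__":
    # Hypothetical assets: a raw table, a staging copy, and a derived mart.
    lineage = {
        "raw.customers": ["stg.customers"],
        "stg.customers": ["mart.orders"],
    }
    tags = {"raw.customers": {"PII"}}
    print(propagate_tags(lineage, tags))
```

Because a tag-based policy is written against the tag rather than the table, every asset that ends up carrying the tag falls under the same policy without any per-table configuration.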


Is it possible to customize the sensitive data profilers?

Yes. See this doc.


How does CDP work with user-authentication?

Good security starts with identity. If you can’t assert who someone is, it doesn’t matter what the policies say they can or cannot do. The CDP Public Cloud platform brings the users’ on-prem identity into the cloud via SAML, a protocol that does not require direct network connectivity to a customer identity provider to assert identity. Users and groups are propagated throughout the entire platform, so access policies can be consistently applied to datasets and access points across the complete CDP platform. This helps enterprise infosec and governance teams centrally manage their employee directory and organizational groupings in their existing identity management system, and ensures security policies are automatically applied in the CDP platform as employees shift around organizations or new members are onboarded.


Is the security data masking and encryption applied to Kudu as well?

Yes.  Data masking can be applied to Kudu through Impala.  Access control and column masking are controlled via centralized Ranger policies. Kudu supports wire encryption for protecting data in motion. Volume encryption can be used for data at rest.


I understand the Data Catalog/tagging feature works in Hive/Impala queries. What if I access my data/table  from within Spark via Python or Scala? Will the permission/tagging rules still apply?

Yes. SDX provides a consistent view of the data. Assuming you access the data through the correct Spark API, you will view the data in the same masked/tagged way.


Is there any row-level security approach?

Yes. Ranger provides row-level security in CDP and allows for filters to be set on data according to users and groups as well as based on the content of fields within the rows.
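Conceptually, a row filter is a predicate attached to a user or group that is applied to every row before results are returned. The sketch below models that idea in plain Python; it does not use Ranger, and the group names, the sample rows, and the first-match and no-match behaviour are assumptions of the sketch.

```python
# Hypothetical, ordered filter rules: (group, predicate over a row dict).
ROW_FILTERS = [
    ("emea_analysts", lambda row: row["region"] == "EMEA"),
    ("us_analysts", lambda row: row["region"] == "US"),
]

ORDERS = [
    {"order_id": 1, "region": "EMEA", "amount": 120},
    {"order_id": 2, "region": "US", "amount": 80},
    {"order_id": 3, "region": "EMEA", "amount": 45},
]


def visible_rows(rows, user_groups):
    """Apply the first filter whose group the user belongs to.

    If no filter matches, the rows are returned unfiltered. That is a choice
    made for this sketch; access policies would still decide whether the
    user may query the table at all.
    """
    for group, predicate in ROW_FILTERS:
        if group in user_groups:
            return [row for row in rows if predicate(row)]
    return list(rows)


if __name__ == "__main__":
    print(visible_rows(ORDERS, {"emea_analysts"}))
```

The predicate here tests a field in the row (the region), which corresponds to filtering on the content of fields; the lookup by group corresponds to filtering by users and groups.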


Would I be able to use the Ranger access control rules/schema in custom applications, i.e. use the same access control rules in my operational systems?

Yes, Ranger has open APIs. Also, the identities and groups used in CDP can be federated with your organization’s identity system (Okta, Microsoft AD, etc.). If your operational systems also use that same identity system, they can base their data security on the same user identities.
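As a starting point for a custom application, the sketch below builds (but does not send) a request for the policies of one Ranger service through Ranger's public v2 REST API. The host name, the service name, and the credentials are placeholders, and basic authentication is an assumption; how you reach and authenticate to Ranger depends on your deployment.

```python
import base64
import urllib.parse
import urllib.request


def build_policy_request(base_url, service_name, user, password):
    """Build a GET request for all policies of one Ranger service.

    Assumes the public v2 endpoint /service/public/v2/api/service/{name}/policy
    and HTTP basic authentication; adjust both for your deployment.
    """
    path = "/service/public/v2/api/service/{}/policy".format(
        urllib.parse.quote(service_name, safe="")
    )
    request = urllib.request.Request(base_url.rstrip("/") + path, method="GET")
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    request.add_header("Accept", "application/json")
    return request


if __name__ == "__main__":
    # Placeholder host, service name, and credentials.
    req = build_policy_request(
        "https://ranger.example.com:6182", "cm_hive", "alice", "secret"
    )
    print(req.get_method(), req.full_url)
    # To execute it: urllib.request.urlopen(req), then json.load() the response.
```

The response is a JSON list of policy objects, which an operational system could read to mirror the same rules rather than maintaining a second copy by hand.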


On the masking features: is the masking applied to the data on the fly when the query is executed? And besides masking, can we apply encryption or hashing to preserve the uniqueness of the data?

Yes, masking is applied to the data on the fly. You can also use all the encryption functions in Hive 3. See this link for more info.
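The difference between masking and hashing matters for the uniqueness question. The sketch below is a plain Python illustration: the function names are hypothetical and are not the Hive built-ins, and the optional salt is an assumption of the sketch.

```python
import hashlib


def mask_show_last_4(value, mask_char="x"):
    """Hide everything except the last four characters."""
    if len(value) <= 4:
        return value
    return mask_char * (len(value) - 4) + value[-4:]


def hash_value(value, salt=""):
    """Deterministic SHA-256 digest: equal inputs always give equal outputs."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a, b = "1234561111", "9999991111"
    # Masking can collapse distinct values into the same output.
    print(mask_show_last_4(a) == mask_show_last_4(b))
    # Hashing keeps distinct values distinct (barring collisions) and is repeatable.
    print(hash_value(a) == hash_value(b), hash_value(a) == hash_value(a))
```

Masked values are easier to read but can no longer be used reliably for joins or distinct counts, whereas hashed values remain usable as join keys without exposing the original.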


Can I use AD or LDAP policies?

CDP supports integration with Identity Management services like AD through SAML.