Member since: 07-30-2019
Posts: 181
Kudos Received: 205
Solutions: 51
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4958 | 10-19-2017 09:11 PM
 | 1591 | 12-27-2016 06:46 PM
 | 1236 | 09-01-2016 08:08 PM
 | 1176 | 08-29-2016 04:40 PM
 | 3011 | 08-24-2016 02:26 PM
08-18-2016
04:51 AM
10 Kudos
Since version 2.6, Apache Hadoop has had the ability to encrypt files that are written to special directories called encryption zones. For this at-rest encryption to work, encryption keys must be managed by a Key Management Service (KMS). Apache Ranger 0.5 introduced a scalable, open source KMS to provide key management for the Hadoop ecosystem. These features have made it easier to implement business- and mission-critical applications on Hadoop where security is a concern. Such applications also bring with them the need for fault tolerance and disaster recovery. Using Apache Falcon, it is easy to configure the copying of data from the production Hadoop cluster to an off-site Disaster Recovery (DR) cluster. But what is the best way to handle the encrypted data? Decrypting and re-encrypting the data to transfer it can hinder performance, yet how do you decrypt data on the DR site without the proper keys from the KMS? In this article, we will look at three different scenarios for managing the encryption keys between two clusters when Ranger KMS is used as the key management infrastructure.

Scenario 1 - Completely Separate KMS Instances

In the first scenario, the Prod cluster has a Ranger KMS instance and the DR cluster has a Ranger KMS instance. Each is completely separate, with no copying of keys. This configuration has some advantages from a security perspective. Since there are two distinct KMS instances, the keys generated for encryption will be different even for the same directory within HDFS. This provides a certain level of protection should the production KMS instance be compromised. The tradeoff is the performance of the data copy. To copy the data in this type of environment, use the DistCp command just as you would in a non-encrypted environment. DistCp will take care of the decrypt/encrypt steps automatically:

ProdCluster:~$ hadoop distcp -update hdfs://ProdCluster:8020/data/encrypted/file1.txt hdfs://DRCluster:8020/data/encrypted/

Scenario 2 - Two KMS Instances, One Database

In this configuration, the Prod and DR clusters each have a separate KMS server, but both servers are configured to use the same database to store the keys. On the Prod cluster, configure the Ranger KMS per the Hadoop Security Guide. Once the KMS database is set up, copy the database configuration to the DR cluster's Ambari config tab, and make sure to turn off the "Setup Database and Database User" option at the bottom of the config page.

Once both KMS instances are set up and working, creating the encryption keys in this environment is simpler. Create the encryption key on the Prod cluster using either the Ranger KMS UI (log in to Ranger as keyadmin) or the CLI:

ProdCluster:~$ hadoop key create ProdKey1

Specify which key to use to encrypt the data directory. On the Prod cluster:

ProdCluster:~$ hdfs crypto -createZone -keyName ProdKey1 -path /data/encrypted

On the DR cluster, use the exact same command (even though it is for the DR cluster):

DRCluster:~$ hdfs crypto -createZone -keyName ProdKey1 -path /data/encrypted

Since both KMS instances use the same keys, the data can be copied using the /.reserved/raw virtual path to avoid decrypting/encrypting the data in transit.
Note that it is important to use the -px flag on DistCp to ensure that the EDEKs (which are stored as extended attributes) are transferred intact:

ProdCluster:~$ hadoop distcp -px hdfs://ProdCluster:8020/.reserved/raw/data/encrypted/file1.txt hdfs://DRCluster:8020/.reserved/raw/data/encrypted/
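As a quick sanity check (my addition, not part of the original procedure), you can confirm from the DR side that the shared key database and the raw copy worked; the key name ProdKey1 and file1.txt below simply follow the examples above:

# The DR KMS reads the same database, so the key created on Prod should already be visible here
DRCluster:~$ hadoop key list
# The encryption zone on the DR cluster should report ProdKey1
DRCluster:~$ hdfs crypto -listZones
# Reading through the normal (non-raw) path transparently decrypts the copied file
DRCluster:~$ hdfs dfs -cat /data/encrypted/file1.txt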
Scenario 3 - Two KMS Instances, Two Databases

In this configuration, the Prod and DR clusters each have a separate KMS server, and each has its own database store. In this scenario it is necessary to copy the keys from the Prod KMS database to the DR KMS database. The Prod and DR KMS instances are set up separately per the Hadoop Security Guide. The keys for the encryption zones are created on the Prod cluster (the same as in Scenario 2):

ProdCluster:~$ hadoop key create ProdKey1

Specify which key to use to encrypt the data directory on the Prod cluster:

ProdCluster:~$ hdfs crypto -createZone -keyName ProdKey1 -path /data/encrypted
Once the keys are created on the Prod cluster, a script is used to export the keys so they can be copied to the DR cluster. On the node where the KMS Server runs, execute the following:

ProdCluster:~# cd /usr/hdp/current/ranger-kms
ProdCluster:~# ./exportKeysToJCEKS.sh ProdCluster.keystore
Enter Password for the keystore FILE :
Enter Password for the KEY(s) stored in the keystore:
Keys from Ranger KMS Database has been successfully exported into ProdCluster.keystore
Now, the password-protected keystore can be securely copied to the DR cluster node where the KMS Server runs:

ProdCluster:~# scp ProdCluster.keystore DRCluster:/usr/hdp/current/ranger-kms/

Next, import the keys into the Ranger KMS database on the DR cluster. On the Ranger KMS node in the DR cluster, execute the following:

DRCluster:~# cd /usr/hdp/current/ranger-kms
DRCluster:~# ./importJCEKSKeys.sh ProdCluster.keystore jceks
Enter Password for the keystore FILE :
Enter Password for the KEY(s) stored in the keystore:
Keys from ProdCluster.keystore has been successfully exported into RangerDB
The last step is to create the encryption zone on the DR cluster and specify which key to use for encryption:

DRCluster:~$ hdfs crypto -createZone -keyName ProdKey1 -path /data/encrypted

Now data can be copied using the /.reserved/raw virtual path to avoid the decryption/encryption steps between the clusters:

ProdCluster:~$ hadoop distcp -px hdfs://ProdCluster:8020/.reserved/raw/data/encrypted/file1.txt hdfs://DRCluster:8020/.reserved/raw/data/encrypted/

Please note that the key copy procedure will need to be repeated when new keys are created or when keys are rotated within the KMS.
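Because the export/import has to be repeated after key creation or rotation, a simple way to spot drift is to compare the key names on the two clusters. This is only a rough sketch of one approach; it assumes SSH access from the Prod KMS node to the DR cluster and that the hadoop key list header line is identical on both sides:

# List key names on each cluster, then diff: keys present only on Prod still need to be exported to DR
ProdCluster:~$ hadoop key list | sort > /tmp/prod_keys.txt
ProdCluster:~$ ssh DRCluster "hadoop key list | sort" > /tmp/dr_keys.txt
ProdCluster:~$ diff /tmp/prod_keys.txt /tmp/dr_keys.txt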
07-14-2016
07:43 PM
@Kaliyug Antagonist You will typically need to do some configuration on the views to make them work properly. In a secured cluster, you have to specify all of the parameters for connecting to the particular service instead of using the "Local Cluster" configuration drop down. The Ambari Views Documentation contains instructions for configuring all of the various views.
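As a small illustration (using the Files view as an example; the host below is a placeholder and property names can vary between Ambari versions, so verify them against the Ambari Views Documentation), manual configuration means filling in the service endpoints yourself rather than picking the Local Cluster option:

# Hypothetical Files view instance properties when not using the Local Cluster option
webhdfs.url=webhdfs://namenode.example.com:50070
webhdfs.username=${username}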
07-11-2016
06:46 PM
Hi @khaja pasha shaik Current HiveDR functionality doesn't support SSL. Thanks Juan
07-26-2016
08:01 AM
@Kaliyug Antagonist We've found another neat solution to this, using a resource path of the form "/user/${id}". Credit to Naveed Hussain, who found it after we moaned a lot about the alternatives. Screenshot attached (ranger-home-directory-policy.png).
06-27-2016
01:13 PM
I had no issues using Anaconda as my development python on my production cluster. Just be sure to install it in a separate location and don't overwrite the standard OS install of Python.
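If it helps, here is a rough sketch of the kind of setup I mean (the install prefix, installer file name, and Spark variable are just examples, not a prescribed layout): install Anaconda under its own prefix, leave the OS Python alone, and point only the jobs that need it at the new interpreter.

# Install Anaconda to its own prefix in batch mode (installer file name is illustrative)
bash Anaconda2-4.0.0-Linux-x86_64.sh -b -p /opt/anaconda
# Do NOT add /opt/anaconda/bin to the system-wide PATH or symlink over /usr/bin/python
# Point PySpark at the Anaconda interpreter only where you need it
export PYSPARK_PYTHON=/opt/anaconda/bin/python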
06-08-2016
02:58 PM
Thanks @emaxwell. I hope this will help most of us, especially the ones using MS AD KDC. Regards
Mayank
05-17-2016
01:05 PM
8 Kudos
Virtual memory swapping can have a large impact on the performance of a Hadoop system. Because of the memory requirements of YARN containers and the processes running on the nodes in a cluster, swapping processes out of memory to disk can cause serious performance limitations. As such, the historical recommendation for swappiness (the propensity to swap a process out) on a Hadoop system has been to disable swap altogether. With newer versions of the Linux kernel, however, a swappiness of 0 makes Out Of Memory (OOM) situations more likely, in which important processes may be killed indiscriminately to reclaim valuable physical memory. To prevent the system from swapping processes too frequently, while still allowing for emergency swapping (instead of killing processes), the recommendation is now to set swappiness to 1 on Linux systems. This still allows swapping, but with the least possible aggressiveness (for comparison, the default value for swappiness is 60).

To change the swappiness on a running machine, use the following command:

echo "1" > /proc/sys/vm/swappiness

To ensure the swappiness is set appropriately on reboot, use the following command:

echo "vm.swappiness=1" >> /etc/sysctl.conf
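For reference, here is a quick way to check the current value and to apply the same change via sysctl without waiting for a reboot:

# Check the current swappiness value
cat /proc/sys/vm/swappiness
# Equivalent way to set it immediately via sysctl
sysctl -w vm.swappiness=1
# Re-read /etc/sysctl.conf so the persisted value takes effect now
sysctl -p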
05-15-2016
05:07 PM
@ida ida There are a couple of ways to accomplish this; I'd recommend starting with Sqoop. It is a tool designed specifically to extract data from an RDBMS and load it into Hadoop. This tutorial should help you get started.
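To give you an idea of what that looks like, here is a minimal Sqoop invocation (the JDBC URL, credentials, and table name are placeholders for your own database):

# Pull a single table from MySQL into HDFS as text files (connection details are examples only)
sqoop import \
  --connect jdbc:mysql://dbhost.example.com:3306/sales \
  --username sqoop_user -P \
  --table customers \
  --target-dir /data/sales/customers \
  --num-mappers 4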
05-04-2016
04:31 PM
Thank you. I logged in to HiveServer2, but under the audit spool directory I see the listing below. Can I consider these as audit logs?

/var/log/hive/audit/hdfs/spool # ll -lrt
total 16
-rw-r--r-- 1 hive hadoop 0 Apr 28 02:45 index_batch_batch.hdfs_hiveServer2_closed.json
drwxr-xr-x 2 hive hadoop 4096 Apr 28 02:45 archive
-rw-r--r-- 1 hive hadoop 458 Apr 28 03:26 spool_hiveServer2_20160428-0326.52.log
-rw-r--r-- 1 hive hadoop 455 Apr 28 03:29 spool_hiveServer2_20160428-0329.02.log
-rw-r--r-- 1 hive hadoop 599 Apr 28 03:29 index_batch_batch.hdfs_hiveServer2.json