Member since: 07-30-2019
Posts: 181
Kudos Received: 205
Solutions: 51
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4958 | 10-19-2017 09:11 PM
 | 1591 | 12-27-2016 06:46 PM
 | 1236 | 09-01-2016 08:08 PM
 | 1176 | 08-29-2016 04:40 PM
 | 3011 | 08-24-2016 02:26 PM
08-18-2016
04:51 AM
10 Kudos
Since version 2.6, Apache Hadoop has had the ability to encrypt files that are written to special directories called encryption zones. For this at-rest encryption to work, encryption keys must be managed by a Key Management Service (KMS). Apache Ranger 0.5 introduced a scalable, open source KMS to provide key management for the Hadoop ecosystem. These features have made it easier to implement business- and mission-critical applications on Hadoop where security is a concern. Such applications also bring with them the need for fault tolerance and disaster recovery. Using Apache Falcon, it is easy to configure the copying of data from the production Hadoop cluster to an off-site Disaster Recovery (DR) cluster. But what is the best way to handle the encrypted data? Decrypting and re-encrypting the data to transfer it can hinder performance, yet how do you decrypt data on the DR site without the proper keys from the KMS? In this article, we will look at three different scenarios for managing the encryption keys between two clusters when Ranger KMS is used as the key management infrastructure.

Scenario 1 - Completely Separate KMS Instances

In the first scenario, the Prod cluster has a Ranger KMS instance and the DR cluster has a Ranger KMS instance. Each is completely separate, with no copying of keys. This configuration has some advantages from a security perspective. Since there are two distinct KMS instances, the keys generated for encryption will be different even for the same directory within HDFS. This provides a certain level of protection should the production KMS instance be compromised. The tradeoff is the performance of the data copy. To copy the data in this type of environment, use the DistCp command just as you would in a non-encrypted environment. DistCp will take care of the decrypt/encrypt steps automatically:

ProdCluster:~$ hadoop distcp -update hdfs://ProdCluster:8020/data/encrypted/file1.txt hdfs://DRCluster:8020/data/encrypted/

Scenario 2 - Two KMS Instances, One Database

In this configuration, the Prod and DR clusters each have a separate KMS server, but both servers are configured to use the same database to store the keys. On the Prod cluster, configure the Ranger KMS per the Hadoop Security Guide. Once the KMS database is set up, copy the database configuration to the DR cluster's Ambari config tab, and make sure to turn off the "Setup Database and Database User" option at the bottom of the config page.

Once both KMS instances are set up and working, creating the encryption keys in this environment is simpler. Create the encryption key on the Prod cluster using either the Ranger KMS UI (log in to Ranger as keyadmin) or the CLI:

ProdCluster:~$ hadoop key create ProdKey1

Specify which key to use to encrypt the data directory. On the Prod cluster:

ProdCluster:~$ hdfs crypto -createZone -keyName ProdKey1 -path /data/encrypted

On the DR cluster, use the exact same command (even though it is for the DR cluster):

DRCluster:~$ hdfs crypto -createZone -keyName ProdKey1 -path /data/encrypted

Since both KMS instances use the same keys, the data can be copied using the /.reserved/raw virtual path to avoid decrypting/encrypting the data in transit.
Note that it is important to use the -px flag on DistCp to ensure that the EDEKs (which are stored as extended attributes) are transferred intact:

ProdCluster:~$ hadoop distcp -px hdfs://ProdCluster:8020/.reserved/raw/data/encrypted/file1.txt hdfs://DRCluster:8020/.reserved/raw/data/encrypted/
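As a quick sanity check (my addition, not part of the original procedure), you can confirm from the DR side that the shared key database and the raw copy worked; the key name ProdKey1 and file1.txt below simply follow the examples above:

# The DR KMS reads the same database, so the key created on Prod should already be visible here
DRCluster:~$ hadoop key list
# The encryption zone on the DR cluster should report ProdKey1
DRCluster:~$ hdfs crypto -listZones
# Reading through the normal (non-raw) path transparently decrypts the copied file
DRCluster:~$ hdfs dfs -cat /data/encrypted/file1.txt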
Scenario 3 - Two KMS Instances, Two Databases

In this configuration, the Prod and DR clusters each have a separate KMS server, and each has its own database store. In this scenario it is necessary to copy the keys from the Prod KMS database to the DR KMS database. The Prod and DR KMS instances are set up separately per the Hadoop Security Guide. The keys for the encryption zones are created on the Prod cluster (the same as in Scenario 2):

ProdCluster:~$ hadoop key create ProdKey1

Specify which key to use to encrypt the data directory on the Prod cluster:

ProdCluster:~$ hdfs crypto -createZone -keyName ProdKey1 -path /data/encrypted
Once the keys are created on the Prod cluster, a script is used to export the keys so they can be copied to the DR cluster. On the node where the KMS Server runs, execute the following:

ProdCluster:~# cd /usr/hdp/current/ranger-kms
ProdCluster:~# ./exportKeysToJCEKS.sh ProdCluster.keystore
Enter Password for the keystore FILE :
Enter Password for the KEY(s) stored in the keystore:
Keys from Ranger KMS Database has been successfully exported into ProdCluster.keystore
Now, the password-protected keystore can be securely copied to the DR cluster node where the KMS Server runs:

ProdCluster:~# scp ProdCluster.keystore DRCluster:/usr/hdp/current/ranger-kms/

Next, import the keys into the Ranger KMS database on the DR cluster. On the Ranger KMS node in the DR cluster, execute the following:

DRCluster:~# cd /usr/hdp/current/ranger-kms
DRCluster:~# ./importJCEKSKeys.sh ProdCluster.keystore jceks
Enter Password for the keystore FILE :
Enter Password for the KEY(s) stored in the keystore:
Keys from ProdCluster.keystore has been successfully exported into RangerDB
The last step is to create the encryption zone on the DR cluster and specify which key to use for encryption:

DRCluster:~$ hdfs crypto -createZone -keyName ProdKey1 -path /data/encrypted

Now data can be copied using the /.reserved/raw virtual path to avoid the decryption/encryption steps between the clusters:

ProdCluster:~$ hadoop distcp -px hdfs://ProdCluster:8020/.reserved/raw/data/encrypted/file1.txt hdfs://DRCluster:8020/.reserved/raw/data/encrypted/

Please note that the key copy procedure will need to be repeated when new keys are created or when keys are rotated within the KMS.
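Because the export/import has to be repeated after key creation or rotation, a simple way to spot drift is to compare the key names on the two clusters. This is only a rough sketch of one approach; it assumes SSH access from the Prod KMS node to the DR cluster and that the hadoop key list header line is identical on both sides:

# List key names on each cluster, then diff: keys present only on Prod still need to be exported to DR
ProdCluster:~$ hadoop key list | sort > /tmp/prod_keys.txt
ProdCluster:~$ ssh DRCluster "hadoop key list | sort" > /tmp/dr_keys.txt
ProdCluster:~$ diff /tmp/prod_keys.txt /tmp/dr_keys.txt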
07-14-2016
07:43 PM
@Kaliyug Antagonist You will typically need to do some configuration on the views to make them work properly. In a secured cluster, you have to specify all of the parameters for connecting to the particular service instead of using the "Local Cluster" configuration drop down. The Ambari Views Documentation contains instructions for configuring all of the various views.
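As a small illustration (using the Files view as an example; the host below is a placeholder and property names can vary between Ambari versions, so verify them against the Ambari Views Documentation), manual configuration means filling in the service endpoints yourself rather than picking the Local Cluster option:

# Hypothetical Files view instance properties when not using the Local Cluster option
webhdfs.url=webhdfs://namenode.example.com:50070
webhdfs.username=${username}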
07-11-2016
06:46 PM
Hi @khaja pasha shaik Current HiveDR functionality doesn't support SSL. Thanks Juan
07-26-2016
08:01 AM
@Kaliyug Antagonist We've found another neat solution to this, using a resource path of the form "/user/${id}". Credit to Naveed Hussain, who found it after we moaned a lot about the alternatives. Screenshot attached (ranger-home-directory-policy.png).
06-27-2016
01:13 PM
I had no issues using Anaconda as my development python on my production cluster. Just be sure to install it in a separate location and don't overwrite the standard OS install of Python.
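If it helps, here is a rough sketch of the kind of setup I mean (the install prefix, installer file name, and Spark variable are just examples, not a prescribed layout): install Anaconda under its own prefix, leave the OS Python alone, and point only the jobs that need it at the new interpreter.

# Install Anaconda to its own prefix in batch mode (installer file name is illustrative)
bash Anaconda2-4.0.0-Linux-x86_64.sh -b -p /opt/anaconda
# Do NOT add /opt/anaconda/bin to the system-wide PATH or symlink over /usr/bin/python
# Point PySpark at the Anaconda interpreter only where you need it
export PYSPARK_PYTHON=/opt/anaconda/bin/python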
06-08-2016
02:58 PM
Thanks @emaxwell. I hope this will help most of us, especially the ones using MS AD KDC. Regards
Mayank
05-17-2016
01:05 PM
8 Kudos
Virtual memory swapping can have a large impact on the performance of a Hadoop system. Because of the memory requirements of YARN containers and the processes running on the nodes in a cluster, swapping processes out of memory to disk can cause serious performance limitations. As such, the historical recommendation for swappiness (the propensity to swap a process out) on a Hadoop system has been to disable swap altogether. With newer versions of the Linux kernel, however, a swappiness of 0 makes Out Of Memory (OOM) situations more likely, in which important processes may be killed indiscriminately to reclaim valuable physical memory. To prevent the system from swapping processes too frequently, while still allowing for emergency swapping (instead of killing processes), the recommendation is now to set swappiness to 1 on Linux systems. This still allows swapping, but with the least possible aggressiveness (for comparison, the default value for swappiness is 60).

To change the swappiness on a running machine, use the following command:

echo "1" > /proc/sys/vm/swappiness

To ensure the swappiness is set appropriately on reboot, use the following command:

echo "vm.swappiness=1" >> /etc/sysctl.conf
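For reference, here is a quick way to check the current value and to apply the same change via sysctl without waiting for a reboot:

# Check the current swappiness value
cat /proc/sys/vm/swappiness
# Equivalent way to set it immediately via sysctl
sysctl -w vm.swappiness=1
# Re-read /etc/sysctl.conf so the persisted value takes effect now
sysctl -p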
05-15-2016
05:07 PM
@ida ida There are a couple of ways to accomplish this; I'd recommend starting with Sqoop. It is a tool designed specifically to extract data from an RDBMS and load it into Hadoop. This tutorial should help you get started.
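To give you an idea of what that looks like, here is a minimal Sqoop invocation (the JDBC URL, credentials, and table name are placeholders for your own database):

# Pull a single table from MySQL into HDFS as text files (connection details are examples only)
sqoop import \
  --connect jdbc:mysql://dbhost.example.com:3306/sales \
  --username sqoop_user -P \
  --table customers \
  --target-dir /data/sales/customers \
  --num-mappers 4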
05-04-2016
04:31 PM
Thank you. I logged in to HiveServer2, but under the audit spool directory I see the listing below. Can I consider these as audit logs?

/var/log/hive/audit/hdfs/spool # ll -lrt
total 16
-rw-r--r-- 1 hive hadoop 0 Apr 28 02:45 index_batch_batch.hdfs_hiveServer2_closed.json
drwxr-xr-x 2 hive hadoop 4096 Apr 28 02:45 archive
-rw-r--r-- 1 hive hadoop 458 Apr 28 03:26 spool_hiveServer2_20160428-0326.52.log
-rw-r--r-- 1 hive hadoop 455 Apr 28 03:29 spool_hiveServer2_20160428-0329.02.log
-rw-r--r-- 1 hive hadoop 599 Apr 28 03:29 index_batch_batch.hdfs_hiveServer2.json