Member since: 07-30-2019
Posts: 181
Kudos Received: 205
Solutions: 51
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4958 | 10-19-2017 09:11 PM
 | 1589 | 12-27-2016 06:46 PM
 | 1236 | 09-01-2016 08:08 PM
 | 1175 | 08-29-2016 04:40 PM
 | 3011 | 08-24-2016 02:26 PM
08-18-2016
04:51 AM
10 Kudos
Since version 2.6, Apache Hadoop has had the ability to encrypt files written to special directories called encryption zones. For this at-rest encryption to work, encryption keys need to be managed by a Key Management Service (KMS). Apache Ranger 0.5 provided a scalable, open source KMS for key management in the Hadoop ecosystem. These features have made it easier to implement business- and mission-critical applications on Hadoop where security is a concern, and those applications bring with them the need for fault tolerance and disaster recovery. Using Apache Falcon, it is easy to configure copying data from the production Hadoop cluster to an off-site Disaster Recovery (DR) cluster. But what is the best way to handle the encrypted data? Decrypting and re-encrypting the data to transfer it can hinder performance, yet how do you decrypt data on the DR site without the proper keys from the KMS? In this article, we will look at three different scenarios for managing encryption keys between two clusters when Ranger KMS is used as the key management infrastructure.

Scenario 1 - Completely Separate KMS Instances

In the first scenario, the Prod cluster has a Ranger KMS instance and the DR cluster has a Ranger KMS instance. Each is completely separate, with no copying of keys. This configuration has some advantages from a security perspective: since there are two distinct KMS instances, the keys generated for encryption will be different even for the same directory within HDFS, which provides a certain level of protection should the production KMS instance be compromised. The tradeoff is the performance of the data copy. To copy data in this type of environment, use the DistCp command just as you would in a non-encrypted environment; DistCp will take care of the decrypt/encrypt steps automatically:

ProdCluster:~$ hadoop distcp -update hdfs://ProdCluster:8020/data/encrypted/file1.txt hdfs://DRCluster:8020/data/encrypted/

Scenario 2 - Two KMS Instances, One Database

In this configuration, the Prod and DR clusters each have a separate KMS Server, but both KMS Servers are configured to use the same database to store the keys. On the Prod cluster, configure the Ranger KMS per the Hadoop Security Guide. Once the KMS database is set up, copy the database configuration into the DR cluster's Ambari config tab, and make sure to turn off the "Setup Database and Database User" option at the bottom of the config page.

Once both KMS instances are set up and working, creating the encryption keys in this environment is simpler. Create the encryption key on the Prod cluster using either the Ranger KMS UI (log in to Ranger as keyadmin) or the CLI:

ProdCluster:~$ hadoop key create ProdKey1

Specify which key to use to encrypt the data directory. On the Prod cluster:

ProdCluster:~$ hdfs crypto -createZone -keyName ProdKey1 -path /data/encrypted

On the DR cluster, use the exact same command (even though it is for the DR cluster):

DRCluster:~$ hdfs crypto -createZone -keyName ProdKey1 -path /data/encrypted

Since both KMS instances use the same keys, the data can be copied using the /.reserved/raw virtual path to avoid decrypting/encrypting the data in transit.
Note that it is important to use the -px flag with DistCp to ensure that the EDEKs (which are stored as extended attributes) are transferred intact:

ProdCluster:~$ hadoop distcp -px hdfs://ProdCluster:8020/.reserved/raw/data/encrypted/file1.txt hdfs://DRCluster:8020/.reserved/raw/data/encrypted/
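Before running the copy, it can be worth confirming that both clusters resolve the same key and zone. This is a minimal check, assuming the ProdKey1 key and /data/encrypted zone used above; the listZones command must be run as the HDFS superuser:

# On the Prod cluster: confirm the key exists and the zone is bound to it
ProdCluster:~$ hadoop key list
ProdCluster:~$ hdfs crypto -listZones

# On the DR cluster: the same key name and zone should appear,
# since both KMS Servers read from the shared database
DRCluster:~$ hadoop key list
DRCluster:~$ hdfs crypto -listZones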
Scenario 3 - Two KMS Instances, Two Databases

In this configuration, the Prod and DR clusters each have a separate KMS Server, and each has its own database store. In this scenario it is necessary to copy the keys from the Prod KMS database to the DR KMS database. The Prod and DR KMS instances are set up separately per the Hadoop Security Guide. The keys for the encryption zones are created on the Prod cluster (the same as in Scenario 2):

ProdCluster:~$ hadoop key create ProdKey1

Specify which key to use to encrypt the data directory on the Prod cluster:

ProdCluster:~$ hdfs crypto -createZone -keyName ProdKey1 -path /data/encrypted
Once the keys are created on the Prod cluster, a script is used to export the keys so they can be copied to the DR cluster. On the node where the KMS Server runs, execute the following:

ProdCluster:~# cd /usr/hdp/current/ranger-kms
ProdCluster:~# ./exportKeysToJCEKS.sh ProdCluster.keystore
Enter Password for the keystore FILE :
Enter Password for the KEY(s) stored in the keystore:
Keys from Ranger KMS Database has been successfully exported into ProdCluster.keystore
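If you want to double-check what was exported before copying it, the keystore can be inspected with the standard Java keytool (this step is optional and assumes keytool is on the path):

ProdCluster:~# keytool -list -storetype jceks -keystore ProdCluster.keystore
# Enter the keystore password used during the export; the listing should
# include an entry for ProdKey1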
Now the password-protected keystore can be securely copied to the DR cluster node where the KMS Server runs:

ProdCluster:~# scp ProdCluster.keystore DRCluster:/usr/hdp/current/ranger-kms/

Next, import the keys into the Ranger KMS database on the DR cluster. On the Ranger KMS node in the DR cluster, execute the following:

DRCluster:~# cd /usr/hdp/current/ranger-kms
DRCluster:~# ./importJCEKSKeys.sh ProdCluster.keystore jceks
Enter Password for the keystore FILE :
Enter Password for the KEY(s) stored in the keystore:
Keys from ProdCluster.keystore has been successfully exported into RangerDB
The last step is to create the encryption zone on the DR cluster and specify which key to use for encryption:

DRCluster:~$ hdfs crypto -createZone -keyName ProdKey1 -path /data/encrypted

Now data can be copied using the /.reserved/raw virtual path to avoid the decryption/encryption steps between the clusters:

ProdCluster:~$ hadoop distcp -px hdfs://ProdCluster:8020/.reserved/raw/data/encrypted/file1.txt hdfs://DRCluster:8020/.reserved/raw/data/encrypted/

Please note that the key copy procedure will need to be repeated whenever new keys are created or keys are rotated within the KMS.
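Because that export/copy/import cycle has to be repeated whenever keys are created or rolled, it can be convenient to wrap the steps above into a small script. The sketch below simply chains the commands already shown; the keystore name, DR host, and the assumption that passwords are entered interactively are illustrative, so adjust them for your environment:

#!/bin/bash
# Hypothetical helper to re-sync Ranger KMS keys from Prod to DR.
# Run on the Prod Ranger KMS node; prompts for the keystore and key passwords.
set -e

KEYSTORE=ProdCluster.keystore           # keystore name used in the examples above
RANGER_KMS_DIR=/usr/hdp/current/ranger-kms
DR_HOST=DRCluster                       # DR Ranger KMS node

cd "$RANGER_KMS_DIR"
./exportKeysToJCEKS.sh "$KEYSTORE"      # prompts for keystore and key passwords

scp "$KEYSTORE" "$DR_HOST:$RANGER_KMS_DIR/"

# Import on the DR side (prompts for the same passwords)
ssh -t "$DR_HOST" "cd $RANGER_KMS_DIR && ./importJCEKSKeys.sh $KEYSTORE jceks"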
08-17-2016
05:22 PM
4 Kudos
@Randy Gelhausen You can set the "Remote Owner" property to the user you want to own the files in HDFS, and you can set "Remote Group" as well. Both are processor-level settings that do not support Expression Language, so they apply to everything written by that processor. You could use a RouteOnAttribute processor to determine which user should own the files in HDFS and route the flow to the appropriate PutHDFS processor, but that will be more cumbersome than distributing keytabs to the users. In a secure environment, the users would likely need their keytabs to write to HDFS anyway, since you have to authenticate somehow and there is currently no way to pass a Kerberos ticket to NiFi.
08-17-2016
04:16 PM
2 Kudos
@Smart Solutions As @Michael Young stated, Zeppelin in HDP 2.4 is Tech Preview. Zeppelin goes GA with the upcoming HDP 2.5 release (due out soon) and includes Kerberos integration. A full list of features is not available yet (it's still in the hands of the devs), but it should be available soon.
08-15-2016
02:43 PM
@jovan karamacoski You can enable multiple tiers of storage and specify where files should be stored in order to control data placement. Check out the following link: http://hortonworks.com/blog/heterogeneous-storages-hdfs/ If you really need to control which nodes the data goes to as well, you can configure the faster storage tier only on the faster nodes. This is not recommended because it will lead to an imbalance on the cluster, but it is possible to do.
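As a rough illustration of the tiering approach, HDFS storage policies let you pin a path to a class of storage once the DataNode data directories are tagged with types such as DISK or SSD. The path and policy below are just examples:

# Tag a directory with a storage policy (assumes dfs.datanode.data.dir
# entries are labeled with storage types such as [SSD] or [DISK])
hdfs storagepolicies -setStoragePolicy -path /data/hot -policy ALL_SSD

# Confirm the policy, then move existing blocks to satisfy it
hdfs storagepolicies -getStoragePolicy -path /data/hot
hdfs mover -p /data/hot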
08-15-2016
02:18 PM
1 Kudo
@mkataria As of HDP 2.4, Zeppelin is only in Tech Preview and does not support Kerberos. Integration with Kerberos for Zeppelin will be available in the upcoming HDP 2.5 release due out very soon.
08-15-2016
01:43 PM
@Jason Hue HDFS only allows one write or append to a file at a time. Allowing concurrent writes would mean that the order of bytes in the file would be non-deterministic. Even if two appends wrote to different blocks, how would you determine which block comes first in the file? It's a difficult problem to solve, and it has not been solved in HDFS.
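If you want to see the single-writer behavior for yourself, one quick (illustrative) test is to start two appends against the same file; the second client is refused while the first holds the lease:

# Terminal 1: start a long-running append (reads from stdin and holds the file lease)
hdfs dfs -appendToFile - /tmp/append-test.txt

# Terminal 2: attempt a concurrent append to the same file
echo "second writer" | hdfs dfs -appendToFile - /tmp/append-test.txt
# expected to fail with a lease-related error (e.g. AlreadyBeingCreatedException)
# until the first writer closes the file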
08-11-2016
08:51 PM
1 Kudo
@Sunile Manjee One way to accomplish this would be to change the permissions on the hive executable to remove read and execute access for group and other:

chmod 400 /usr/hdp/current/hive-client/bin/hive
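A quick way to check the effect (using a hypothetical test account) is to confirm the new mode and then try to launch the CLI as a non-root user:

ls -l /usr/hdp/current/hive-client/bin/hive
# expected: -r-------- 1 root root ...

sudo -u testuser /usr/hdp/current/hive-client/bin/hive
# expected to fail with "Permission denied", since group/other access was removed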
08-11-2016
08:17 PM
@mohamed sabri marnaoui Are you running the Ambari agent as a non-root user? If so, make sure that your sudoers file is correct per this documentation: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_Security_Guide/content/_sudoer_configuration.html
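After editing the sudoers configuration, you can sanity-check it before retrying the agent. The commands below assume the agent runs as a user named ambari; substitute your actual agent user:

visudo -c            # validates /etc/sudoers and included files for syntax errors
sudo -l -U ambari    # lists the commands the ambari user is allowed to run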
08-11-2016
08:11 PM
An HDFS rebalance should optimize how blocks are distributed across the DataNodes in your cluster. Is there a particular reason why you want to manually control where the replicas are stored?
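For reference, the rebalance is typically kicked off with the HDFS balancer, where the threshold is the allowed deviation (in percent) of each DataNode's utilization from the cluster average:

# Rebalance until every DataNode is within 10% of the average utilization
hdfs balancer -threshold 10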
08-11-2016
07:44 PM
1 Kudo
@Sunile Manjee As @SBandaru states, you will need to make sure that proper group membership is maintained for the non-standard users. If you specify the users at cluster creation time, Ambari will take care of this for you; if you create them after the fact, you will need to verify group membership yourself. You may also need to modify the auth_to_local rules if the non-standard users are in AD/LDAP and you need to map them to local users. Another thing to consider is whether you run the Ambari agent as non-root. There are a number of sudo rules that need to be put in place for the ambari user to allow execution of commands as the various service accounts (for starting/stopping services, installing packages, etc.). You'll need to modify the customizable users sudo entry to suit your environment.
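If you do adjust the auth_to_local rules, you can check how a given principal will be mapped by running Hadoop's mapping utility directly (the principal and realm below are just examples):

hadoop org.apache.hadoop.security.HadoopKerberosName hdfs-mycluster@EXAMPLE.COM
# prints the local user name the current auth_to_local rules resolve to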