Created 10-08-2018 07:47 PM
Hello All,
We are planning to implement HDFS "Data at Rest" encryption (Ranger KMS) on our data lake and have a few questions before we do. Can anyone help with these questions?
Thanks in advance
Created 10-08-2018 09:27 PM
HDFS "Data at Rest" Encryption
Hadoop provides several ways to encrypt stored data.
One of them, HDFS data at rest encryption, uses specially designated HDFS directories known as "encryption zones." An encryption zone is simply a special HDFS directory within which all data is encrypted upon write and decrypted upon read.
You can have multiple encryption zones. With this configuration, you can use encrypted databases or tables with different encryption keys. To read data from read-only encrypted tables, users must have access to a temporary directory that is encrypted at least as strongly as the table.
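To make the "multiple zones, different keys" idea concrete, here is a minimal Python sketch that only builds the command lines you would typically run; it does not execute anything. The paths and key names are hypothetical, and it assumes the KMS provider is already configured and that your user is allowed to create keys from the CLI (with Ranger KMS, keys are often created in the Ranger UI instead).

```python
def encryption_zone_commands(zones):
    """Build the CLI commands that create one encryption zone per directory.

    `zones` maps an HDFS directory to the name of the key protecting it.
    Each key is created only once, even if several zones share it.
    """
    commands = []
    created_keys = set()
    for path, key_name in zones.items():
        if key_name not in created_keys:
            # Create the key in the KMS (Ranger KMS in this thread).
            commands.append(["hadoop", "key", "create", key_name])
            created_keys.add(key_name)
        # A directory must exist and be empty before it can become a zone.
        commands.append(["hdfs", "dfs", "-mkdir", "-p", path])
        commands.append(
            ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", path]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical directories and key names, for illustration only.
    example = {"/data/finance": "finance_key", "/data/hr": "hr_key"}
    for cmd in encryption_zone_commands(example):
        print(" ".join(cmd))
```

Once the zones exist, `hdfs crypto -listZones` (run as the HDFS superuser) shows each zone together with its key name.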
HDFS encryption provides good performance, and existing Hadoop applications can run transparently on encrypted data. For cloud data access, server-side encryption slightly slows down performance when reading data from S3, both when reading data during the execution of a query and when scanning the files prior to the actual scheduling of work.
You can run two Hadoop performance tests, TestDFSIO and TeraSort, to measure performance in different encryption zones. TestDFSIO is more storage I/O- and throughput-focused, while TeraSort is representative of a workload that is not only I/O- but also CPU-intensive. Both of these tests use the Hadoop Distributed File System (HDFS). These tests were run to compare encrypted data in different configurations, but the results also depend on your hardware; for example, using Xeon E5-2699 v3 processors instead of Xeon E5-2697 v2 processors results in a significant increase in performance in the test scenarios.
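As a rough sketch of how such a comparison could be set up, the Python below builds the TeraGen, TeraSort and TeraValidate command lines for a given base directory, so the same run can be pointed at an encryption zone and at an unencrypted directory. The jar location and both directories are assumptions; the examples jar lives in a different place on each distribution. TestDFSIO is left out here because its jar name and flags differ between Hadoop versions.

```python
def terasort_commands(examples_jar, base_dir, rows):
    """Build a TeraGen -> TeraSort -> TeraValidate run under `base_dir`.

    `examples_jar` is the path to the hadoop-mapreduce-examples jar on
    your cluster; `rows` is the number of 100-byte rows TeraGen writes.
    """
    generated = f"{base_dir}/teragen"
    sorted_out = f"{base_dir}/terasort"
    report = f"{base_dir}/teravalidate"
    return [
        ["hadoop", "jar", examples_jar, "teragen", str(rows), generated],
        ["hadoop", "jar", examples_jar, "terasort", generated, sorted_out],
        ["hadoop", "jar", examples_jar, "teravalidate", sorted_out, report],
    ]


if __name__ == "__main__":
    # Hypothetical jar path and directories, for illustration only.
    jar = "/path/to/hadoop-mapreduce-examples.jar"
    for label, base in [("encrypted", "/secure/bench"), ("plain", "/tmp/bench")]:
        print(f"# {label}")
        for cmd in terasort_commands(jar, base, 10_000_000):
            print(" ".join(cmd))
```

Timing the same row count in both directories on the same hardware gives a like-for-like view of the encryption overhead.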
Created 10-09-2018 05:53 PM
@Geoffrey Shelton Okot: Thanks for your detailed explanation, this helps a lot.
Created 11-11-2018 09:49 PM
Nice to know the explanation helped, but the best way to master it is to try it out. Could you please take some time to "Accept" my response so other HCC members can easily reference it?