Created 10-21-2015 06:57 PM
There is a recommendation that local disks should be encrypted for intermediate data; we need more information on this. Why is this so? How do we propose encrypting the disks? Is this because Tez stores intermediate data on local disks? MapReduce also stores data on local disk, in the location set by the "mapreduce.cluster.local.dir" parameter, so that has to be encrypted too, right?
So what are the best practices to encrypt the local disks for intermediate data? What is the manual effort involved? Is Hadoop Encrypted Shuffle enough?
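For reference, Hadoop Encrypted Shuffle only protects map output while it is transferred between nodes over HTTPS; it does not encrypt what gets spilled to local disk. A minimal sketch of the toggle in mapred-site.xml (assuming the keystores and truststores referenced by ssl-server.xml and ssl-client.xml are already in place on every node):

<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
  <description>Serve shuffle transfers over HTTPS instead of plain HTTP</description>
</property>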
Created 10-21-2015 07:25 PM
This capability allows encryption of the intermediate files generated during the merge and shuffle phases. It can be enabled by setting the mapreduce.job.encrypted-intermediate-data job property to true. The relevant properties and their defaults, as defined in mapred-default.xml, are:
<property>
  <name>mapreduce.job.encrypted-intermediate-data</name>
  <value>false</value>
  <description>Encrypt intermediate MapReduce spill files or not; default is false</description>
</property>
<property>
  <name>mapreduce.job.encrypted-intermediate-data-key-size-bits</name>
  <value>128</value>
  <description>MapReduce encryption key size in bits; default is 128</description>
</property>
<property>
  <name>mapreduce.job.encrypted-intermediate-data.buffer.kb</name>
  <value>128</value>
  <description>Buffer size for intermediate encrypted data in KB; default is 128</description>
</property>
NOTE: Currently, enabling encrypted intermediate data spills restricts the job to a single attempt (no retries).
This feature is only available in MRv2.
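To enable it cluster-wide rather than per job, a minimal sketch of the override in mapred-site.xml (same property name as above; setting the value to true is the only change, and the key-size and buffer properties can stay at their defaults):

<property>
  <name>mapreduce.job.encrypted-intermediate-data</name>
  <value>true</value>
  <description>Encrypt intermediate MapReduce spill files</description>
</property>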
Created 10-21-2015 09:31 PM
@amcbarnett@hortonworks.com The concern is around who can get access to the keys even if you are encrypting the MapReduce shuffle. Local disk encryption is for scenarios where someone can take a disk out and read the data. Customers should also adopt other methods (OS-level access controls) to prevent users from getting access to the nodes where the intermediate data might be stored.
Created 11-04-2015 06:06 PM
Trying to distill this to a best practice, is the following a correct understanding?
In order to ensure data isn't ever written unencrypted (even during the shuffle), am I correct in recommending that the best approach here is to set up OS-level encryption for the partitions that store MapReduce (hadoop.tmp.dir) and Tez temporary data? Then we ensure that the HDFS data directories are on a separate, unencrypted partition, where we can let HDFS native encryption selectively encrypt specified zones.
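To make that concrete, here is a sketch of how the directories could be laid out; the mount points /encrypted/hadoop and /hadoop/hdfs are purely illustrative examples of an OS-encrypted partition and a plain partition, not recommended paths:

core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/encrypted/hadoop/tmp</value> <!-- on the OS-encrypted partition -->
</property>

yarn-site.xml:
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/encrypted/hadoop/yarn/local</value> <!-- MapReduce/Tez intermediate data lands here -->
</property>

hdfs-site.xml:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/hadoop/hdfs/data</value> <!-- plain partition; HDFS encryption zones handle selective encryption -->
</property>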
Created 11-04-2015 06:12 PM
Sounds right. @rvenkatesh@hortonworks.com @bdurai@hortonworks.com can you confirm?