
Transparent Data Encryption (TDE) and Local Disks encryption for intermediate data


There is a recommendation that local disks should be encrypted for intermediate data; I need more info on this. Why is this so? How do we propose encrypting the disks? Is this because Tez stores intermediate data on local disks? MapReduce also stores data on local disks, in the location set by the "mapreduce.cluster.local.dir" parameter, so that has to be encrypted too, right?

So what are the best practices to encrypt the local disks for intermediate data? What is the manual effort involved? Is Hadoop Encrypted Shuffle enough?

1 ACCEPTED SOLUTION


@amcbarnett@hortonworks.com The concern is around who can get access to the keys even if you are encrypting the MapReduce shuffle. Local disk encryption is for scenarios where someone can take the disk out and read the data. Customers should adopt other methods (OS-level access controls) to prevent users from getting access to nodes where the intermediate data might be stored.


4 REPLIES


This capability encrypts the intermediate files generated during the merge and shuffle phases. It is enabled by setting the mapreduce.job.encrypted-intermediate-data job property to true. The defaults, defined in mapred-default.xml, are:

<property>
  <name>mapreduce.job.encrypted-intermediate-data</name>
  <value>false</value>
  <description>Encrypt intermediate MapReduce spill files or not
  default is false</description>
</property>

<property>
  <name>mapreduce.job.encrypted-intermediate-data-key-size-bits</name>
  <value>128</value>
  <description>Mapreduce encrypt data key size default is 128</description>
</property>

<property>
  <name>mapreduce.job.encrypted-intermediate-data.buffer.kb</name>
  <value>128</value>
  <description>Buffer size for intermediate encrypt data in kb
  default is 128</description>
</property>

NOTE: Currently, enabling encrypted intermediate data spills restricts the number of job attempts to 1.

This feature is only available in MRv2.
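To turn the feature on rather than rely on the defaults above, override the first property in mapred-site.xml (or pass it for a single job with -Dmapreduce.job.encrypted-intermediate-data=true):

```xml
<property>
  <name>mapreduce.job.encrypted-intermediate-data</name>
  <value>true</value>
  <description>Encrypt intermediate MapReduce spill files</description>
</property>
```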


@bganesan@hortonworks.com

Trying to distill this into a best practice: is the following a correct understanding?

To ensure data is never written unencrypted (even during shuffle), am I correct in recommending that the best approach here is to set up OS-level encryption for the partitions that store the MapReduce ($hadoop.tmp.dir) and Tez temporary data? Then we keep the HDFS data directories on a separate, unencrypted partition, where we let HDFS Native Encryption selectively encrypt specified zones.
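For the HDFS Native Encryption piece, the NameNode and clients need a key provider configured. A minimal sketch in core-site.xml, assuming a KMS is already running at the hypothetical host kms.example.com (the port varies by Hadoop version):

```xml
<property>
  <name>hadoop.security.key.provider.path</name>
  <!-- Hypothetical KMS endpoint; replace host and port for your cluster -->
  <value>kms://http@kms.example.com:9600/kms</value>
</property>
```

A zone can then be created with `hadoop key create mykey` followed by `hdfs crypto -createZone -keyName mykey -path /secure` on an empty directory.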


Sounds right. @rvenkatesh@hortonworks.com @bdurai@hortonworks.com, can you confirm?