Transparent Data Encryption (TDE) and Local Disks encryption for intermediate data

There is a recommendation that local disks should be encrypted for intermediate data; I need more info on this. Why is this so? How do we propose encrypting the disks? Is this because Tez stores intermediate data on local disks? MapReduce also stores data on local disk, in the directories set by the "mapreduce.cluster.local.dir" parameter. So this has to be encrypted, right?

So what are the best practices to encrypt the local disks for intermediate data? What is the manual effort involved? Is Hadoop Encrypted Shuffle enough?

1 ACCEPTED SOLUTION

Re: Transparent Data Encryption (TDE) and Local Disks encryption for intermediate data

Contributor

@amcbarnett@hortonworks.com The concern is about who can get access to the keys, even if you are encrypting the MapReduce shuffle. Local disk encryption is for scenarios where someone can take the disk out and read the data. Customers should adopt other methods (OS-level access controls) to prevent users from getting access to the nodes where the intermediate data might be stored.

4 REPLIES

Re: Transparent Data Encryption (TDE) and Local Disks encryption for intermediate data

This capability allows encryption of the intermediate files generated during the merge and shuffle phases. It is enabled by setting the mapreduce.job.encrypted-intermediate-data job property to true, either in mapred-site.xml or in the job configuration. The related properties, with their default values as listed in mapred-default.xml, are:

<property>
  <name>mapreduce.job.encrypted-intermediate-data</name>
  <value>false</value>
  <description>Encrypt intermediate MapReduce spill files or not
  default is false</description>
</property>

<property>
  <name>mapreduce.job.encrypted-intermediate-data-key-size-bits</name>
  <value>128</value>
  <description>Mapreduce encrypt data key size default is 128</description>
</property>

<property>
  <name>mapreduce.job.encrypted-intermediate-data.buffer.kb</name>
  <value>128</value>
  <description>Buffer size for intermediate encrypt data in kb
  default is 128</description>
</property>

NOTE: Currently, enabling encrypted intermediate data spills would restrict the number of attempts of the job to 1.

This feature is available only in MR2.
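
As a rough sketch of what turning this on could look like, the fragment below overrides the first property in mapred-site.xml. The key size and buffer size are left at the defaults shown above; placing the override cluster-wide rather than per job is an assumption here, not a recommendation from this thread.

```xml
<!-- Sketch for mapred-site.xml: an illustrative override, not a shipped default -->
<property>
  <name>mapreduce.job.encrypted-intermediate-data</name>
  <!-- true enables encryption of intermediate spill files -->
  <value>true</value>
</property>
```

For a single job, the same property can usually be passed on the command line as -Dmapreduce.job.encrypted-intermediate-data=true, provided the job's driver parses generic options (for example via ToolRunner).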

Re: Transparent Data Encryption (TDE) and Local Disks encryption for intermediate data

@bganesan@hortonworks.com

Trying to distill this into a best practice: is the following a correct understanding?

To ensure data is never written unencrypted (even during shuffle), am I correct in recommending that the best approach is to set up OS-level encryption for the partitions that store MapReduce ($hadoop.tmp.dir) and Tez temporary data? We would then keep the HDFS data directories on a separate, unencrypted partition, where HDFS Native Encryption can selectively encrypt specified zones.
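
To make the layout concrete, here is a minimal sketch of how the two kinds of directories might be separated in configuration. The mount points /encrypted/yarn/local and /data/hdfs/dn are hypothetical; it assumes the first sits on an OS-encrypted partition, the second on a plain one, and that jobs run on YARN so that container-local intermediate data lands under the NodeManager local directories.

```xml
<!-- yarn-site.xml: NodeManager local dirs on the OS-encrypted partition (hypothetical path) -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/encrypted/yarn/local</value>
</property>

<!-- hdfs-site.xml: DataNode block storage on a separate, unencrypted partition (hypothetical path);
     HDFS encryption zones protect the selected data here -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/hdfs/dn</value>
</property>
```

If yarn.nodemanager.local-dirs is not set explicitly, it defaults to a directory under ${hadoop.tmp.dir}, which is why encrypting the partition that holds hadoop.tmp.dir has the same effect.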

Re: Transparent Data Encryption (TDE) and Local Disks encryption for intermediate data

Contributor

Sounds right. @rvenkatesh@hortonworks.com @bdurai@hortonworks.com can you confirm?
