New Contributor
Posts: 2
Registered: ‎02-08-2015

Securely transferring data from HDFS to amazon S3 using distcp

We want to back up the HDFS data in our Cloudera Hadoop cluster to Amazon S3. It looks like we can use distcp for this, but it is not clear whether the data is copied to S3 over an encrypted transport (SSL/TLS).

Is there something that needs to be configured to enable SSL/TLS for distcp?

Also, I see Amazon has its own flavour of distcp called S3DistCp. But the documentation for S3DistCp says it stages a temporary copy of the output in HDFS on the cluster. For example, if you copy 500 GB of data from HDFS to S3, S3DistCp copies the entire 500 GB into a temporary directory in HDFS, then uploads the data to Amazon S3 from the temporary directory. This is not insignificant if one has a large cluster.

Does distcp have this same behaviour? I could not tell from the documentation.

New Contributor
Posts: 2
Registered: ‎02-08-2015

Re: Securely transferring data from HDFS to amazon S3 using distcp

After a bit more digging, it looks like Hadoop distcp uses the JetS3t library to communicate with Amazon S3.

From the JetS3t docs, it looks like HTTPS is the default:

RestS3Service

s3service.https-only: If true, all communication with S3 will be via encrypted HTTPS connections; otherwise communications will be sent unencrypted via HTTP.
Default: true

http://jets3t.s3.amazonaws.com/toolkit/configuration.html

Posts: 1,754
Kudos: 371
Solutions: 279
Registered: ‎07-31-2013

Re: Securely transferring data from HDFS to amazon S3 using distcp

You may also be interested in the S3A (s3a://) connector shipped in
CDH 5.3.0 onwards, which uses the AWS SDK directly and also uses HTTPS.

Expert Contributor
Posts: 113
Registered: ‎02-15-2016

Re: Securely transferring data from HDFS to amazon S3 using distcp


Can anyone confirm this? Does S3A use HTTPS by default, so that we don't need to enable TLS/SSL at the cluster level to transfer data securely from HDFS to S3?

Expert Contributor
Posts: 113
Registered: ‎02-15-2016

Re: Securely transferring data from HDFS to amazon S3 using distcp

Yes, S3A uses an SSL connection to S3 by default:

fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 (default: true)

You can verify it with a bucket policy that denies any request not made over HTTPS:

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::XXXX-dev-cluster/*",
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    }
  ]
}
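
In case it helps, here is a rough sketch of how a policy like the one above could be written to a file and attached with the AWS CLI. The bucket name example-dev-cluster and the file name deny-http-policy.json are placeholders I made up, not values from this thread. The script only prints the aws s3api put-bucket-policy command instead of running it, so you can check the file first.

```shell
# Hypothetical bucket and file names; replace with your own.
BUCKET="${BUCKET:-example-dev-cluster}"
POLICY_FILE="${POLICY_FILE:-./deny-http-policy.json}"

# Write the deny-non-HTTPS policy to a file, substituting the bucket name.
cat > "$POLICY_FILE" <<EOF
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::${BUCKET}/*",
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    }
  ]
}
EOF

# Printed, not executed. Running it assumes the AWS CLI is installed and
# your credentials are allowed to change the bucket policy.
echo "aws s3api put-bucket-policy --bucket $BUCKET --policy file://$POLICY_FILE"
```

Note that attaching a policy this way replaces whatever policy the bucket already has, so merge it with any existing statements first.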

 

Now distcp with -Dfs.s3a.connection.ssl.enabled=false will fail, since the bucket policy allows only HTTPS requests:

[25761081@xxxxxxxxxxx ~]$ hadoop distcp -Dfs.s3a.access.key=xxxxxxxxxxxxx -Dfs.s3a.secret.key=xxxxxxxxxxxxxxxxx -Dfs.s3a.proxy.host=xxxxxxxxxx.com -Dfs.s3a.proxy.port=80 -Dfs.s3a.connection.ssl.enabled=false /user/hive/warehouse/db.db/t1 s3a://xxxxxx-dev-cluster/
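
For comparison, a minimal sketch of the same copy with SSL left on, which a deny-non-HTTPS policy should let through. The run_distcp helper and the example-dev-cluster bucket are hypothetical names of mine. It assumes the S3A access key, secret key and any proxy settings are already configured (for example in core-site.xml, or passed with -D as in the command above). By default it only prints the command; set DRY_RUN=0 to actually run it.

```shell
# Print (DRY_RUN=1, the default here) or run a distcp to S3A with SSL left on.
# fs.s3a.connection.ssl.enabled already defaults to true; it is spelled out
# only to make the contrast with the failing command obvious.
run_distcp() {
  src="$1"
  dest="$2"
  set -- hadoop distcp -Dfs.s3a.connection.ssl.enabled=true "$src" "$dest"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Source path taken from the post; the bucket name is a placeholder.
run_distcp /user/hive/warehouse/db.db/t1 s3a://example-dev-cluster/
```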

 

 
