02-08-2015 04:37 PM
We want to back up the HDFS data in our Cloudera Hadoop cluster to Amazon S3. It looks like we can use distcp for this, but what is not clear is whether the data is copied to S3 over an encrypted SSL/TLS transport.
Is there something that needs to be configured to enable SSL/TLS for distcp?
Also, I see Amazon has their own flavour of distcp called s3distcp. But the documentation for s3distcp says it stages a temporary copy of the output in HDFS on the cluster. For example, if you copy 500 GB of data from HDFS to S3, S3DistCp copies the entire 500 GB into a temporary directory in HDFS, then uploads the data to Amazon S3 from that temporary directory - this is not insignificant if one has a large amount of data to copy.
Does distcp have this same behaviour? I could not tell from the documentation.
02-10-2015 07:00 AM
After a bit more digging, it looks like Hadoop distcp uses the JetS3t library to communicate with Amazon S3.
From the JetS3t docs it looks like HTTPS is the default:
s3service.https-only - If true, all communication with S3 will be via encrypted HTTPS connections; otherwise communications will be sent unencrypted via HTTP.
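If you wanted to be explicit rather than rely on the default, JetS3t reads its settings from a jets3t.properties file on the classpath. A minimal sketch (the property name is taken from the docs quoted above; the file location on your cluster may vary):

```properties
# jets3t.properties - place on the Hadoop classpath
# Force all S3 communication over encrypted HTTPS (this is the default)
s3service.https-only=true
```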
05-28-2018 11:31 AM - edited 05-28-2018 11:32 AM
Can anyone confirm this? S3A uses the HTTPS protocol by default, so we don't need to enable TLS/SSL at the cluster level to transfer data securely from HDFS to S3?
05-29-2018 10:51 AM
Yes, s3a uses an SSL connection to S3 by default:
fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 (default: true)
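For reference, the same property can be pinned in core-site.xml instead of being passed on every distcp invocation. A minimal sketch (the property name is the standard Hadoop s3a setting mentioned above; true is already the default):

```xml
<!-- core-site.xml: force s3a to use HTTPS for all S3 traffic -->
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>true</value>
</property>
```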
And you can verify it with a bucket policy: a distcp run with -Dfs.s3a.connection.ssl.enabled=false will fail, since the bucket policy allows only HTTPS requests.
[25761081@xxxxxxxxxxx ~]$ hadoop distcp -Dfs.s3a.access.key=xxxxxxxxxxxxx -Dfs.s3a.secret.key=xxxxxxxxxxxxxxxxx -Dfs.s3a.proxy.host=xxxxxxxxxx.com -Dfs.s3a.proxy.port=80 -Dfs.s3a.connection.ssl.enabled=false /user/hive/warehouse/db.db/t1 s3a://xxxxxx-dev-cluster/
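For anyone who wants to enforce this on the S3 side, a sketch of the kind of bucket policy being described: it denies any request that does not arrive over TLS, using the aws:SecureTransport condition key. The bucket name here is a placeholder, not the one from the thread:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-dev-cluster",
        "arn:aws:s3:::example-dev-cluster/*"
      ],
      "Condition": {
        "Bool": { "aws:SecureTransport": "false" }
      }
    }
  ]
}
```

With a policy like this attached, the distcp command above with -Dfs.s3a.connection.ssl.enabled=false is rejected by S3, while the default HTTPS transfers succeed.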