We want to back up the HDFS data in our Cloudera Hadoop cluster to Amazon S3. It looks like we can use distcp for this, but what is not clear is whether the data is copied to S3 over an encrypted transport (SSL/TLS).
Is there something that needs to be configured to enable using SSL/TLS for distcp?
Also, I see Amazon has their own flavour of distcp called s3distcp. But the documentation for s3distcp says it stages a temporary copy of the output in HDFS on the cluster. For example, if you copy 500 GB of data from HDFS to S3, S3DistCp copies the entire 500 GB into a temporary directory in HDFS, then uploads the data to Amazon S3 from that temporary directory - this is not insignificant if one has a large cluster.
Does distcp have this same behaviour? I could not tell from the documentation.
After a bit more digging, it looks like Hadoop distcp uses the JetS3t library to communicate with Amazon S3.
From the JetS3t docs it looks like HTTPS is the default:
s3service.https-only: If true, all communication with S3 will be via encrypted HTTPS connections, otherwise communications will be sent unencrypted via HTTP.
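For the older JetS3t-backed connectors, that property is read from a jets3t.properties file on the Hadoop classpath. A minimal sketch, making the documented default explicit (the property name is from the JetS3t docs quoted above):

```properties
# jets3t.properties -- placed on the Hadoop classpath
# Keep all JetS3t communication with S3 on HTTPS (this is already the default)
s3service.https-only=true
```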
Can anyone confirm this: S3A defaults to HTTPS, so we do not need to enable TLS/SSL at the cluster level to transfer data securely from HDFS to S3?
Yes, S3A uses an SSL connection to S3 by default:
fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 (default: true)
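To make that default explicit (or to guard against it being overridden elsewhere), the property can be pinned in core-site.xml. A minimal sketch:

```xml
<!-- core-site.xml: keep S3A traffic on HTTPS (true is already the default) -->
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>true</value>
</property>
```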
You can verify this with a bucket policy that denies non-HTTPS requests.
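For example, a bucket policy using the standard aws:SecureTransport condition to deny plain-HTTP requests (the bucket name my-bucket is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "Bool": { "aws:SecureTransport": "false" }
      }
    }
  ]
}
```

With this policy attached, any request arriving over plain HTTP is rejected with an access-denied error, regardless of the client's credentials.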
Now distcp with -Dfs.s3a.connection.ssl.enabled=false will fail, since the bucket policy allows only HTTPS requests:
[25761081@xxxxxxxxxxx ~]$ hadoop distcp -Dfs.s3a.access.key=xxxxxxxxxxxxx -Dfs.s3a.secret.key=xxxxxxxxxxxxxxxxx -Dfs.s3a.proxy.host=xxxxxxxxxx.com -Dfs.s3a.proxy.port=80 -Dfs.s3a.connection.ssl.enabled=false /user/hive/warehouse/db.db/t1 s3a://xxxxxx-dev-cluster/
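Leaving the SSL setting at its default (or setting it to true explicitly) should let the same copy succeed against such a bucket. A sketch of the command, with the credentials and bucket name elided as placeholders:

```
hadoop distcp \
  -Dfs.s3a.access.key=... \
  -Dfs.s3a.secret.key=... \
  -Dfs.s3a.connection.ssl.enabled=true \
  /user/hive/warehouse/db.db/t1 s3a://my-dev-cluster/
```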