Created 05-25-2016 05:02 PM
Hi,
I am trying to transfer HDFS files securely between two clusters using
hadoop distcp hsftp://<host1>:50470/srcPath hdfs://<host2>:8020/destPath.
"HSFTP, uses HTTPS by default. This means that data will be encrypted in transit"
The source cluster is secured with SSL set up on all nodes, and dfs.http.policy is set to HTTP_AND_HTTPS. The destination cluster has the source cluster's truststore.
I understand that when we run the distcp hsftp command on the destination cluster, it talks to the source NameNode on port 50470, which is secure. Does that mean the actual data transfer from the DataNodes is also secure? If so, can someone explain how it works?
Created 06-14-2016 05:43 AM
The way this works is that the HTTP client first initiates a call to the NameNode using either the "http" or "https" scheme. For a file read or write operation, the NameNode selects an appropriate DataNode and sends an HTTP 302 redirect response back to the client, telling it to reconnect to that DataNode to complete its request. When the NameNode performs this redirect, it detects the scheme of the incoming call and preserves that scheme in the Location header of the HTTP 302 redirect response. Thus, for a request that arrives at the NameNode via "http", the redirection points to an "http" URL on a DataNode, and for a request that arrives at the NameNode via "https", the redirection points to an "https" URL on a DataNode.
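To make the scheme-preserving redirect concrete, here is a minimal Python sketch. It is not Hadoop code. The function name `redirect_location`, the example hostnames, and the path are hypothetical, and the port table assumes the usual Hadoop 2.x defaults (50075 for DataNode HTTP, 50475 for DataNode HTTPS), which a cluster may override.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed Hadoop 2.x default DataNode web ports; real clusters may differ.
DATANODE_PORTS = {"http": 50075, "https": 50475}


def redirect_location(request_url: str, datanode_host: str) -> str:
    """Build the Location value a NameNode-like server might return.

    The scheme of the incoming request is kept as-is; only the host and
    port are swapped to point at the selected DataNode.
    """
    parts = urlsplit(request_url)
    if parts.scheme not in DATANODE_PORTS:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    netloc = f"{datanode_host}:{DATANODE_PORTS[parts.scheme]}"
    # Path and query are carried over unchanged; the fragment is dropped.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, ""))


if __name__ == "__main__":
    # An https request to the NameNode yields an https DataNode URL.
    print(redirect_location("https://nn.example.com:50470/srcPath", "dn1.example.com"))
```

Under these assumptions, a client that starts with hsftp (HTTPS to port 50470) is redirected to an HTTPS URL on the DataNode, so the file bytes also travel over TLS.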
Created 06-14-2016 12:50 AM
When data is transferred from a secure cluster to an unsecure cluster via distcp, the user needs to set ipc.client.fallback-to-simple-auth-allowed=true on the secure machine; otherwise the distcp operation will fail with a permission error.
When ipc.client.fallback-to-simple-auth-allowed is set to true, the HDFS client switches to SASL SIMPLE (unsecure) authentication.
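As a rough illustration of passing that property per job rather than editing the site configuration, the Python sketch below assembles a distcp command line using Hadoop's generic `-D` option. The helper name `distcp_command` and the example NameNode hostnames are hypothetical; the sketch only builds the argument list and does not run anything.

```python
import shlex
from typing import List


def distcp_command(src: str, dst: str, allow_simple_fallback: bool = False) -> List[str]:
    """Return the argv for a distcp run, optionally enabling auth fallback."""
    cmd = ["hadoop", "distcp"]
    if allow_simple_fallback:
        # Generic -D options must come before the source and destination paths.
        cmd.append("-Dipc.client.fallback-to-simple-auth-allowed=true")
    cmd += [src, dst]
    return cmd


if __name__ == "__main__":
    argv = distcp_command(
        "hdfs://secure-nn.example.com:8020/srcPath",      # hypothetical source
        "hdfs://insecure-nn.example.com:8020/destPath",   # hypothetical destination
        allow_simple_fallback=True,
    )
    print(shlex.join(argv))
```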
Created 06-14-2016 06:22 PM
Hi Yvora,
I didn't set this property and didn't face any permission issue. We are using hsftp and captured packets in transit. The data is encrypted, and communication is happening over the secure ports (50470, 50475). Please confirm.