I have two clusters behind a firewall and I would like to run distcp to copy data from one cluster to the other. Which ports do I need to open in the firewall for this communication? For example, I know I need 50070 to the NameNode, but what other ports are required?
Sorry, it still isn't clear to me what the source and destination on the ACL would be.
Let's say we have clusters A and B.
Cluster A's datanodes are datanode-A-[1,10] and its namenode is namenode-A-1. Cluster B's datanodes are datanode-B-[1,10] and its namenode is namenode-B-1.
1) Do I initiate "distcp" on a host on Cluster A or B?
2) Do ports need to be opened up A->B or B->A?
3) If it is A->B, then what hosts on A need access to what hosts on B?
4) If it is B->A, then what hosts on B need access to what hosts on A?
The simple answer is to open up the ports in a bidirectional manner on all the hosts. For instance:
On each node in cluster A: allow connectivity to ports 1004 (or 50010 without Kerberos) and 50020 on each datanode in cluster B, as well as to port 8020 on the namenodes in cluster B.
On each node in cluster B: allow connectivity to ports 1004 (or 50010 without Kerberos) and 50020 on each datanode in cluster A, as well as to port 8020 on the namenodes in cluster A.
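To make the ACL concrete, here is a rough Python sketch that enumerates the (source, destination, port) entries implied by the two rules above. It assumes datanode-A-[1,10] means hosts datanode-A-1 through datanode-A-10, that "each node" includes the namenode, and that your clusters use the ports quoted above. The function and variable names are made up for illustration, so adjust them to your setup.

```python
from itertools import product

NAMENODE_RPC_PORT = 8020    # namenode port quoted in the answer
DATANODE_IPC_PORT = 50020   # datanode port quoted in the answer


def datanode_transfer_port(kerberos):
    """Datanode transfer port: 1004 with Kerberos, 50010 without."""
    return 1004 if kerberos else 50010


def cluster_hosts(name):
    """Hypothetical host names, assuming [1,10] means 1 through 10."""
    namenodes = [f"namenode-{name}-1"]
    datanodes = [f"datanode-{name}-{i}" for i in range(1, 11)]
    return namenodes, datanodes


def build_rules(src_hosts, dst_namenodes, dst_datanodes, kerberos=False):
    """List (source, destination, port) entries for one direction."""
    rules = []
    # Every source node needs to reach the remote namenode(s).
    for src, nn in product(src_hosts, dst_namenodes):
        rules.append((src, nn, NAMENODE_RPC_PORT))
    # Every source node needs both datanode ports on every remote datanode.
    dn_ports = (datanode_transfer_port(kerberos), DATANODE_IPC_PORT)
    for src, dn, port in product(src_hosts, dst_datanodes, dn_ports):
        rules.append((src, dn, port))
    return rules


nn_a, dn_a = cluster_hosts("A")
nn_b, dn_b = cluster_hosts("B")

# Bidirectional setup: one rule set per direction.
a_to_b = build_rules(nn_a + dn_a, nn_b, dn_b)
b_to_a = build_rules(nn_b + dn_b, nn_a, dn_a)

for src, dst, port in a_to_b[:3]:
    print(f"allow {src} -> {dst}:{port}")
```

If you only ever run distcp from one side, you would presumably only need the rule set for that direction.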
However, you are right: where distcp is executed determines the source and destination of the connections. Executing distcp on cluster A causes a MapReduce job to run on cluster A. Each datanode there may run a task that connects to the namenode(s) on cluster B for block locations and then to the datanodes on cluster B for the transfer. I'm not sure whether the node that distcp is launched from needs this access as well, so I generally run distcp on one of the datanodes.
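Before kicking off the copy, it may be worth checking from one of the nodes that the firewall really lets the connections through. Below is a minimal sketch using only Python's standard socket module. The helper names are hypothetical, and a successful TCP connect only shows that the port is reachable, not that distcp itself will succeed.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def unreachable(targets, timeout=3.0):
    """Return the (host, port) pairs from targets that could not be reached."""
    return [(host, port) for host, port in targets
            if not can_connect(host, port, timeout)]
```

For example, on datanode-A-1 you could call unreachable([("namenode-B-1", 8020), ("datanode-B-1", 50010), ("datanode-B-1", 50020)]). An empty list means all three ports answered. The host names are the ones from the question, and the ports should be swapped for whatever your clusters actually use.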