What firewall ports should I open for distcp between clusters?

I have two clusters behind a firewall and I would like to run distcp to copy data from one cluster to the other. Which ports should I open in the firewall for this communication? For example, I know I need 50070 to the NameNode, but what other ports are required?

1 ACCEPTED SOLUTION

Thanks for logging a case. Just for completeness, here's the answer:

Datanode: 1004 (with Kerberos), 50010 (without Kerberos), 50020 (always)
Namenode: 8020

The list of ports is documented here for future reference:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_ig_ports.html
Regards,
Gautam Gopalakrishnan
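
To confirm that the firewall actually lets these ports through, a plain TCP connect test from a host on the other cluster can help. The following is a minimal Python sketch and not part of the original answer: the function names are made up for illustration, the host names you pass in are your own, and a successful connect only shows that the port is reachable, not that HDFS itself is healthy.

```python
import socket

# Port numbers are the ones listed in the answer above.
DATANODE_PORTS_KERBEROS = (1004, 50020)
DATANODE_PORTS_PLAIN = (50010, 50020)
NAMENODE_PORTS = (8020,)


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_remote_cluster(namenodes, datanodes, kerberos=True, timeout=3.0):
    """Return a sorted list of (host, port) pairs that could not be reached."""
    dn_ports = DATANODE_PORTS_KERBEROS if kerberos else DATANODE_PORTS_PLAIN
    targets = [(h, p) for h in namenodes for p in NAMENODE_PORTS]
    targets += [(h, p) for h in datanodes for p in dn_ports]
    return sorted(t for t in targets if not port_is_reachable(*t, timeout=timeout))
```

Running `check_remote_cluster(...)` with the remote cluster's namenode and datanode host names from each host that will run distcp tasks returns the pairs that are still blocked; an empty list means every port answered.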

4 REPLIES

I have a case open with Cloudera support to get an answer. 

Sorry, it still isn't clear: what would the source and destination on the ACL be?

Let's say we have clusters A and B.

Cluster A's datanodes are datanode-A-[1,10] and its namenode is namenode-A-1. Cluster B's datanodes are datanode-B-[1,10] and its namenode is namenode-B-1.

1) Do I initiate "distcp" on a host on Cluster A or B?

2) Do ports need to be opened up A->B or B->A?

3) If it is A->B, then what hosts on A need access to what hosts on B?

4) If it is B->A, then what hosts on B need access to what hosts on A?

The simple answer is to open the ports in both directions on all the hosts. For instance:

On each node in cluster A: allow connectivity to 1004 (or 50010 without Kerberos) and 50020 on each datanode in cluster B, as well as 8020 on the namenodes in cluster B.

On each node in cluster B: allow connectivity to 1004 (or 50010 without Kerberos) and 50020 on each datanode in cluster A, as well as 8020 on the namenodes in cluster A.

However, you are right: where distcp is executed determines the source and destination of the connections. Executing distcp on cluster A causes a MapReduce job to run on cluster A. Each datanode there may run a task that connects to the namenode(s) on cluster B for block locations and then to the datanodes on cluster B for the transfer. I'm not sure whether the node distcp is executed on needs access as well, so I generally run distcp on one of the datanodes.
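
To make the one-directional case concrete, here is a rough Python sketch that enumerates the source, destination, and port combinations described above for a distcp job launched on cluster A. It is only an illustration: the function name is hypothetical, the host names are modelled on the example in the question, and it assumes "[1,10]" means hosts 1 through 10.

```python
from itertools import product


def distcp_firewall_rules(job_hosts, remote_namenodes, remote_datanodes, kerberos=True):
    """List (source, destination, port) tuples to allow for a distcp job.

    job_hosts are the hosts on the cluster where the distcp tasks run.
    """
    # 1004 with Kerberos, 50010 without, as stated in the answers above.
    data_port = 1004 if kerberos else 50010
    rules = [(s, nn, 8020) for s, nn in product(job_hosts, remote_namenodes)]
    for s, dn in product(job_hosts, remote_datanodes):
        rules.append((s, dn, data_port))
        rules.append((s, dn, 50020))
    return rules


# Hypothetical host names; assumes "[1,10]" in the question means 1 through 10.
cluster_a_datanodes = [f"datanode-A-{i}" for i in range(1, 11)]
cluster_b_datanodes = [f"datanode-B-{i}" for i in range(1, 11)]

rules = distcp_firewall_rules(cluster_a_datanodes, ["namenode-B-1"], cluster_b_datanodes)
for src, dst, port in rules[:3]:
    print(f"allow {src} -> {dst}:{port}")
```

Swapping the A and B arguments gives the rules for the opposite direction, which is what the "bidirectional" suggestion amounts to.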