Expert Contributor
Posts: 62
Registered: ‎06-03-2014
Accepted Solution

What firewall ports should I open for distcp between clusters?

I have two clusters behind a firewall and I would like to run distcp to copy data from one cluster to another. Which ports should I open in the firewall for this communication? For example, I know I need 50070 to the NameNode, but what other ports are required?

Explorer
Posts: 6
Registered: ‎02-24-2014

Re: What firewall ports should I open for distcp between clusters?

I have a case open with Cloudera support to get an answer. 

Cloudera Employee
Posts: 576
Registered: ‎01-20-2014

Re: What firewall ports should I open for distcp between clusters?

Thanks for logging a case. Just for completeness, here's the answer:

Datanode: 1004 (with Kerberos), 50010 (without Kerberos), 50020 (always)
Namenode: 8020

The list of ports is documented here for future reference:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_ig_ports.html
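
The port list above can be captured in a small lookup, which is handy when scripting firewall requests. This is only a sketch in Python: the function name `hdfs_ports` and the `kerberos` flag are illustrative assumptions, and the numbers are the defaults quoted in this reply, which a cluster may override in its own configuration.

```python
from typing import Dict, List


def hdfs_ports(kerberos: bool) -> Dict[str, List[int]]:
    """Return the default HDFS ports quoted above, keyed by role.

    `hdfs_ports` is a hypothetical helper, not part of Hadoop.
    """
    # Data transfer port: 1004 on a Kerberos-secured cluster, 50010 otherwise.
    data_transfer = 1004 if kerberos else 50010
    return {
        "datanode": [data_transfer, 50020],  # 50020 is needed in both cases
        "namenode": [8020],
    }
```
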
Regards,
Gautam Gopalakrishnan
Cloudera Support
Explorer
Posts: 6
Registered: ‎02-24-2014

Re: What firewall ports should I open for distcp between clusters?

Sorry, it still isn't clear: what would the source and destination on the ACL be?

Let's say we have clusters A and B.

Cluster A's datanodes are datanode-A-[1,10] and its namenode is namenode-A-1. Cluster B's datanodes are datanode-B-[1,10] and its namenode is namenode-B-1.

1) Do I initiate "distcp" on a host on Cluster A or B?

2) Do ports need to be opened up A->B or B->A?

3) If it is A->B, then what hosts on A need access to what hosts on B?

4) If it is B->A, then what hosts on B need access to what hosts on A?

New Contributor
Posts: 2
Registered: ‎05-20-2016

Re: What firewall ports should I open for distcp between clusters?

The simple answer is to open the ports bidirectionally between all the hosts. For instance:

On each node in cluster A: allow connectivity to 1004 (or 50010 without Kerberos) and 50020 on each datanode in cluster B, as well as 8020 on the namenodes in cluster B.

On each node in cluster B: allow connectivity to 1004 (or 50010 without Kerberos) and 50020 on each datanode in cluster A, as well as 8020 on the namenodes in cluster A.
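
To make the bidirectional rule set concrete, here is a rough Python sketch that expands it into (source, destination, port) tuples for the example host names used earlier in this thread. The helper names (`cluster_hosts`, `rules`, `bidirectional_rules`) are made up for illustration, and it assumes ten datanodes and one namenode per cluster as in the example; adapt it to your real inventory.

```python
from itertools import product
from typing import List, Tuple

Rule = Tuple[str, str, int]  # (source host, destination host, destination port)


def cluster_hosts(name: str, datanodes: int = 10):
    """Build the example host names from this thread, e.g. datanode-A-1."""
    dns = [f"datanode-{name}-{i}" for i in range(1, datanodes + 1)]
    nns = [f"namenode-{name}-1"]
    return dns, nns


def rules(src: str, dst: str, kerberos: bool) -> List[Rule]:
    """Let every node in cluster `src` reach the HDFS ports in cluster `dst`."""
    src_dns, src_nns = cluster_hosts(src)
    dst_dns, dst_nns = cluster_hosts(dst)
    # 1004 with Kerberos, 50010 without; 50020 in both cases.
    dn_ports = [1004 if kerberos else 50010, 50020]
    out: List[Rule] = []
    for host in src_dns + src_nns:
        out += [(host, dn, port) for dn, port in product(dst_dns, dn_ports)]
        out += [(host, nn, 8020) for nn in dst_nns]
    return out


def bidirectional_rules(kerberos: bool) -> List[Rule]:
    """Open A->B and B->A, as suggested above."""
    return rules("A", "B", kerberos) + rules("B", "A", kerberos)
```

In practice most firewalls accept host groups or subnets, so the expanded list is mainly useful for reviewing exactly what is being opened.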

However, you are right: where distcp is executed determines the source and destination of the connections. Executing distcp on cluster A causes a MapReduce job to run on cluster A. Each datanode there may run a task that connects to the namenode(s) on cluster B for block locations and then to the datanodes on cluster B for the transfer. I'm not sure whether the node distcp is launched from needs access as well, so I generally run distcp on one of the datanodes.
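
One way to confirm the firewall change from a given node is a plain TCP connect test against the remote ports. The following is a minimal Python sketch, not a Hadoop tool; `can_connect` and `blocked_targets` are hypothetical helper names. A successful connect only shows that the port is reachable, not that distcp itself will succeed.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def blocked_targets(targets, timeout: float = 3.0):
    """Return the (host, port) pairs from `targets` that could not be reached."""
    return [(h, p) for h, p in targets if not can_connect(h, p, timeout)]
```

Run from a node in cluster A, a call such as `blocked_targets([("namenode-B-1", 8020), ("datanode-B-1", 50020)])` would list whatever is still blocked; the host names are the example names from this thread.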
