Support Questions

IT.Services · 01-08-2015

I have two clusters behind a firewall and I would like to run distcp to copy data from one cluster to the other. Which ports should I open in the firewall for this communication? For example, I know I need 50070 to the NameNode. But what other ports are required?

siddhartha.jain-1190932798 · 01-16-2015

I have a case open with Cloudera support to get an answer.

GautamG · 01-18-2015

Thanks for logging a case. Just for completeness, here's the answer:

DataNode: 1004 (with Kerberos), 50010 (without Kerberos), 50020 (always)
NameNode: 8020

The list of ports is documented here for future reference:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_ig_ports.html

Regards,
Gautam Gopalakrishnan

siddhartha.jain-1190932798 · 01-21-2015

Sorry, it still isn't clear what the source and destination on the ACL would be.

Let's say, we have clusters A and B.

Cluster A datanodes are datanode-A-[1,10] and namenode is namenode-A-1. And, Cluster B datanodes are datanode-B-[1,10] and namenode is namenode-B-1.

1) Do I initiate "distcp" on a host on Cluster A or B?

2) Do ports need to be opened up A->B or B->A?

3) If it is A->B, then what hosts on A need access to what hosts on B?

4) If it is B->A, then what hosts on B need access to what hosts on A?

JimAcxiom · 01-03-2018

The simple answer is to open up the ports in a bidirectional manner on all the hosts. For instance:

On each node in cluster A: allow connectivity to 1004 (or 50010 without Kerberos) and 50020 on each datanode in cluster B, as well as 8020 on the namenodes in cluster B.

On each node in cluster B: allow connectivity to 1004 (or 50010 without Kerberos) and 50020 on each datanode in cluster A, as well as 8020 on the namenodes in cluster A.
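
To make the bidirectional rule set above concrete, here is a minimal Python sketch that enumerates the (source, destination, port) tuples for the hostnames used in the question. The helper names `cluster_hosts` and `allow_rules` are made up for illustration, and the output is a plain list, not the syntax of any particular firewall.

```python
# Ports as given earlier in this thread.
NAMENODE_RPC = 8020
DATANODE_IPC = 50020
DATANODE_DATA_KERBEROS = 1004
DATANODE_DATA_PLAIN = 50010


def cluster_hosts(name, datanode_count=10):
    """Build the hostnames from the question, e.g. datanode-A-1 .. datanode-A-10."""
    return {
        "namenodes": [f"namenode-{name}-1"],
        "datanodes": [f"datanode-{name}-{i}" for i in range(1, datanode_count + 1)],
    }


def allow_rules(src, dst, kerberos=True):
    """Return (source_host, dest_host, dest_port) tuples for traffic src -> dst."""
    data_port = DATANODE_DATA_KERBEROS if kerberos else DATANODE_DATA_PLAIN
    rules = []
    # "Each node" in the source cluster: namenode and datanodes alike.
    for source in src["namenodes"] + src["datanodes"]:
        for namenode in dst["namenodes"]:
            rules.append((source, namenode, NAMENODE_RPC))
        for datanode in dst["datanodes"]:
            rules.append((source, datanode, data_port))
            rules.append((source, datanode, DATANODE_IPC))
    return rules


cluster_a = cluster_hosts("A")
cluster_b = cluster_hosts("B")

# Bidirectional: A -> B plus B -> A.
all_rules = allow_rules(cluster_a, cluster_b) + allow_rules(cluster_b, cluster_a)
```

If distcp is only ever launched on one cluster, the single-direction list from `allow_rules` for that cluster may be enough, but that is an assumption worth testing before tightening the ACL.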

However, you are right: where distcp is executed determines the source and destination of the connections. Executing distcp on cluster A causes a MapReduce job to run on cluster A. Each datanode may run a task that connects to the namenode(s) on cluster B for block locations and then to the datanodes on cluster B for the transfer. I'm not sure whether the node that distcp is launched from needs access as well, so I generally run distcp on one of the datanodes.
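
One way to verify the openings before launching the job is to try plain TCP connections from a node in the cluster where distcp will run. Below is a minimal sketch, assuming Python 3 is available on that node. `port_open` and `check_targets` are hypothetical helper names, and a successful connect only shows that the firewall lets the connection through, not that distcp itself will succeed.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_targets(targets, timeout=3.0):
    """Check a list of (host, port) pairs and map each pair to True/False."""
    return {(host, port): port_open(host, port, timeout) for host, port in targets}
```

For example, on datanode-A-1 one could call `check_targets([("namenode-B-1", 8020), ("datanode-B-1", 1004), ("datanode-B-1", 50020)])` (using 50010 instead of 1004 without Kerberos) and look for any `False` entries.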

Cloudera Community

What firewall ports should I open for distcp between clusters?