DistCp over Oozie vs. from shell

Contributor

Hi. We have a client who has 2 clusters. On the security cluster, they have sensitive data that they redact and copy to the analysis cluster. For security reasons, they would like to minimize the number of open ports on the security cluster. We have successfully tested using DistCp from the shell to copy the data with only port 8020 open. They would now like to automate the process through Oozie. In testing, we have run into an error saying that port 8042 (the NodeManager external port) is not open.

We do not understand why DistCp works fine without port 8042 available when run from the shell but fails when called through Oozie.

Any help would be appreciated. Thanks.

Henry

1 ACCEPTED SOLUTION

Mentor
Is the job submitted to the source cluster, or the destination? A DistCp
job should only need to contact the NodeManagers of the cluster it runs
on, but if the cluster it is submitted to is remote, then those ports may
need to be opened.

The HDFS transfer part does not involve YARN service communication at all,
so it is not expected to contact a NodeManager.

It would be helpful if you can share some more logs leading up to the
observed failure.

5 REPLIES

Mentor
Could you share the full log from this failure, both from the Oozie server
for the action ID and the action launcher job map task logs?

The 8042 port is the NodeManager HTTP port, useful in serving logs of live
containers among other status details over REST. It is not directly used by
DistCp in its functions, but MapReduce and Oozie diagnostics might be
invoking it as part of a response to a failure, so it is a secondary
symptom.

Note though that running DistCp via Oozie requires you to provide
appropriate configs that ensure delegation tokens for both kerberized
clusters are acquired. Use "mapreduce.job.hdfs-servers" with a value such
as "hdfs://namenode-cluster-1,hdfs://namenode-cluster-2" to influence this
during the Oozie server's delegation token acquisition phase. This is only
relevant if you use Kerberos on both clusters.
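
As a rough sketch of where that property would go, here is a minimal Oozie workflow with a DistCp action. Treat it as an illustration under assumptions rather than a tested configuration: the workflow name, node names, paths, and the :8020 ports are placeholders, and the schema versions may differ on your release. The Oozie DistCp action documentation gives the property with an "oozie.launcher." prefix for copies between two secure clusters, which is what is shown here; whether the prefixed or unprefixed name applies may depend on your Oozie version.

```xml
<!-- Sketch only: names, hosts, ports, and paths below are hypothetical placeholders. -->
<workflow-app name="distcp-redacted-copy" xmlns="uri:oozie:workflow:0.5">
    <start to="distcp-node"/>
    <action name="distcp-node">
        <distcp xmlns="uri:oozie:distcp-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <!-- Lists both NameNodes so delegation tokens are acquired for each cluster.
                     Only relevant when both clusters use Kerberos. -->
                <property>
                    <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
                    <value>hdfs://namenode-cluster-1:8020,hdfs://namenode-cluster-2:8020</value>
                </property>
            </configuration>
            <!-- First arg is the source, second arg is the destination. -->
            <arg>hdfs://namenode-cluster-1:8020/path/to/redacted</arg>
            <arg>hdfs://namenode-cluster-2:8020/path/to/landing</arg>
        </distcp>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>DistCp failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The property sits in the action's own configuration block, not in a cluster-wide file, so it only affects this workflow action.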

Contributor

Thanks for your reply. Could we ask a related question? Our client is very reluctant to open ports on these 2 clusters. Could you tell us which ports need to be open for DistCp to function properly? After many failures, our client has briefly allowed all the ports to be open. With that change, DistCp is working properly. We have already looked at the ports specified in https://www.cloudera.com/documentation/enterprise/latest/topics/install_ports_distcp.html#topic_9_1.

Are there any hidden or secondary ports beyond the above documentation that could be causing the problem?

Contributor
Thank you. Because of the client’s security measures, we are unable to share the log files generated. This, of course, makes everything so much more difficult.


Contributor

Harsh J: Thanks for the help on the previous issue. We finally resolved it. It was due to an undocumented port required for the CDH 6.2 to CDH 6.2 DistCp. Now we are migrating the task over to Oozie and having some trouble. Could you elaborate a bit more or give us some links or pointers? Thanks.

We could not find "mapreduce.job.hdfs-servers". Where is that set?