DistCp over Oozie vs. from shell
Labels: Apache Oozie, Security
Created on ‎03-31-2019 06:26 AM - edited ‎09-16-2022 07:16 AM
Hi. We have a client who has 2 clusters. On the security cluster, they have sensitive data that they redact and copy to the analysis cluster. For security reasons, they would like to minimize the number of open ports on the security cluster. We have successfully tested using distcp from the shell to copy the data with port 8020 open. They would now like to automate the process through Oozie. In testing, we ran into an error indicating that port 8042 (Node Manager External Port) is not open.
We do not understand why distcp works fine without port 8042 available when run through the shell but fails when called through Oozie.
Any help would be appreciated. Thanks.
Henry
Created ‎04-01-2019 06:55 PM
for the action ID and the action launcher job map task logs?
The 8042 port is the NodeManager HTTP port, useful in serving logs of live
containers among other status details over REST. It is not directly used by
DistCp in its functions, but MapReduce and Oozie diagnostics might be
invoking it as part of a response to a failure, so it is a secondary
symptom.
Note though that running DistCp via Oozie requires you to provide
appropriate configs that ensure delegation tokens for both kerberized
clusters are acquired. Set "mapreduce.job.hdfs-servers" to a value such
as "hdfs://namenode-cluster-1,hdfs://namenode-cluster-2" to influence the
Oozie server's delegation token acquisition phase. This is only
relevant if you use Kerberos on both clusters.
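For readers wondering where this property goes: one place it can be set is the configuration block of the DistCp action in workflow.xml. The fragment below is a minimal sketch, not a tested workflow. The schema versions, the action and transition names, the ${jobTracker} and ${nameNode} parameters, and the source and target paths are assumptions for illustration; the NameNode URIs are the sample values from the reply above. The Oozie DistCp action documentation describes an "oozie.launcher."-prefixed form of the same property for secure clusters, so check which form your Oozie version expects.

```xml
<!-- Sketch only: names, paths, and schema versions are illustrative assumptions. -->
<workflow-app name="distcp-example-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="copy-redacted-data"/>
  <action name="copy-redacted-data">
    <distcp xmlns="uri:oozie:distcp-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <!-- Lists both NameNodes so delegation tokens are requested for each.
               Some Oozie versions document this as
               oozie.launcher.mapreduce.job.hdfs-servers instead. -->
          <name>mapreduce.job.hdfs-servers</name>
          <value>hdfs://namenode-cluster-1,hdfs://namenode-cluster-2</value>
        </property>
      </configuration>
      <!-- First arg is the source, second is the target (hypothetical paths). -->
      <arg>hdfs://namenode-cluster-1/data/redacted</arg>
      <arg>hdfs://namenode-cluster-2/data/incoming</arg>
    </distcp>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>DistCp failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```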
Created ‎04-02-2019 06:24 AM
Thanks for your reply. Could we ask a related question? Our client is very reluctant to open the ports on these 2 clusters. Could you tell us what ports need to be open for distcp to function properly? After many failures, our client has briefly allowed all the ports to be open. With that change, distcp is working properly. We have already looked at the ports specified in https://www.cloudera.com/documentation/enterprise/latest/topics/install_ports_distcp.html#topic_9_1.
Are there any hidden ports or secondary ports beyond the above documentation that could be causing the problem?
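When narrowing down which ports matter, it can help to confirm which ones are actually reachable between the clusters before and after each firewall change. The following is a minimal sketch of a TCP reachability probe. It only shows whether a connection can be opened from the machine it runs on, and says nothing about which ports DistCp or Oozie require. The function names are hypothetical, and the hosts and port list are up to you (8020 and 8042 are the two ports named in this thread).

```python
import socket
from typing import Dict, Iterable


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe(host: str, ports: Iterable[int], timeout: float = 3.0) -> Dict[int, bool]:
    """Check each port on host and map port number to reachability."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

For example, calling probe("namenode.example.internal", [8020, 8042]) from a worker node on the other cluster returns a dict mapping each port to True or False. The host name here is a placeholder.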
Created ‎04-03-2019 06:53 PM
jobs should only need to contact the NodeManagers of the cluster they run
on, but if the cluster the job is submitted to is remote, then those ports
may need to be opened.
The HDFS transfer part does not involve YARN service communication at all,
so it is not expected to contact a NodeManager.
It would be helpful if you can share some more logs leading up to the
observed failure.
Created ‎04-04-2019 04:58 AM
Created ‎04-13-2019 01:40 AM
Harsh J: Thanks for the help on the previous issue. We finally resolved the issue. It was due to an undocumented port required in the CDH 6.2 to CDH 6.2 distcp. Now, we are migrating the task over to Oozie and having some trouble. Could you elaborate a bit more or give us some links or pointers? Thanks.
We could not find "mapreduce.job.hdfs-servers". Where is that?
