Support Questions


Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out

New Contributor

Hi Team,

 

I have a problem when running queries such as select count(*) from table, select distinct field from table, or select * from table order by field.

 

When I check through YARN, in the application details, the following error appears:

 

2017-10-03 16:22:54,826 INFO [IPC Server handler 12 on 35262] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1506588001647_0368_r_000000_0 is : 0.21282798
2017-10-03 16:22:54,827 FATAL [IPC Server handler 10 on 35262] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1506588001647_0368_r_000000_0 - exited : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#3
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:391)
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:306)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:294)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:335)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)

 

Can anyone help with this?

 

Many Thanks

4 REPLIES

Champion

Could you run the command below on both the master and the slave nodes:

netstat -anp | grep 50060

Also check whether you can ping your slave from the master and vice versa.

It looks like a connectivity issue between them.
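If it helps, here is a minimal Java sketch for the same check (the host names and timeout are placeholders, not from this thread); it simply tries to open a TCP connection to that port from another node, which is roughly what the reducer's fetcher has to do:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ShufflePortCheck {
    public static void main(String[] args) {
        // Placeholder host names; replace with your actual master and slave nodes.
        String[] hosts = {"master.example.com", "slave1.example.com"};
        int port = 50060; // the port grepped for above
        for (String host : hosts) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 5000);
                System.out.println(host + ":" + port + " is reachable");
            } catch (IOException e) {
                System.out.println(host + ":" + port + " is NOT reachable: " + e.getMessage());
            }
        }
    }
}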

New Contributor
Hi csguna,
Thanks for the support. Could you clarify: should I run the check only on the TaskTracker hosts, or also on the master host? On a host that is not a TaskTracker the port status shows as closed.

And what about the JobTracker host, where the port is 50030? When checked, should that port be closed?

Explorer

You can set the auth principals in the config, for example:

 

 

        configuration.set("yarn.nodemanager.webapp.spnego-principal", "HTTP/_HOST@DEMO.CN");
        configuration.set("yarn.resourcemanager.webapp.spnego-principal", "HTTP/_HOST@DEMO.CN");

 

 

Because of the cached intermediate files in the cluster, the reducer fetches data from other nodes while it runs, so setting the YARN web auth (SPNEGO principals) is needed.

 

So your full auth config would be:

 

 

package demo.utils;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class Auth {
    private String keytab;

    public Auth(String keytab) {
        this.keytab = keytab;
    }

    public void authorization(Configuration configuration) {
        System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
        configuration.set("hadoop.security.authentication", "Kerberos");
        configuration.set("fs.defaultFS", "hdfs://m1.DEMO.CN");
        configuration.set("dfs.namenode.kerberos.principal.pattern", "nn/*@DEMO.CN");
        configuration.set("yarn.nodemanager.principal", "nm/_HOST@DEMO.CN");
        configuration.set("yarn.resourcemanager.principal", "rm/_HOST@DEMO.CN");
        configuration.set("yarn.nodemanager.webapp.spnego-principal", "HTTP/_HOST@DEMO.CN");
        configuration.set("yarn.resourcemanager.webapp.spnego-principal", "HTTP/_HOST@DEMO.CN");
        UserGroupInformation.setConfiguration(configuration);
        try {
            UserGroupInformation.loginUserFromKeytab("user@DEMO.CN", this.keytab);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
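
For reference, a minimal usage sketch (the keytab path is a placeholder; the principal must match the one hard-coded in Auth above):

import org.apache.hadoop.conf.Configuration;

import demo.utils.Auth;

public class SubmitWithAuth {
    public static void main(String[] args) {
        Configuration configuration = new Configuration();
        // Placeholder keytab path; authorization() logs in via Kerberos
        // and applies the SPNEGO principals shown above.
        new Auth("/etc/security/keytabs/user.keytab").authorization(configuration);
        // ...build and submit the MapReduce / Hive job with this configuration...
    }
}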

 

Rising Star

Usually, the exception java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out is caused by communication issues among the Hadoop cluster nodes.

 

To resolve this issue, check the following:
 
a) Whether there are any communication problems among the Hadoop cluster nodes.
b) Whether the SSL certificate of any DataNode has expired, if the Hadoop cluster is SSL-enabled (see the sketch after this list).
c) Whether SSL changes were made without restarting the services that use SSL; if so, the issue will occur until the services in the cluster that use SSL are restarted.
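
For point (b), a minimal Java sketch for printing the expiry date of a DataNode's HTTPS certificate (the host and port are placeholders, and it assumes the certificate chain is trusted by the JVM's default truststore):

import java.net.URL;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;

import javax.net.ssl.HttpsURLConnection;

public class CertExpiryCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder DataNode HTTPS address; replace with a real host:port from your cluster.
        URL url = new URL("https://datanode1.example.com:50475/");
        HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
        conn.connect(); // fails here if the chain is untrusted or the certificate is already invalid
        for (Certificate cert : conn.getServerCertificates()) {
            if (cert instanceof X509Certificate) {
                X509Certificate x509 = (X509Certificate) cert;
                System.out.println(x509.getSubjectX500Principal() + " expires " + x509.getNotAfter());
            }
        }
        conn.disconnect();
    }
}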