error in shuffle in fetcher#1 Hadoop error

New Contributor

I'm trying to run the Hadoop pi example. It ran without any problems on a single node, but now that I'm running it on a multi-node cluster it gives the error below. If anyone could please advise.
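
For reference, the command I'm running is the bundled pi estimator with 3 maps and 10 samples per map (the examples jar name varies by Hadoop version, so the path below is only illustrative):

# Run the bundled pi estimator: 3 maps, 10 samples per map
# (adjust the examples jar path to match your Hadoop install)
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 3 10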

mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
</property>
<property>
    <name>mapred.shuffle.input.buffer.percent</name>
    <value>0.2</value>
  </property>
</configuration>
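
Since the job tracker is configured as master:54311, a minimal reachability check from the slave (just a sketch; it assumes getent and nc are available) would be:

# Confirm the slave resolves 'master' to the expected address
getent hosts master
# Confirm the JobTracker port is reachable from the slave
nc -zv master 54311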

Console output:

Number of Maps  = 3
Samples per Map = 10
14/10/11 20:34:20 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
14/10/11 20:34:54 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Starting Job
14/10/11 20:34:54 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/10/11 20:34:55 INFO input.FileInputFormat: Total input paths to process : 3
14/10/11 20:34:55 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/10/11 20:34:55 INFO mapreduce.JobSubmitter: number of splits:3
14/10/11 20:34:55 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
14/10/11 20:34:55 INFO mapreduce.Job: Running job: job_201410112034_0001
14/10/11 20:34:56 INFO mapreduce.Job:  map 0% reduce 0%
14/10/11 20:35:05 INFO mapreduce.Job:  map 33% reduce 0%
14/10/11 20:35:08 INFO mapreduce.Job:  map 100% reduce 0%
14/10/11 20:35:14 INFO mapreduce.Job:  map 100% reduce 11%
14/10/11 20:35:31 INFO mapreduce.Job: Task Id : attempt_201410112034_0001_r_000000_0, Status : FAILED
org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.checkReducerHealth(ShuffleScheduler.java:253)
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.copyFailed(ShuffleScheduler.java:187)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:234)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:149)

14/10/11 20:35:32 INFO mapreduce.Job:  map 100% reduce 0%
14/10/11 20:35:41 INFO mapreduce.Job:  map 100% reduce 11%
14/10/11 20:35:49 INFO mapreduce.Job: Task Id : attempt_201410112034_0001_m_000000_0, Status : FAILED
Too many fetch-failures
14/10/11 20:35:49 WARN mapreduce.Job: Error reading task outputhttp://userA:50060/tasklog?plaintext=true&attemptid=attempt_201410112034_0001_m_000000_0&filter=stdout
14/10/11 20:35:49 WARN mapreduce.Job: Error reading task outputhttp://userA:50060/tasklog?plaintext=true&attemptid=attempt_201410112034_0001_m_000000_0&filter=stderr
14/10/11 20:36:13 INFO mapreduce.Job: Task Id : attempt_201410112034_0001_r_000000_1, Status : FAILED
org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.checkReducerHealth(ShuffleScheduler.java:253)
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.copyFailed(ShuffleScheduler.java:187)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:234)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:149)

14/10/11 20:36:14 INFO mapreduce.Job:  map 100% reduce 0%
14/10/11 20:36:22 INFO mapreduce.Job: Task Id : attempt_201410112034_0001_m_000001_0, Status : FAILED
Too many fetch-failures
14/10/11 20:36:22 WARN mapreduce.Job: Error reading task outputhttp://userA:50060/tasklog?plaintext=true&attemptid=attempt_201410112034_0001_m_000001_0&filter=stdout
14/10/11 20:36:22 WARN mapreduce.Job: Error reading task outputhttp://userA:50060/tasklog?plaintext=true&attemptid=attempt_201410112034_0001_m_000001_0&filter=stderr
14/10/11 20:36:23 INFO mapreduce.Job:  map 100% reduce 11%
14/10/11 20:36:32 INFO mapreduce.Job:  map 100% reduce 100%
14/10/11 20:36:34 INFO mapreduce.Job: Job complete: job_201410112034_0001
14/10/11 20:36:34 INFO mapreduce.Job: Counters: 33
    FileInputFormatCounters
        BYTES_READ=354
    FileSystemCounters
        FILE_BYTES_READ=72
        FILE_BYTES_WRITTEN=252
        HDFS_BYTES_READ=765
        HDFS_BYTES_WRITTEN=215
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=1
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    Job Counters 
        Data-local map tasks=5
        Total time spent by all maps waiting after reserving slots (ms)=0
        Total time spent by all reduces waiting after reserving slots (ms)=0
        SLOTS_MILLIS_MAPS=11950
        SLOTS_MILLIS_REDUCES=80809
        Launched map tasks=5
        Launched reduce tasks=3
    Map-Reduce Framework
        Combine input records=0
        Combine output records=0
        Failed Shuffles=1
        GC time elapsed (ms)=6
        Map input records=3
        Map output bytes=54
        Map output records=6
        Merged Map outputs=3
        Reduce input groups=2
        Reduce input records=6
        Reduce output records=0
        Reduce shuffle bytes=84
        Shuffled Maps =3
        Spilled Records=12
        SPLIT_RAW_BYTES=411
Job Finished in 100.067 seconds
Estimated value of Pi is 3.60000000000000000000

EDIT:

Following sonic's comment, I tried the answer to this question: shuffle error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing out. The master and the slave can ping each other fine, but when I use curl -I http://slave:50060/ it gives this error: curl: (7) couldn't connect to host

I used this to check the port:

telnet 192.168.0.1 50060

and this was the output:

Trying 192.168.0.1... 
telnet: Unable to connect to remote host: Connection refused

Then I ran:

sudo netstat -plntu

and that was the result:

Active Internet connections (only servers) 
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name 
tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN 1746/dnsmasq 
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 693/sshd 
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN 986/cupsd 
tcp6 0 0 :::22 :::* LISTEN 693/sshd 
udp 0 0 127.0.0.1:53 0.0.0.0:* 1746/dnsmasq 
udp 0 0 0.0.0.0:68 0.0.0.0:* 12396/dhclient 
udp 0 0 0.0.0.0:5353 0.0.0.0:* 915/avahi-daemon: r 
udp 0 0 0.0.0.0:46881 0.0.0.0:* 915/avahi-daemon: r 
udp6 0 0 :::5353 :::* 915/avahi-daemon: r 
udp6 0 0 :::47491 :::* 915/avahi-daemon: r
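
Since nothing in that list is listening on 50060, it looks like the TaskTracker may not be running on the slave at all. A quick way to check (assuming the default daemon names and the usual log location under $HADOOP_HOME/logs) would be:

# List the running Hadoop JVMs; a healthy worker should show TaskTracker and DataNode
jps
# If TaskTracker is missing, its log usually says why it failed to start
tail -n 50 $HADOOP_HOME/logs/hadoop-*-tasktracker-*.log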

So if anyone could please advise.

Here are the files I'm using, which are the same on both the master and slave machines:

slaves:

master
slave

masters:

master

/etc/hosts:

192.168.0.1    master
192.168.0.2    slave
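
To rule out a name-resolution mismatch between the two machines, a check like the following (run on both master and slave) might help; hostname -f should agree with the names in /etc/hosts:

# Both machines should resolve the cluster names consistently
getent hosts master slave
# The local hostname should match the entry the other node uses
hostname -f
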
1 REPLY

Mentor
One possibility is that your services are listening on the wrong interface.

What is the output of "netstat -anp | grep 50060" specifically from your 'master' and 'slave' hosts?
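
For example, something like this on each host; the LISTEN line below is only an illustration of what a healthy TaskTracker binding would look like:

netstat -anp | grep 50060
# Healthy example:  tcp  0  0  0.0.0.0:50060  0.0.0.0:*  LISTEN  <pid>/java
# A 127.0.0.1:50060 binding, or no output at all, would explain the fetch failures.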