TLS on HDFS: MR jobs failing with "Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES"


We are facing issues while configuring TLS for HDFS/YARN. We have isolated the problem to enabling TLS for HDFS, which causes MR jobs to fail. Cloudera strongly recommends enabling TLS for HDFS when using HDFS encryption, but once we enable it, even simple "hello world" style MR jobs, such as the wordcount example below, fail with errors.

 

Please see the session output below with TLS enabled for HDFS/YARN. The same program runs without any errors when TLS is disabled. Any help will be appreciated.

 

CDH version: 5.7.4

Key Trustee Server version: 5.7.0-1.keytrustee5.7.0.p0.5 (no HA configured; single-node cluster)

Key Trustee KMS version: 5.8.2-5.KEYTRUSTEE5.8.2.p0.1

Cluster is Kerberos-enabled.

 

A few internal system names have been replaced with generic names for confidentiality.
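
For context, our understanding is that with dfs.http.policy / yarn.http.policy set to HTTPS_ONLY, the encrypted shuffle setting in mapred-site.xml and the truststore referenced by ssl-client.xml also come into play for the reduce-side fetches. This is roughly how we can inspect those settings on a NodeManager host; it is only a sketch, and the config directory below is a generic placeholder (under Cloudera Manager the effective config lives in the current process directory, not /etc/hadoop/conf):

# Placeholder: point this at the NodeManager's effective config directory.
CONF_DIR=/etc/hadoop/conf

# Is encrypted shuffle enabled to match the HTTPS-only web policies?
grep -A1 mapreduce.shuffle.ssl.enabled "$CONF_DIR/mapred-site.xml"
grep -A1 yarn.http.policy "$CONF_DIR/yarn-site.xml"
grep -A1 dfs.http.policy "$CONF_DIR/hdfs-site.xml"

# Does the truststore referenced by ssl-client.xml exist, and is it readable
# by the users that run the YARN containers?
grep -A1 ssl.client.truststore.location "$CONF_DIR/ssl-client.xml"
ls -l /path/from/ssl-client.xml/truststore.jks   # placeholder path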


[root@ ~]# hadoop jar /opt/cloudera/parcels/CDH-5.7.4-1.cdh5.7.4.p0.2/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.4.jar wordcount /tmp/ngdb_files.txt /tmp/five

17/02/22 08:17:59 INFO hdfs.DFSClient: Created token for hive: HDFS_DELEGATION_TOKEN owner=hive/192.168.11.222@DOMAINNAME, renewer=yarn, realUser=, issueDate=1487769479000, maxDate=1488374279000, sequenceNumber=398, masterKeyId=25 on ha-hdfs:cemodcluster

17/02/22 08:17:59 INFO security.TokenCache: Got dt for hdfs://cemodcluster; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:cemodcluster, Ident: (token for hive: HDFS_DELEGATION_TOKEN owner=hive/192.168.11.222@DOMAINNAME, renewer=yarn, realUser=, issueDate=1487769479000, maxDate=1488374279000, sequenceNumber=398, masterKeyId=25)

17/02/22 08:17:59 WARN token.Token: Cannot find class for token kind kms-dt

17/02/22 08:17:59 INFO security.TokenCache: Got dt for hdfs://cemodcluster; Kind: kms-dt, Service: 192.168.11.25:16000, Ident: 00 04 68 69 76 65 04 79 61 72 6e 00 8a 01 5a 65 f8 a8 5f 8a 01 5a 8a 05 2c 5f 8e 01 43 08

17/02/22 08:17:59 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm29

17/02/22 08:17:59 INFO input.FileInputFormat: Total input paths to process : 1

17/02/22 08:17:59 INFO mapreduce.JobSubmitter: number of splits:1

17/02/22 08:18:00 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1487769357731_0005

17/02/22 08:18:00 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:cemodcluster, Ident: (token for hive: HDFS_DELEGATION_TOKEN owner=hive/192.168.11.222@DOMAINNAME, renewer=yarn, realUser=, issueDate=1487769479000, maxDate=1488374279000, sequenceNumber=398, masterKeyId=25)

17/02/22 08:18:00 WARN token.Token: Cannot find class for token kind kms-dt

17/02/22 08:18:00 WARN token.Token: Cannot find class for token kind kms-dt

Kind: kms-dt, Service: 192.168.11.25:16000, Ident: 00 04 68 69 76 65 04 79 61 72 6e 00 8a 01 5a 65 f8 a8 5f 8a 01 5a 8a 05 2c 5f 8e 01 43 08

17/02/22 08:18:00 INFO impl.YarnClientImpl: Submitted application application_1487769357731_0005

17/02/22 08:18:00 INFO mapreduce.Job: The url to track the job: https://hadooppassive:8090/proxy/application_1487769357731_0005/

17/02/22 08:18:00 INFO mapreduce.Job: Running job: job_1487769357731_0005

17/02/22 08:18:07 INFO mapreduce.Job: Job job_1487769357731_0005 running in uber mode : false

17/02/22 08:18:07 INFO mapreduce.Job:  map 0% reduce 0%

17/02/22 08:18:14 INFO mapreduce.Job:  map 100% reduce 0%

17/02/22 08:19:18 INFO mapreduce.Job: Task Id : attempt_1487769357731_0005_r_000000_0, Status : FAILED

Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1

        at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)

        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)

        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:415)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)

        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

        at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)

        at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)

        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:282)

        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:323)

        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)

 

17/02/22 08:20:24 INFO mapreduce.Job: Task Id : attempt_1487769357731_0005_r_000000_1, Status : FAILED

Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1

        at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)

        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)

        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:415)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)

        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

        at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)

        at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)

        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:282)

        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:323)

        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)

 

17/02/22 08:21:29 INFO mapreduce.Job: Task Id : attempt_1487769357731_0005_r_000000_2, Status : FAILED

Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1

        at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)

        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)

        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:415)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)

        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

        at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)

        at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)

        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:282)

        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:323)

        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)

 

17/02/22 08:22:35 INFO mapreduce.Job:  map 100% reduce 100%

17/02/22 08:24:17 INFO mapreduce.Job: Job job_1487769357731_0005 failed with state FAILED due to: Task failed task_1487769357731_0005_r_000000

Job failed as tasks failed. failedMaps:0 failedReduces:1

 

17/02/22 08:24:17 INFO mapreduce.Job: Counters: 38

        File System Counters

                FILE: Number of bytes read=0

                FILE: Number of bytes written=133099

                FILE: Number of read operations=0

                FILE: Number of large read operations=0

                FILE: Number of write operations=0

                HDFS: Number of bytes read=59980

                HDFS: Number of bytes written=0

                HDFS: Number of read operations=3

                HDFS: Number of large read operations=0

                HDFS: Number of write operations=0

        Job Counters

                Failed reduce tasks=4

                Launched map tasks=1

                Launched reduce tasks=4

                Data-local map tasks=1

                Total time spent by all maps in occupied slots (ms)=10510

                Total time spent by all reduces in occupied slots (ms)=1002256

                Total time spent by all map tasks (ms)=5255

                Total time spent by all reduce tasks (ms)=250564

                Total vcore-seconds taken by all map tasks=5255

                Total vcore-seconds taken by all reduce tasks=250564

                Total megabyte-seconds taken by all map tasks=21524480

                Total megabyte-seconds taken by all reduce tasks=2052620288

        Map-Reduce Framework

                Map input records=596

                Map output records=4768

                Map output bytes=72441

                Map output materialized bytes=6629

                Input split bytes=103

                Combine input records=4768

                Combine output records=693

                Spilled Records=693

                Failed Shuffles=0

                Merged Map outputs=0

                GC time elapsed (ms)=0

                CPU time spent (ms)=1100

                Physical memory (bytes) snapshot=1463422976

                Virtual memory (bytes) snapshot=4292116480

                Total committed heap usage (bytes)=2058354688

        File Input Format Counters

                Bytes Read=59877
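
If a manual check of the shuffle endpoint itself would help narrow this down, we can also probe the NodeManager shuffle port directly; a sketch (13562 is the default mapreduce.shuffle.port, and the hostname below is a placeholder for one of our NodeManagers):

# With encrypted shuffle enabled, the ShuffleHandler port should complete a TLS handshake.
openssl s_client -connect nodemanager.example.com:13562 </dev/null

# For comparison, the NodeManager HTTPS web UI (default port 8044) should also answer TLS.
openssl s_client -connect nodemanager.example.com:8044 </dev/null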
