hive testbench shuffle error when running tpch setup

Explorer

Getting the below error:

Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#4
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out

Explorer

./tpch-setup.sh 5 /hive-data-dir-benchmark
TPC-H text data generation complete.
Loading text data into external tables.
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
Optimizing table part (1/8).

^CCommand failed, try 'export DEBUG_SCRIPT=ON' and re-running
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench :
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench :
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench :
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench : export DEBUG_SCRIPT=ON
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench : ./tpch-setup.sh 5 /hive-data-dir-benchmark
+ '[' X5 = X ']'
+ '[' X/hive-data-dir-benchmark = X ']'
+ '[' 5 -eq 1 ']'
+ hdfs dfs -mkdir -p /hive-data-dir-benchmark
+ hdfs dfs -ls /hive-data-dir-benchmark/5/lineitem
+ '[' 0 -ne 0 ']'
+ hdfs dfs -ls /hive-data-dir-benchmark/5/lineitem
+ '[' 0 -ne 0 ']'
+ echo 'TPC-H text data generation complete.'
TPC-H text data generation complete.
+ echo 'Loading text data into external tables.'
Loading text data into external tables.
+ runcommand 'hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/alltables.sql -d DB=tpch_text_5 -d LOCATION=/hive-data-dir-benchmark/5'
+ '[' XON '!=' X ']'
+ hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/alltables.sql -d DB=tpch_text_5 -d LOCATION=/hive-data-dir-benchmark/5

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/jars/hive-common-1.1.0-cdh5.15.2.jar!/hive-log4j.properties
OK
Time taken: 2.198 seconds
OK
Time taken: 0.013 seconds
OK
Time taken: 0.578 seconds
OK
Time taken: 1.183 seconds
OK
Time taken: 0.814 seconds
OK
Time taken: 0.494 seconds
OK
Time taken: 0.504 seconds
OK
Time taken: 0.493 seconds
OK
Time taken: 0.506 seconds
OK
Time taken: 0.495 seconds
OK
Time taken: 0.502 seconds
OK
Time taken: 0.494 seconds
OK
Time taken: 0.503 seconds
OK
Time taken: 0.496 seconds
OK
Time taken: 0.505 seconds
OK
Time taken: 0.495 seconds
OK
Time taken: 0.503 seconds
OK
Time taken: 0.495 seconds
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
+ i=1
+ total=8
+ test 5 -le 1000
+ SCHEMA_TYPE=flat
+ DATABASE=tpch_flat_orc_5
+ MAX_REDUCERS=2600
++ test 5 -gt 2600
++ echo 5
+ REDUCERS=5
+ for t in '${TABLES}'
+ echo 'Optimizing table part (1/8).'
Optimizing table part (1/8).
+ COMMAND='hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/part.sql -d DB=tpch_flat_orc_5 -d SOURCE=tpch_text_5 -d BUCKETS=13 -d SCALE=5 -d REDUCERS=5 -d FILE=orc'
+ runcommand 'hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/part.sql -d DB=tpch_flat_orc_5 -d SOURCE=tpch_text_5 -d BUCKETS=13 -d SCALE=5 -d REDUCERS=5 -d FILE=orc'
+ '[' XON '!=' X ']'
+ hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/part.sql -d DB=tpch_flat_orc_5 -d SOURCE=tpch_text_5 -d BUCKETS=13 -d SCALE=5 -d REDUCERS=5 -d FILE=orc

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/jars/hive-common-1.1.0-cdh5.15.2.jar!/hive-log4j.properties
OK
Time taken: 2.152 seconds
OK
Time taken: 0.017 seconds
OK
Time taken: 0.051 seconds
Query ID = c095784_20200816094848_3f33b234-3f7f-4d1b-b862-2624c0bb43cd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
^C+ '[' 130 -ne 0 ']'
+ echo 'Command failed, try '\''export DEBUG_SCRIPT=ON'\'' and re-running'
Command failed, try 'export DEBUG_SCRIPT=ON' and re-running
+ exit 1
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench :
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench :
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench :
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench :
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench :
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench : klist
Ticket cache: FILE:/tmp/krb5cc_895784
Default principal: neha@EXELONDS.COM

Valid starting Expires Service principal
08/16/20 08:41:04 08/16/20 18:41:04 krbtgt/EXELONDS.COM@EXELONDS.COM
08/16/20 08:41:04 08/16/20 18:41:04 BDAL1CCC1N06$@EXELONDS.COM
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench : export DEBUG_SCRIPT=ON
sandbox.cluster.com /home/c095784/benchmarking/hive-testbench : ./tpch-setup.sh 5 /hive-data-dir-benchmark
+ '[' X5 = X ']'
+ '[' X/hive-data-dir-benchmark = X ']'
+ '[' 5 -eq 1 ']'
+ hdfs dfs -mkdir -p /hive-data-dir-benchmark
+ hdfs dfs -ls /hive-data-dir-benchmark/5/lineitem
+ '[' 0 -ne 0 ']'
+ hdfs dfs -ls /hive-data-dir-benchmark/5/lineitem
+ '[' 0 -ne 0 ']'
+ echo 'TPC-H text data generation complete.'
TPC-H text data generation complete.
+ echo 'Loading text data into external tables.'
Loading text data into external tables.
+ runcommand 'hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/alltables.sql -d DB=tpch_text_5 -d LOCATION=/hive-data-dir-benchmark/5'
+ '[' XON '!=' X ']'
+ hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/alltables.sql -d DB=tpch_text_5 -d LOCATION=/hive-data-dir-benchmark/5

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/jars/hive-common-1.1.0-cdh5.15.2.jar!/hive-log4j.properties
OK
Time taken: 2.225 seconds
OK
Time taken: 0.018 seconds
OK
Time taken: 0.802 seconds
OK
Time taken: 0.991 seconds
OK
Time taken: 0.506 seconds
OK
Time taken: 0.494 seconds
OK
Time taken: 0.504 seconds
OK
Time taken: 0.493 seconds
OK
Time taken: 0.505 seconds
OK
Time taken: 0.495 seconds
OK
Time taken: 0.502 seconds
OK
Time taken: 0.496 seconds
OK
Time taken: 0.503 seconds
OK
Time taken: 0.495 seconds
OK
Time taken: 0.502 seconds
OK
Time taken: 0.496 seconds
OK
Time taken: 0.503 seconds
OK
Time taken: 0.497 seconds
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
+ i=1
+ total=8
+ test 5 -le 1000
+ SCHEMA_TYPE=flat
+ DATABASE=tpch_flat_orc_5
+ MAX_REDUCERS=2600
++ test 5 -gt 2600
++ echo 5
+ REDUCERS=5
+ for t in '${TABLES}'
+ echo 'Optimizing table part (1/8).'
Optimizing table part (1/8).
+ COMMAND='hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/part.sql -d DB=tpch_flat_orc_5 -d SOURCE=tpch_text_5 -d BUCKETS=13 -d SCALE=5 -d REDUCERS=5 -d FILE=orc'
+ runcommand 'hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/part.sql -d DB=tpch_flat_orc_5 -d SOURCE=tpch_text_5 -d BUCKETS=13 -d SCALE=5 -d REDUCERS=5 -d FILE=orc'
+ '[' XON '!=' X ']'
+ hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/part.sql -d DB=tpch_flat_orc_5 -d SOURCE=tpch_text_5 -d BUCKETS=13 -d SCALE=5 -d REDUCERS=5 -d FILE=orc

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/jars/hive-common-1.1.0-cdh5.15.2.jar!/hive-log4j.properties
OK
Time taken: 2.126 seconds
OK
Time taken: 0.015 seconds
OK
Time taken: 0.049 seconds
Query ID = c095784_20200816115353_616b8c96-f2da-4ea7-94a3-2d1501c02691
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1597596829164_0002, Tracking URL = http://sandbox.cluster.com:8088/proxy/application_1597596829164_0002/
Kill Command = /opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib/hadoop/bin/hadoop job -kill job_1597596829164_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-08-16 11:54:44,433 Stage-1 map = 0%, reduce = 0%
2020-08-16 11:54:54,917 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.9 sec
2020-08-16 11:55:13,757 Stage-1 map = 0%, reduce = 0%
2020-08-16 11:55:18,933 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.41 sec
2020-08-16 11:55:19,971 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.41 sec
MapReduce Total cumulative CPU time: 3 seconds 410 msec
Ended Job = job_1597596829164_0002 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1597596829164_0002_m_000000 (and more) from job job_1597596829164_0002

Task with the most failures(4):
-----
Task ID:
task_1597596829164_0002_r_000000

URL:
http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1597596829164_0002&tipid=task_1597596829164_0002_r_000...
-----
Diagnostic Messages for this Task:
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#4
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:392)
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:307)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:366)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.41 sec HDFS Read: 5155 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 3 seconds 410 msec
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
+ '[' 2 -ne 0 ']'
+ echo 'Command failed, try '\''export DEBUG_SCRIPT=ON'\'' and re-running'
Command failed, try 'export DEBUG_SCRIPT=ON' and re-running
+ exit 1

 

Cloudera Employee

The issue could be caused by corrupt permissions in the NodeManager (NM) local-dir (usercache), which make the shuffle fetch fail. Shuffle fetch failures can also occur for the following reasons:

1. A disk failure causes the fetch failure.

2. A network issue causes the fetch failure.
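Before removing anything, it may help to look at who owns the entries under usercache. The following is a minimal, read-only sketch; the function name `report_usercache_owners` is hypothetical, and the rule it applies (each `usercache/<user>` directory is owned by `<user>`) is an assumption that typically holds on secure clusters but may not match your container executor setup.

```shell
#!/usr/bin/env bash
# Sketch: list each entry under <local-dir>/usercache with its owner.
# Assumption: usercache/<user> is normally owned by <user>; adjust the
# comparison if your container executor behaves differently.

report_usercache_owners() {
  local local_dir="$1"
  local entry name owner
  for entry in "$local_dir"/usercache/*; do
    # Skip non-directories, including the unexpanded glob when nothing matches.
    [ -d "$entry" ] || continue
    name="$(basename "$entry")"
    # Third field of `ls -ld` is the owning user (or a numeric uid if unnamed).
    owner="$(ls -ld "$entry" | awk '{print $3}')"
    if [ "$owner" = "$name" ]; then
      echo "ok       $name owner=$owner"
    else
      echo "MISMATCH $name owner=$owner"
    fi
  done
}
```

For example, `report_usercache_owners /JBOD_D1/hadoop/cdh/yarn/nm` (the disk number is only an illustration) prints one line per user directory; a MISMATCH line is a hint worth investigating, not proof of the cause.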

 

This can be overcome by carrying out the steps below on each node where a NodeManager is installed.

 

* Stop the NM instance on the node.
* Remove the contents of each YARN NM local-dir path:
- sudo rm -rf /JBOD_D${i}/hadoop/cdh/yarn/nm/* where ${i} iterates over the disks that each serve as NM local-dir storage
- the idea is to leave /JBOD_D${i}/hadoop/cdh/yarn/nm in place on each disk, but remove everything below that point.
* Restart the NM instance; it will recreate the structure with the appropriate permissions.
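The removal step above can be sketched as a small helper that empties each local-dir while leaving the directory itself in place. This is only a sketch: the function name `clean_nm_local_dirs` and the `DRY_RUN` variable are hypothetical, the helper defaults to a dry run that deletes nothing, and it assumes the NM on the node has already been stopped.

```shell
#!/usr/bin/env bash
# Sketch: empty each NodeManager local-dir but keep the directory itself.
# Run only after the NodeManager on this node has been stopped.
# DRY_RUN=1 (the default here) only prints what would be emptied.

clean_nm_local_dirs() {
  local dry_run="${DRY_RUN:-1}"
  local dir
  for dir in "$@"; do
    # Refuse obviously wrong targets.
    if [ "$dir" = "/" ] || [ ! -d "$dir" ]; then
      echo "skip (not a usable directory): $dir" >&2
      continue
    fi
    if [ "$dry_run" = "1" ]; then
      echo "would empty: $dir"
    else
      # -mindepth 1 keeps $dir itself and removes everything below it,
      # including hidden entries that a plain "$dir"/* glob would miss.
      find "$dir" -mindepth 1 -delete
      echo "emptied: $dir"
    fi
  done
}
```

A possible invocation, using the path layout from the reply, is `clean_nm_local_dirs /JBOD_D*/hadoop/cdh/yarn/nm` to preview, then the same command prefixed with `DRY_RUN=0` to actually delete. On a real node the delete would likely need to run as root, as the `sudo rm -rf` in the reply suggests.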

 

Once you have implemented the above solution and it continues to work well, please take a few minutes to click "Accept as Solution".