Hi,
I am trying to run TPCx-HS2, which is basically TeraSort, to test my Hadoop/YARN cluster. The generation and validation phases work fine. The sort itself also runs to completion, but at the very end the client crashes because the MR JobHistory server doesn't know the job ID. I double-checked the configuration: the history server is available, and the generation/validation jobs that run before and after the sort do show up there. The only difference I can see is that generation/validation is of course a lot faster than sorting, but I don't see why that would lead to the job ID being unknown.
You can see my log below. Any help is much appreciated...
2020-03-31 13:07:20,109 INFO mapreduce.Job: map 100% reduce 97%
2020-03-31 13:10:00,277 INFO mapreduce.Job: map 100% reduce 98%
2020-03-31 13:12:05,179 INFO mapreduce.Job: map 100% reduce 99%
2020-03-31 13:14:27,607 INFO mapreduce.Job: map 100% reduce 100%
2020-03-31 13:14:40,956 INFO mapreduce.Job: Job job_1585647217951_0003 completed successfully
2020-03-31 13:14:41,256 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=
2020-03-31 13:14:41,674 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=
2020-03-31 13:14:41,790 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=
Exception in thread "main" java.io.IOException: java.io.IOException: Unknown Job job_1585647217951_0003
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.verifyAndGetJob(HistoryC
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getCounters(HistoryClien
at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getCounters(MRClientPr
at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProt
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1000)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2916)
Created 03-31-2020 05:23 AM
Oh, in addition, just to show that the jobs numbered just before and after _0003 do show up in the log directory of the history server:
drwxrwxrwt - hadoop hadoop 0 2020-03-31 11:33 /user/history
drwxrwx--- - hadoop hadoop 0 2020-03-31 11:54 /user/history/done
drwxrwx--- - hadoop hadoop 0 2020-03-31 11:54 /user/history/done/2020
drwxrwx--- - hadoop hadoop 0 2020-03-31 11:54 /user/history/done/2020/03
drwxrwx--- - hadoop hadoop 0 2020-03-31 11:54 /user/history/done/2020/03/31
drwxrwx--- - hadoop hadoop 0 2020-03-31 13:44 /user/history/done/2020/03/31/000000
-rwxrwx--- 1 root hadoop 68050 2020-03-31 11:52 /user/history/done/2020/03/31/000000/job_1585647217951_0002-1585647944784-root-HSGen-1585648448385-15-0-SUCCEEDED-default-1585647977205.jhist
-rwxrwx--- 1 root hadoop 215999 2020-03-31 11:52 /user/history/done/2020/03/31/000000/job_1585647217951_0002_conf.xml
-rwxrwx--- 1 root hadoop 51476 2020-03-31 13:18 /user/history/done/2020/03/31/000000/job_1585647217951_0004-1585653295800-root-HSValidate-1585653602686-8-1-SUCCEEDED-default-1585653383672.jhist
-rwxrwx--- 1 root hadoop 216412 2020-03-31 13:18 /user/history/done/2020/03/31/000000/job_1585647217951_0004_conf.xml
-rwxrwx--- 1 root hadoop 67441 2020-03-31 13:36 /user/history/done/2020/03/31/000000/job_1585647217951_0005-1585654239304-root-HSGen-1585654641074-15-0-SUCCEEDED-default-1585654270486.jhist
-rwxrwx--- 1 root hadoop 215999 2020-03-31 13:36 /user/history/done/2020/03/31/000000/job_1585647217951_0005_conf.xml
drwxrwxrwt - hadoop hadoop 0 2020-03-31 11:45 /user/history/done_intermediate
drwxrwx--- - hadoop hadoop 0 2020-03-31 11:34 /user/history/done_intermediate/hadoop
drwxrwx--- - root hadoop 0 2020-03-31 13:44 /user/history/done_intermediate/root
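For anyone who wants to check a longer listing for the same kind of gap, here is a minimal sketch in Python. It assumes the listing text is pasted in as a string and that the .jhist file names follow the pattern seen above; the function name `missing_job_ids` is made up for illustration and is not part of Hadoop. It only reports gaps between the lowest and highest job number it sees.

```python
import re

# The three .jhist entries copied from the listing above.
LISTING = """\
-rwxrwx--- 1 root hadoop 68050 2020-03-31 11:52 /user/history/done/2020/03/31/000000/job_1585647217951_0002-1585647944784-root-HSGen-1585648448385-15-0-SUCCEEDED-default-1585647977205.jhist
-rwxrwx--- 1 root hadoop 51476 2020-03-31 13:18 /user/history/done/2020/03/31/000000/job_1585647217951_0004-1585653295800-root-HSValidate-1585653602686-8-1-SUCCEEDED-default-1585653383672.jhist
-rwxrwx--- 1 root hadoop 67441 2020-03-31 13:36 /user/history/done/2020/03/31/000000/job_1585647217951_0005-1585654239304-root-HSGen-1585654641074-15-0-SUCCEEDED-default-1585654270486.jhist
"""

# Matches ".../job_<cluster timestamp>_<sequence>-<anything>.jhist" at end of line.
JHIST = re.compile(r"/job_(\d+)_(\d+)-[^/]*\.jhist$")


def missing_job_ids(listing):
    """Return job IDs missing between the lowest and highest sequence number seen."""
    seen = {}  # cluster timestamp -> set of sequence numbers
    for line in listing.splitlines():
        match = JHIST.search(line.strip())
        if match:
            seen.setdefault(match.group(1), set()).add(int(match.group(2)))
    missing = []
    for cluster_ts, seqs in sorted(seen.items()):
        for seq in range(min(seqs), max(seqs) + 1):
            if seq not in seqs:
                missing.append("job_%s_%04d" % (cluster_ts, seq))
    return missing


print(missing_job_ids(LISTING))  # prints ['job_1585647217951_0003']
```

With the listing above, the only gap is the sort job, which matches the "Unknown Job" error on the client.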
Created 03-31-2020 09:50 AM
Are there any errors in the JHS logs, especially around the 2020-03-31 13:14:* timeframe?
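If the JHS log is large, a small filter can narrow it down to that window. The following is only a sketch: it assumes the JHS log lines start with the same `YYYY-MM-DD HH:MM:SS,mmm` timestamp as the client output in the first post, and the helper name `lines_near` is hypothetical. Since no JHS log has been posted, the sample input below is the client output from the question, not real JHS output.

```python
from datetime import datetime, timedelta


def lines_near(lines, center, window_seconds=60):
    """Return lines whose leading timestamp is within window_seconds of center.

    Lines without a parseable timestamp (e.g. stack-trace lines) are treated
    as continuations and kept only if the preceding timestamped line was kept.
    """
    window = timedelta(seconds=window_seconds)
    keep = False
    selected = []
    for line in lines:
        try:
            ts = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
            keep = abs(ts - center) <= window
        except ValueError:
            pass  # continuation line: inherit the previous decision
        if keep:
            selected.append(line)
    return selected


# Sample input: client log lines copied from the question (not JHS output).
sample = [
    "2020-03-31 13:12:05,179 INFO mapreduce.Job: map 100% reduce 99%",
    "2020-03-31 13:14:27,607 INFO mapreduce.Job: map 100% reduce 100%",
    "2020-03-31 13:14:40,956 INFO mapreduce.Job: Job job_1585647217951_0003 completed successfully",
    'Exception in thread "main" java.io.IOException: java.io.IOException: Unknown Job job_1585647217951_0003',
]

for line in lines_near(sample, datetime(2020, 3, 31, 13, 14, 40)):
    print(line)
```

To use it on the real log, pass the lines of the JHS log file instead of `sample`; the location of that file depends on the installation, so it is not guessed here.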