I've installed a new cluster using the managed installation of CDH 5 on 7 dedicated, purpose-built machines. The installation was successful and all health tests passed.
I've also verified that the network is fully operational: all ports can be reached and DNS responds to both forward and reverse lookups.
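For anyone wanting to repeat that DNS check, here is a minimal sketch of the forward/reverse round-trip verification (the hostname below is a placeholder; run it against every node's FQDN in the cluster):

```python
import socket

def dns_roundtrip(host):
    """Resolve host to an IP, then reverse-resolve that IP back to a name.
    Hadoop expects forward and reverse DNS to agree on every node."""
    ip = socket.gethostbyname(host)      # forward lookup
    name = socket.gethostbyaddr(ip)[0]   # reverse lookup
    return ip, name

# Example: check the local host entry; replace with each cluster node's FQDN.
ip, name = dns_roundtrip("localhost")
print(f"localhost -> {ip} -> {name}")
```

If the reverse name does not resolve back to the same address, that node's DNS (or /etc/hosts) entry needs fixing before chasing anything else.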
My problem arises when I try to run a Hadoop application from the command line. I know this must be a configuration error, as I have run the same JAR on an Amazon EMR machine many times without problems.
The problem is that Hadoop gets stuck when a certain step of the reduce phase is reached. No matter how many reduce tasks I configure for the job (from 1 to N), I can see in the ApplicationMaster that the running tasks are always in the same state:
28.13% RUNNING reduce > copy(27 of 32 at 1.30 MB/s)
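As a sanity check on that counter: the reduce progress bar conventionally allots the first third to the copy phase (the remaining thirds to sort and reduce), so 27 of 32 map outputs copied works out to roughly the 28.13% shown:

```python
# Reduce progress is reported as one bar over three sub-phases
# (copy, sort, reduce), each weighted 1/3 -- the standard MapReduce convention.
copied, total_maps = 27, 32
progress = (copied / total_maps) * (1 / 3) * 100
# progress is about 28.1%, consistent with the "28.13%" shown by the AM,
# i.e. the job really is parked inside the shuffle copy phase.
```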
The system seems to have no traffic at all, but if I let the job run indefinitely it does finish and the results are correct, though in roughly ten times the usual running time. In addition, several errors are raised:
Shuffle error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
Googling around, I've seen that this seems to be a common problem in many Hadoop installations, but I've checked everything in the configuration without success. I would really appreciate it if anyone could point me in another direction. Please don't hesitate to request any additional information or logs.
Some additional information to give a fuller picture of the errors. When the job finishes (successfully), the system reports several errors. If we look the job up in the history server, the task record shows that some map and reduce tasks finished with errors.
The notes for the failed map tasks are:
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Too Many fetch failures.Failing the attempt
The notes for the failed reduce tasks are:
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#5
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:333)
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:255)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:308)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
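If it helps interpret that exception: as far as I can tell from reading the ShuffleSchedulerImpl source, the reducer declares itself unhealthy and bails out once its fetch failures cross a limit on the order of max(30, total_maps / 10). This is only a toy model from my reading of the code, not the exact health-check logic:

```python
def abort_failure_limit(total_maps):
    """Rough model of the shuffle bail-out threshold (my reading of
    ShuffleSchedulerImpl's abortFailureLimit; not the exact logic,
    which also weighs failure ratios and stalled-fetch time)."""
    return max(30, total_maps // 10)

# With only 32 map tasks the limit stays at the floor of 30, so a few
# unreachable map hosts retried over and over are enough to trip it.
```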
I've tried many runs but the outcome is always the same: there are always some failed tasks, and the job spends a great amount of time idle.
Running the job from the command line, we can observe that the completion percentage of the reduce tasks gets stuck at a certain number (normally between 60% and 80%); then all the exceptions are raised and the job finishes without errors a few seconds later (as I said, the program normally finishes in a few minutes).
For the record, this behaviour is the same for every Hadoop job and input dataset I've tried, including the bundled examples (wordcount). I've tried many workarounds from the internet, but nothing has solved the problem.
Thank you very much again.
Sorry I do not see how to edit a post from this window (Am I blind?)
Trying to isolate the error, I've decommissioned several nodes. I finally noticed that with nodes 3 and 4 offline, MapReduce works perfectly, although the network, NodeManager and DataNode configuration seems identical.
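For anyone else hunting bad nodes the same way, the manual decommission-and-retry loop can be sketched as a small search. Here `job_ok` is a placeholder for "run a test job with this set of active nodes and report success", and the node names are hypothetical:

```python
def isolate_bad_nodes(nodes, job_ok):
    """Find a small set of nodes whose exclusion makes the job pass.

    nodes  : list of node names (placeholders here).
    job_ok : callable taking the list of *active* nodes and returning
             True when a test job run on them succeeds.
    """
    active = list(nodes)
    excluded = []
    # Phase 1: exclude nodes one at a time until the job passes.
    while active and not job_ok(active):
        excluded.append(active.pop())
    # Phase 2: try re-adding each excluded node; keep out only the culprits.
    for node in list(excluded):
        if job_ok(active + [node]):
            active.append(node)
            excluded.remove(node)
    return excluded

# Simulated run: pretend node3 and node4 are the broken pair.
nodes = [f"node{i}" for i in range(1, 8)]
bad = {"node3", "node4"}
culprits = isolate_bad_nodes(nodes, lambda active: bad.isdisjoint(active))
```

Phase 2 is what matters when more than one node is broken: excluding either node alone still fails, so only the pair survives the re-add pass, which matches what I saw with nodes 3 and 4.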
That's because of the cache.
These posts might be helpful.
1) How the caches work with YARN :
2) How to solve the situation :