Support Questions

Find answers, ask questions, and share your expertise

Containers in datanode showing connection error

Contributor

We tried running teragen on a 5-node cluster, this time using Hadoop 2.7.1. The job is stuck at map 50%, reduce 0%.

When we viewed the logs for this job on a datanode, we saw this error:

ntainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:12,426 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:13,430 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:14,437 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:15,447 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:16,448 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:27,450 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:28,452 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:29,456 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:30,463 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:31,465 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:32,476 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:33,485 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:34,493 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:35,493 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-03 21:33:36,494 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:57252. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

The yarn-site.xml on all datanodes is as follows:

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->

 <property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
 </property>
 <property>
 <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
 <value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>hadoop-master:8030</value>
</property>
<property>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop-master:8032</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>hadoop-master:8088</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>hadoop-master:8031</value>
</property>
<property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>hadoop-master:8033</value>
</property>

</configuration>
2 REPLIES


@Kumar Sanyam

Your DataNode cannot connect to your NameNode.

So one of the following is likely:

  1. Your NameNode went down.
  2. A firewall is blocking connections from the DataNode to the NameNode.
  3. Your /etc/hosts is wrong. Remove any reference to 127.0.0.1 and make sure the NameNode's FQDN is there. Run netstat and confirm the NameNode is listening on the right port.
  4. The fs.default.name property in core-site.xml doesn't point to the correct hostname for the NameNode.
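For point 4, a minimal core-site.xml sketch is below. This assumes the NameNode runs on hadoop-master and uses the common default RPC port 9000; adjust the hostname and port to match your cluster:

```xml
<configuration>
  <property>
    <!-- Must point at the NameNode's hostname, not localhost/127.0.0.1.
         Port 9000 is a common choice, not necessarily yours. -->
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
</configuration>
```

This file must be identical on every node so that all DataNodes resolve the same NameNode address.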

avatar
New Contributor

You can try disabling your firewall. The commands below should help:

service iptables stop
service ip6tables stop
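Note that `service iptables stop` applies to older init-based distros (e.g. CentOS 6). On a systemd-based system (CentOS 7 and later), the rough equivalent would be:

```shell
# Stop the firewall for the current session (systemd-based distros)
systemctl stop firewalld
# Optionally prevent it from starting again on reboot
systemctl disable firewalld
```

Disabling the firewall is only a diagnostic step; once connectivity is confirmed, re-enable it and open just the Hadoop ports instead.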