Created 04-25-2021 01:12 AM
Recently, MapReduce job sometimes failed, the details as below:
after check map tasks , the log like below:
Log Type: syslog
Log Upload Time: Sun Apr 25 13:54:17 +0800 2021
Log Length: 5507
2021-04-25 13:51:01,806 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2021-04-25 13:51:01,893 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 2021-04-25 13:51:01,893 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started 2021-04-25 13:51:01,895 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens: 2021-04-25 13:51:01,895 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1618548626214_99981, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@732c2a62) 2021-04-25 13:51:02,182 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now. 2021-04-25 13:51:06,267 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-3/10.39.58.16:1079. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2021-04-25 13:51:10,268 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-3/10.39.58.16:1079. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2021-04-25 13:51:14,268 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-3/10.39.58.16:1079. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2021-04-25 13:51:18,268 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-3/10.39.58.16:1079. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2021-04-25 13:51:22,269 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-3/10.39.58.16:1079. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2021-04-25 13:51:26,270 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-3/10.39.58.16:1079. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2021-04-25 13:51:30,271 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-3/10.39.58.16:1079. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2021-04-25 13:51:34,272 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-3/10.39.58.16:1079. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2021-04-25 13:51:38,272 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-3/10.39.58.16:1079. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2021-04-25 13:51:42,272 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-3/10.39.58.16:1079. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2021-04-25 13:51:45,274 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From dataware-14/10.39.58.19 to dataware-3:1079 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731) at org.apache.hadoop.ipc.Client.call(Client.java:1508) at org.apache.hadoop.ipc.Client.call(Client.java:1441) at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:246) at com.sun.proxy.$Proxy9.getTask(Unknown Source) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:132) Caused by: java.net.ConnectException: Connection timed out at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:648) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:744) at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557) at org.apache.hadoop.ipc.Client.call(Client.java:1480) ... 4 more 2021-04-25 13:51:45,275 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system... 2021-04-25 13:51:45,275 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped. 2021-04-25 13:51:45,275 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.
from the above log, we can see the task connection timeout with App master, but this error happend randomly, who can give me some advises on this error.
thanks.
Created 04-28-2021 01:30 AM
this issue has been solved right now. and the investigation road likes below:
when i got this issue from development team, these peoples told me some tasks will be failed, and asked me how to solve it.
then i open Yarn web ui to check what's exact errors of this issue, and found the connection time out. this is the first vision i have got.
so i was considering why the port can't connect ? maybe there is a firewall ? or maybe one machine got some problem, when task assigned to this machine, then this issue happended?
these all are my assumption, and after two days checked, the answer is no. since no firewall, and this issue happended randomly on every machine.
just yesterday night, i found if the connection port is near 1000, then the job failed and connection timeout, but if the port is near 30000+, there are no any issue happend.
so i am going to check the sysctl.conf, i found the setting for port range is "net.ipv4.ip_local_port_range = 1024 65000",
at last i set the port range between "32678. 655000", this issue has been solved.
Created 04-25-2021 06:39 AM
Hello @iamfromsky
Upon checking the shared logs I can see that one of your Nodemanager(dataware-14/10.39.58.19) is trying to connect a Nodemanager dataware-3 on port 1079 and it got failed with Connection timed out. You may need to check Nodemanager log of dataware-14, dataware-3 and active resource manager log around 2021-04-25 13:51:45 to find more insight.
The connection timed out could happen because of a couple of reasons and it is listed in https://cwiki.apache.org/confluence/display/HADOOP2/ConnectionRefused. It is highly recommended to cross-check all the steps mentioned in the link.
There is a high chance of network issues on these nodes. Please check with Linux admin team as well.
Created 04-25-2021 08:55 AM
you are right, the connection is between two Nodemanager, and i assume dataware-3 :1079 is app master, another one is a task. that's why i said the connection timeout is from task to appmaster.
since this kind error just happend randomly, and 1 time per hour, so it's really hard for me to find out root cause.
Created 04-26-2021 07:43 PM
Generally a container which ends with 000001(example:container_XXX_XXXXXXXXXXXXX_XXXXXX_XX_000001) is the Application Master.
Just cross-check whether you are checking the same to confirm it is an AM.
Created 04-27-2021 03:01 PM
Log Type: syslog
Log Upload Time: Wed Apr 28 03:31:04 +0800 2021
Log Length: 132219
2021-04-28 03:27:36,319 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1618548626214_128739_000001
2021-04-28 03:27:36,530 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Executing with tokens:
2021-04-28 03:27:36,530 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: YARN_AM_RM_TOKEN, Service: , Ident: (org.apache.hadoop.yarn.security.AMRMTokenIdentifier@5b218417)
2021-04-28 03:27:36,706 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: OutputCommitter set in config org.apache.hadoop.hive.ql.io.HiveFileFormatUtils$NullOutputCommitter
2021-04-28 03:27:36,708 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: OutputCommitter is org.apache.hadoop.hive.ql.io.HiveFileFormatUtils$NullOutputCommitter
2021-04-28 03:27:37,232 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-04-28 03:27:37,378 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.jobhistory.EventType for class org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler
2021-04-28 03:27:37,379 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobEventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher
2021-04-28 03:27:37,380 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.TaskEventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskEventDispatcher
2021-04-28 03:27:37,381 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.TaskAttemptEventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher
2021-04-28 03:27:37,381 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventType for class org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler
2021-04-28 03:27:37,385 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.speculate.Speculator$EventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$SpeculatorEventDispatcher
2021-04-28 03:27:37,386 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.rm.ContainerAllocator$EventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter
2021-04-28 03:27:37,386 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncher$EventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerLauncherRouter
2021-04-28 03:27:37,430 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://nameservice1:8020]
2021-04-28 03:27:37,449 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://nameservice1:8020]
2021-04-28 03:27:37,469 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://nameservice1:8020]
2021-04-28 03:27:37,481 INFO [main] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Emitting job history data to the timeline server is not enabled
2021-04-28 03:27:37,513 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobFinishEvent$Type for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler
2021-04-28 03:27:37,673 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2021-04-28 03:27:37,724 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2021-04-28 03:27:37,725 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MRAppMaster metrics system started
2021-04-28 03:27:37,735 INFO [main] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Adding job token for job_1618548626214_128739 to jobTokenSecretManager
2021-04-28 03:27:37,855 INFO [main] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Not uberizing job_1618548626214_128739 because: not enabled; too much RAM;
2021-04-28 03:27:37,877 INFO [main] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Input size for job job_1618548626214_128739 = 23256534. Number of splits = 7
2021-04-28 03:27:37,877 INFO [main] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Number of reduces for job job_1618548626214_128739 = 0
2021-04-28 03:27:37,877 INFO [main] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1618548626214_128739Job Transitioned from NEW to INITED
2021-04-28 03:27:37,878 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster launching normal, non-uberized, multi-container job job_1618548626214_128739.
2021-04-28 03:27:37,905 INFO [main] org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 100
2021-04-28 03:27:37,914 INFO [Socket Reader #1 for port 9115] org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 9115
2021-04-28 03:27:37,951 INFO [main] org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB to the server
2021-04-28 03:27:37,952 INFO [IPC Server Responder] org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2021-04-28 03:27:37,952 INFO [IPC Server listener on 9115] org.apache.hadoop.ipc.Server: IPC Server listener on 9115: starting
2021-04-28 03:27:37,953 INFO [main] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Instantiated MRClientService at dataware-14/10.39.58.19:9115
2021-04-28 03:27:38,009 INFO [main] org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2021-04-28 03:27:38,015 INFO [main] org.apache.hadoop.security.authentication.server.AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random secrets.
2021-04-28 03:27:38,019 INFO [main] org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.mapreduce is not defined
2021-04-28 03:27:38,027 INFO [main] org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2021-04-28 03:27:38,072 INFO [main] org.apache.hadoop.http.HttpServer2: Added filter AM_PROXY_FILTER (class=org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter) to context mapreduce
2021-04-28 03:27:38,074 INFO [main] org.apache.hadoop.http.HttpServer2: Added filter AM_PROXY_FILTER (class=org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter) to context static
2021-04-28 03:27:38,077 INFO [main] org.apache.hadoop.http.HttpServer2: adding path spec: /mapreduce/*
2021-04-28 03:27:38,077 INFO [main] org.apache.hadoop.http.HttpServer2: adding path spec: /ws/*
2021-04-28 03:27:38,086 INFO [main] org.apache.hadoop.http.HttpServer2: Jetty bound to port 41305
2021-04-28 03:27:38,086 INFO [main] org.mortbay.log: jetty-6.1.26.cloudera.4
2021-04-28 03:27:38,120 INFO [main] org.mortbay.log: Extract jar:file:/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/jars/hadoop-yarn-common-2.6.0-cdh5.13.3.jar!/webapps/mapreduce to /tmp/Jetty_0_0_0_0_41305_mapreduce____2p8bem/webapp
2021-04-28 03:27:38,436 INFO [main] org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:41305
2021-04-28 03:27:38,437 INFO [main] org.apache.hadoop.yarn.webapp.WebApps: Web app /mapreduce started at 41305
2021-04-28 03:27:38,742 INFO [main] org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
2021-04-28 03:27:38,745 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator: JOB_CREATE job_1618548626214_128739
2021-04-28 03:27:38,748 INFO [main] org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 3000
2021-04-28 03:27:38,749 INFO [Socket Reader #1 for port 1983] org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 1983
2021-04-28 03:27:38,753 INFO [IPC Server Responder] org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2021-04-28 03:27:38,753 INFO [IPC Server listener on 1983] org.apache.hadoop.ipc.Server: IPC Server listener on 1983: starting
2021-04-28 03:27:38,775 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: nodeBlacklistingEnabled:true
2021-04-28 03:27:38,775 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: maxTaskFailuresPerNode is 3
2021-04-28 03:27:38,775 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: blacklistDisablePercent is 33
2021-04-28 03:27:38,848 INFO [main] org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm237
2021-04-28 03:27:38,877 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: maxContainerCapability: <memory:24576, vCores:14>
2021-04-28 03:27:38,877 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: queue: root.etl_core
2021-04-28 03:27:38,881 INFO [main] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
2021-04-28 03:27:38,881 INFO [main] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: The thread pool initial size is 10
2021-04-28 03:27:38,889 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1618548626214_128739Job Transitioned from INITED to SETUP
2021-04-28 03:27:38,893 INFO [CommitterEvent Processor #0] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_SETUP
2021-04-28 03:27:38,895 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1618548626214_128739Job Transitioned from SETUP to RUNNING
2021-04-28 03:27:38,970 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1618548626214_128739_m_000000 Task Transitioned from NEW to SCHEDULED
2021-04-28 03:27:38,988 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Event Writer setup for JobId: job_1618548626214_128739, File: hdfs://nameservice1:8020/user/hive/.staging/job_1618548626214_128739/job_1618548626214_128739_1.jhist
2021-04-28 03:27:38,994 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1618548626214_128739_m_000001 Task Transitioned from NEW to SCHEDULED
2021-04-28 03:27:39,013 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1618548626214_128739_m_000002 Task Transitioned from NEW to SCHEDULED
2021-04-28 03:27:39,032 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1618548626214_128739_m_000003 Task Transitioned from NEW to SCHEDULED
2021-04-28 03:27:39,049 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1618548626214_128739_m_000004 Task Transitioned from NEW to SCHEDULED
2021-04-28 03:27:39,077 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1618548626214_128739_m_000005 Task Transitioned from NEW to SCHEDULED
2021-04-28 03:27:39,095 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1618548626214_128739_m_000006 Task Transitioned from NEW to SCHEDULED
2021-04-28 03:27:39,097 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1618548626214_128739_m_000000_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2021-04-28 03:27:39,097 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1618548626214_128739_m_000001_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2021-04-28 03:27:39,098 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1618548626214_128739_m_000002_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2021-04-28 03:27:39,098 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1618548626214_128739_m_000003_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2021-04-28 03:27:39,098 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1618548626214_128739_m_000004_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2021-04-28 03:27:39,098 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1618548626214_128739_m_000005_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2021-04-28 03:27:39,098 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1618548626214_128739_m_000006_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2021-04-28 03:27:39,099 INFO [Thread-53] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: mapResourceRequest:<memory:6144, vCores:1>
please help me check the port:1983, everytime when the job failed, retry connection port is 1983, after several times then job failed since connection timeout.
Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-28 03:27:59,247 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-14/10.39.58.19:1983. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-28 03:28:03,247 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-14/10.39.58.19:1983. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-28 03:28:07,248 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-14/10.39.58.19:1983. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-28 03:28:11,249 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-14/10.39.58.19:1983. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-28 03:28:15,250 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-14/10.39.58.19:1983. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-28 03:28:19,251 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-14/10.39.58.19:1983. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-28 03:28:23,253 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: dataware-14/10.39.58.19:1983. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2021-04-28 03:28:26,258 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From dataware-17/10.39.58.15 to dataware-14:1983 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
at org.apache.hadoop.ipc.Client.call(Client.java:1508)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:246)
at com.sun.proxy.$Proxy9.getTask(Unknown Source)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:132)
Caused by: java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:648)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:744)
at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
... 4 more
Created 04-27-2021 03:02 PM
the port: 1983 is application master's port or not ? i am not sure about that
Created 04-28-2021 01:30 AM
this issue has been solved right now. and the investigation road likes below:
when i got this issue from development team, these peoples told me some tasks will be failed, and asked me how to solve it.
then i open Yarn web ui to check what's exact errors of this issue, and found the connection time out. this is the first vision i have got.
so i was considering why the port can't connect ? maybe there is a firewall ? or maybe one machine got some problem, when task assigned to this machine, then this issue happended?
these all are my assumption, and after two days checked, the answer is no. since no firewall, and this issue happended randomly on every machine.
just yesterday night, i found if the connection port is near 1000, then the job failed and connection timeout, but if the port is near 30000+, there are no any issue happend.
so i am going to check the sysctl.conf, i found the setting for port range is "net.ipv4.ip_local_port_range = 1024 65000",
at last i set the port range between "32678. 655000", this issue has been solved.
Created 04-30-2021 05:28 AM
@iamfromsky Thanks for sharing the details.
Created 05-05-2021 07:52 PM
in my pervious experience, i haven't ever set the port range, he default port is 32768 ---65536. so my only question is why 1000~ port can't be connected ? could you give me some information?