New Contributor
Posts: 1
Registered: ‎10-09-2013

Stale CLOSE_WAIT Mapreduce tcp sockets

Hello,

 

Whenever a MapReduce task creates tmpjar files, it randomly fails with an error like the one below:

 

java.io.IOException: Got error for OP_READ_BLOCK, self=/?.?.?.12:46600, remote=/?.?.?.13:50010, for file /user-snip-/blah-4.2.1.jar, for pool BP-638361846-?.?.?.11-1374812160397 block -5619127225927466355_25894291
        at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:429)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:394)
        at org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:137)
        at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:1103)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:538)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:750)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:794)
        at java.io.DataInputStream.read(DataInputStream.java:83)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:78)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:260)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:232)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:183)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2031)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2000)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1976)
        at org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(TrackerDistributedCacheManager.java:445)
        at org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCacheObjects(JobLocalizer.java:334)
        at org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCache(JobLocalizer.java:352)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:397)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:376)
        at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:231)
        at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1420)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1395)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1310)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2728)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2692)

 

The problem is that the socket created for the failed task never gets closed. It stays in the CLOSE_WAIT state, and these sockets accumulate until they eventually exhaust all available file descriptors.
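
For readers unfamiliar with the state: a TCP socket sits in CLOSE_WAIT when the remote peer has already closed its side but the local process has not yet called close() on its own socket. The sketch below reproduces that situation on the loopback interface and shows the usual try/finally pattern that releases the descriptor. It is only an illustration of the general mechanism; the class and method names (CloseWaitDemo, openToClosedPeer, readThenClose) are made up for this example, and it does not claim to show where the leak actually occurs in the Hadoop code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class; not part of Hadoop or CDH.
class CloseWaitDemo {

    // Opens a client socket to a local server that closes its side right away.
    // The returned socket is still open locally, so the OS reports it as
    // CLOSE_WAIT until the caller closes it.
    static Socket openToClosedPeer() {
        try (ServerSocket server =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
            try {
                server.accept().close(); // the peer closes first and sends FIN
                return client;
            } catch (IOException e) {
                client.close();          // do not leak the client on failure
                throw e;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads one byte and always closes the socket, even if read() throws.
    // Returns -1 when the peer has already closed the connection.
    static int readThenClose(Socket socket) {
        try {
            try {
                return socket.getInputStream().read();
            } finally {
                socket.close();          // this is what ends the CLOSE_WAIT state
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the close() in the finally block is skipped on an error path, the descriptor stays allocated for the lifetime of the process, which would match the symptom described above.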

 

So far, the only workaround we have found is to restart the TaskTracker process periodically.
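
To judge when a restart is due, it may help to count the stale sockets instead of restarting on a fixed schedule. The following is a rough sketch, assuming a Linux host where /proc/net/tcp is readable; CloseWaitCounter and countCloseWait are hypothetical names for this example. It counts sockets in state 08 (CLOSE_WAIT) whose remote port is 50010, the remote port shown in the error above. Note that it counts system-wide rather than per process, and that a JVM using IPv6 sockets may list its connections in /proc/net/tcp6 instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper; not part of Hadoop or CDH.
class CloseWaitCounter {

    // Linux encodes the TCP state as a two-digit hex value; 08 is CLOSE_WAIT.
    private static final String CLOSE_WAIT = "08";

    // Counts entries in /proc/net/tcp-style lines that are in CLOSE_WAIT
    // and whose remote port equals remotePort.
    static int countCloseWait(List<String> procNetTcpLines, int remotePort) {
        int count = 0;
        for (String line : procNetTcpLines) {
            String[] fields = line.trim().split("\\s+");
            // Data rows start with "N:"; this skips the header and blank lines.
            if (fields.length < 4 || !fields[0].endsWith(":")) {
                continue;
            }
            String remote = fields[2];          // hex address ":" hex port
            int colon = remote.lastIndexOf(':');
            if (colon < 0) {
                continue;
            }
            int port;
            try {
                port = Integer.parseInt(remote.substring(colon + 1), 16);
            } catch (NumberFormatException e) {
                continue;                        // ignore malformed rows
            }
            if (port == remotePort && CLOSE_WAIT.equals(fields[3])) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines =
            Files.readAllLines(Paths.get("/proc/net/tcp"), StandardCharsets.UTF_8);
        System.out.println("CLOSE_WAIT sockets to port 50010: "
            + countCloseWait(lines, 50010));
    }
}
```

Running this from cron and comparing the count against the process's open-file limit would give an early warning before the descriptors run out.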

 

Does anyone know of a solution for this?

 

We are using CDH 4.3.1-1.cdh4.3.1.p0.110.

 

Thanks in advance.

Posts: 416
Topics: 51
Kudos: 89
Solutions: 49
Registered: ‎06-26-2013

Re: Stale CLOSE_WAIT Mapreduce tcp sockets

@Hungry I have emailed our MapReduce SME team requesting assistance for you on this thread; somebody should be replying soon.

 

Regards,