Stale CLOSE_WAIT Mapreduce tcp sockets

Hello,

 

When a MapReduce task creates tmpjar files, it randomly fails with an error like the one below:

 

java.io.IOException: Got error for OP_READ_BLOCK, self=/?.?.?.12:46600, remote=/?.?.?.13:50010, for file /user-snip-/blah-4.2.1.jar, for pool BP-638361846-?.?.?.11-1374812160397 block -5619127225927466355_25894291
        at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:429)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:394)
        at org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:137)
        at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:1103)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:538)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:750)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:794)
        at java.io.DataInputStream.read(DataInputStream.java:83)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:78)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:260)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:232)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:183)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2031)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2000)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1976)
        at org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(TrackerDistributedCacheManager.java:445)
        at org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCacheObjects(JobLocalizer.java:334)
        at org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCache(JobLocalizer.java:352)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:397)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:376)
        at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:231)
        at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1420)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1395)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1310)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2728)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2692)

 

The problem is that the socket created for the failed task never gets closed. It stays in the CLOSE_WAIT state, and these sockets eventually consume all the available file descriptors.
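
For anyone trying to confirm the same symptom, here is a rough sketch of how the stale sockets could be counted on a Linux node. It assumes the kernel's /proc/net/tcp table, where state code 08 means CLOSE_WAIT and ports are written in hexadecimal. The function name and the sample rows are made up for illustration (the addresses are not from our cluster); 50010 is the remote DataNode port seen in the error above.

```python
import os

CLOSE_WAIT = "08"  # state code Linux uses for CLOSE_WAIT in /proc/net/tcp


def count_close_wait(proc_net_tcp_text, remote_port):
    """Count sockets in CLOSE_WAIT whose remote port equals remote_port.

    proc_net_tcp_text is the full text of /proc/net/tcp (or /proc/net/tcp6).
    """
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[2] is "HEXADDR:HEXPORT" for the remote end, fields[3] is the state
        remote_hex_port = fields[2].rsplit(":", 1)[1]
        if fields[3] == CLOSE_WAIT and int(remote_hex_port, 16) == remote_port:
            count += 1
    return count


# Hypothetical sample: one CLOSE_WAIT (08) and one ESTABLISHED (01) socket,
# both talking to remote port 0xC35A == 50010.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0C00000A:B608 0D00000A:C35A 08 00000000:00000001 00:00000000 00000000   500        0 12345 1 0000000000000000 20 4 30 10 -1
   1: 0C00000A:B609 0D00000A:C35A 01 00000000:00000000 00:00000000 00000000   500        0 12346 1 0000000000000000 20 4 30 10 -1
"""

if __name__ == "__main__":
    path = "/proc/net/tcp"
    if os.path.exists(path):
        with open(path) as f:
            print(count_close_wait(f.read(), 50010))
```

If the JVM opens its sockets as IPv6, the entries may show up in /proc/net/tcp6 instead; the same function should work on that file since the column layout is the same.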

 

So far the only workaround we have found is to periodically restart the TaskTracker process.
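
To make that workaround less blind, something like the sketch below could tell when a restart is actually due. It assumes Linux, where each open descriptor of a process appears as an entry under /proc/<pid>/fd. The function names and the 80% threshold are arbitrary choices for illustration, not anything provided by Hadoop or CDH.

```python
import os


def open_fd_count(pid):
    """Return the number of open file descriptors for a process (Linux only)."""
    return len(os.listdir("/proc/%d/fd" % pid))


def should_restart(open_fds, fd_limit, threshold=0.8):
    """Return True once descriptor usage reaches the given fraction of the limit."""
    return open_fds >= threshold * fd_limit


if __name__ == "__main__":
    # Demonstrated on the current process; in practice the TaskTracker's pid
    # and its descriptor limit would be supplied instead.
    if os.path.isdir("/proc/self/fd"):
        used = open_fd_count(os.getpid())
        print(used, should_restart(used, 1024))
```

Reading /proc/<pid>/fd for another user's process normally requires running as that user or as root.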

 

Does anyone know of a solution for this?

 

We are using CDH 4.3.1-1.cdh4.3.1.p0.110.

 

Thanks in advance.

Re: Stale CLOSE_WAIT Mapreduce tcp sockets

@Hungry I have emailed our MapReduce SME team requesting assistance for you on this thread; somebody should be replying soon.

 

Regards,