Explorer
Posts: 11
Registered: ‎06-30-2014

1 TB TeraGen fails to run with the following error

Hi,

 

I have the following setup:

I have a virtual machine setup using KVM. Each VM is a node, and 16 such nodes form one cluster. The storage is shared among all 16 VMs. I use /hadoop/dfs/dn as the datanode directory on all the nodes. Each node has been allocated around 867 GB of storage, so df -h gives me the following:

 /dev/vdb          867G  222M  822G   1% /hadoop

 

 

Basically, with all 16 nodes combined, I have 822 GB * 16, i.e. roughly 13 TB, available.
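To double-check the capacity HDFS itself sees (rather than what df shows on each node), these two commands report the aggregate and per-datanode numbers, and the actual block placement for a path:

 # Aggregate and per-datanode capacity as reported by the namenode
 hdfs dfsadmin -report

 # Which datanodes actually hold the blocks of the TeraGen output
 hdfs fsck /hadoop/hdfs/teragen -files -blocks -locations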

 

I'm running the following test:

I'm running TeraGen from the bundled examples, with /hadoop/hdfs/teragen/ as the output directory for the generated data (command below).
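For reference, the invocation looks roughly like this. The examples jar path here is the usual CDH parcel location, so adjust it for your distribution; TeraGen takes the number of 100-byte rows to generate, so 10,000,000,000 rows comes out to 1 TB:

 # 10,000,000,000 rows x 100 bytes/row = 1 TB of output;
 # replication forced to 1 via -D (it could also come from hdfs-site.xml)
 hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
     teragen -Ddfs.replication=1 10000000000 /hadoop/hdfs/teragen/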

 

Everything is fine if I run TeraGen to generate 100 GB of data. But it fails for 1 TB: all datanodes start to fail after around 46% of the maps are done. (The replication factor is set to 1 in both cases.)

 

Here is the log:

-----------------

emImpl: Scheduled snapshot period at 10 second(s).
2014-11-03 20:21:56,362 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2014-11-03 20:21:56,373 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2014-11-03 20:21:56,373 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1415067515269_0001, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@189a6c56)
2014-11-03 20:21:56,451 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2014-11-03 20:21:56,724 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /hadoop/yarn/nm/usercache/hdfs/appcache/application_1415067515269_0001
2014-11-03 20:21:56,836 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2014-11-03 20:21:56,848 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2014-11-03 20:21:57,303 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2014-11-03 20:21:57,729 INFO [main] org.apache.hadoop.mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2014-11-03 20:21:57,956 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: org.apache.hadoop.examples.terasort.TeraGen$RangeInputFormat$RangeInputSplit@2179e0ae
2014-11-03 20:32:49,289 WARN [ResponseProcessor for block BP-829674508-10.1.10.100-1412725091201:blk_1073814799_74159] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-829674508-10.1.10.100-1412725091201:blk_1073814799_74159
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.1.10.112:36633 remote=/10.1.10.112:50010]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1986)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:796)
2014-11-03 20:32:49,291 INFO [main] org.apache.hadoop.mapred.MapTask: Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector@3bc7c477
java.io.IOException: All datanodes 10.1.10.112:50010 are bad. Aborting...
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-11-03 20:32:49,292 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: All datanodes 10.1.10.112:50010 are bad. Aborting...
2014-11-03 20:32:49,292 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: All datanodes 10.1.10.112:50010 are bad. Aborting...
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)

2014-11-03 20:32:49,294 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task

 

Any thoughts on why this is happening? How can I debug it? And how can I make sure the data is actually getting copied to all of these nodes? Storage shouldn't be a problem, since I have more than 1 TB of space across the 16 nodes combined (or do you have a different opinion?). I have even increased the Linux open-file limit from 1024 to 16384 (see the sketch below), but with no success.
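For completeness, this is roughly how I raised the open-file limit, via /etc/security/limits.conf for the Hadoop service users. Treat it as a sketch; I understand some distributions manage service-daemon limits elsewhere (e.g. in init scripts), so limits.conf may not be where your setup reads them:

 # /etc/security/limits.conf - raise the nofile limit for the Hadoop
 # service users, then restart the daemons so the new limit takes effect
 hdfs  -  nofile  16384
 yarn  -  nofile  16384

 # Verify the effective limit as the service user
 su -s /bin/bash -c 'ulimit -n' hdfs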

 

Really appreciate your help. 

 

Thanks

Explorer
Posts: 7
Registered: ‎10-21-2017

Re: 1 TB TeraGen fails to run with the following error

Hi Kewal, were you able to solve this issue? I am also getting a similar error message. I also tried increasing the ulimit, but it did not help.
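In case it helps anyone else hitting this: the next thing I'm going to try is raising the HDFS socket timeouts, since the trace above shows a 65000 ms read timeout while the client waits for pipeline acks (the 60000 ms default for dfs.client.socket-timeout plus the 5000 ms per-datanode extension would explain that number). Here is a sketch of what I'm adding to hdfs-site.xml; the property names are standard HDFS ones, but the values are my own guesses, not recommendations from the docs:

 <property>
   <!-- Client-side read/ack timeout in ms; the default is 60000 -->
   <name>dfs.client.socket-timeout</name>
   <value>300000</value>
 </property>
 <property>
   <!-- Datanode write timeout in ms; the default is 480000 -->
   <name>dfs.datanode.socket.write.timeout</name>
   <value>600000</value>
 </property>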
