Data node failed to restart after a reboot node

New Contributor

Hi,

 

After a reboot, I get this error when I try to restart a DataNode: "Could not get disk usage information".

Any idea how to solve this problem?

 

 

 

4 REPLIES

Re: Data node failed to restart after a reboot node

Cloudera Employee

It seems something is wrong with the disks. Please check with your Linux admin team to confirm that the DataNode disks are healthy.


Re: Data node failed to restart after a reboot node

New Contributor

I checked all the file systems with the command "du -sk" and the response was OK.

 

I tried to restart the DataNode again, and now I get a different error:

 

Successfully obtained privileged resources (streaming port = ServerSocket[addr=/0.0.0.0,localport=1004] ) (http listener port = 1006)
Opened info server at /0.0.0.0:1006
Starting regular datanode initialization
log4j:ERROR Failed to flush writer,
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:59)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:324)
at org.apache.log4j.RollingFileAppender.subAppend(RollingFileAppender.java:276)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:176)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.releaseShortCircuitFds(DataXceiver.java:408)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReleaseShortCircuitFds(Receiver.java:236)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:124)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:748)
Service exit with a return value of 143
Initializing secure datanode resources
Opened streaming server at /0.0.0.0:1004
Successfully obtained privileged resources (streaming port = ServerSocket[addr=/0.0.0.0,localport=1004] ) (http listener port = 1006)
Opened info server at /0.0.0.0:1006
Starting regular datanode initialization
Service exit with a return value of 143
Initializing secure datanode resources

 

I checked the listening ports on the node for sockets 1004 and 1006, and no process is listening on them:

 

[root@ithbda106 ~]# netstat -netupa | egrep "1004|1006" | grep LIST

<<< No output >>>

 

How can I solve this?
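
A side note on the "Service exit with a return value of 143" lines in the log above: by the usual shell convention, an exit status above 128 means the process was ended by signal (status - 128), and 143 - 128 = 15 is SIGTERM. The helper below is only a sketch to decode such a status; the function name signal_from_status is made up for illustration, and whether the launcher here follows that convention is an assumption.

```shell
# Decode a shell-style exit status into a signal name (hypothetical helper).
# Statuses above 128 conventionally mean "killed by signal (status - 128)".
signal_from_status() {
    status="$1"
    if [ "$status" -gt 128 ]; then
        # kill -l accepts an exit status and prints the matching signal name
        kill -l "$status"
    else
        echo "exit status $status is not a signal exit"
    fi
}

signal_from_status 143   # prints TERM
```

If that reading is right, the DataNode process was asked to stop rather than crashing by itself, which would fit a supervisor stopping it after the failed start.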

 

 

 

 


Re: Data node failed to restart after a reboot node

Cloudera Employee

From the shared error stack, I can see the error below:

 

Starting regular datanode initialization
log4j:ERROR Failed to flush writer,
java.io.IOException: No space left on device

 

Please check whether there is a disk space issue on these DataNodes.
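
One way to do that check quickly is to filter the df output for nearly full mounts. This is a minimal sketch, not a Cloudera procedure: the function name flag_full_mounts and the 95% default threshold are assumptions, and it expects "df -P" style input (one line per file system) on stdin.

```shell
# Print mount points whose usage is at or above a threshold (in percent).
# Reads `df -P` style output on stdin so it also works on saved output.
# The name and the default of 95 are illustrative choices.
flag_full_mounts() {
    awk -v t="${1:-95}" '
        NR > 1 {                 # skip the header line
            pct = $5
            sub(/%/, "", pct)    # "96%" -> "96"
            if (pct + 0 >= t) print $6, pct "%"
        }'
}

# Usage on a live node:
#   df -P | flag_full_mounts 95
```

Mount points containing spaces would confuse the column split, so treat the output as a first pass only.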


Re: Data node failed to restart after a reboot node

New Contributor

Hi.

 

This Hadoop cluster has 1.4 PB of capacity, and on this node the mount points currently look like this:

 

[root@ithbda108 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md2              459G   50G  385G  12% /
tmpfs                 126G   36K  126G   1% /dev/shm
/dev/md0              453M   77M  349M  19% /boot
/dev/sda4             6.6T  6.3T  323G  96% /u01
/dev/sdb4             6.6T  6.3T  321G  96% /u02
/dev/sdc1             7.1T  6.8T  314G  96% /u03
/dev/sdd1             7.1T  6.8T  314G  96% /u04
/dev/sde1             7.1T  6.8T  318G  96% /u05
/dev/sdf1             7.1T  6.8T  323G  96% /u06
/dev/sdg1             7.1T  6.8T  325G  96% /u07
/dev/sdh1             7.1T  6.8T  323G  96% /u08
/dev/sdi1             7.1T  6.8T  324G  96% /u09
/dev/sdj1             7.1T  6.8T  324G  96% /u10
/dev/sdk1             7.1T  6.8T  324G  96% /u11
/dev/sdl1             7.1T  6.8T  322G  96% /u12
cm_processes          126G  200M  126G   1% /var/run/cloudera-scm-agent/process
ithbda103.sopbda.telcel.com:/opt/exportdir
                      459G  338G   98G  78% /opt/shareddir

 

I suppose it could be a disk space issue: there was no space left on the device at the moment log4j tried to write to its log file.

 

Any idea what we can do to free up space on the mount points?

Is there a Cloudera procedure to get the process to come up again?
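
Since df -h above still shows free blocks on every mount, it may be worth checking inodes as well, because "No space left on device" is also returned when a file system runs out of inodes. The sketch below reports both for the file system holding a given directory. It assumes GNU df (the -i option is not part of POSIX), the function name space_report is made up, and the log directory path is a placeholder that is not taken from this thread.

```shell
# Report block and inode usage for the file system that holds a directory.
# Running out of either one produces "No space left on device".
space_report() {
    dir="${1:-.}"
    # -P keeps each file system on one line, so the data row is line 2;
    # -k forces 1024-byte units.
    df -Pk "$dir" | awk 'NR == 2 { print "blocks: " $5 " used, " $4 " KB available on " $6 }'
    df -Pi "$dir" | awk 'NR == 2 { print "inodes: " $5 " used, " $4 " free on " $6 }'
}

# Usage (placeholder path, replace with the DataNode log directory):
#   space_report /path/to/datanode/log/dir
```

If both lines look healthy for the log directory, the device may have been full only at the moment of the failed write, which this snapshot cannot show.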