
Bootstrap times out while resizing root partition for large disks

Expert Contributor

Getting timeouts and a 'suspended due to failure' message while resizing the root file system for large (1TB) volumes. This probably deserves a much higher timeout, especially considering that 16TB volumes will be available soon.


[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: resize2fs 1.41.12 (17-May-2010)
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: Filesystem at /dev/xvde is mounted on /; on-line resizing required
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: old desc_blocks = 59, new_desc_blocks = 64
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: Performing an on-line resize of /dev/xvde to 268435456 (4k) blocks.
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: resize2fs: Device or resource busy While trying to extend the last group
[2014-11-18 00:22:20] ERROR [pipeline-thread-23] - c.c.l.p.DatabasePipelineRunner: Attempt to execute job failed
com.cloudera.launchpad.common.ssh.SshException: Script execution failed with code 1. Script: sudo resize2fs $(sudo mount | grep "on / type" | awk '{ print $1 }')
at com.cloudera.launchpad.pipeline.ssh.SshJobFailFastWithOutputLogging.run(SshJobFailFastWithOutputLogging.java:47) ~[launchpad-pipeline-common-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.ssh.SshJobFailFastWithOutputLogging.run(SshJobFailFastWithOutputLogging.java:27) ~[launchpad-pipeline-common-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.job.Job3.runUnchecked(Job3.java:32) ~[launchpad-pipeline-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner$1.call(DatabasePipelineRunner.java:229) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) ~[guava-retrying-1.0.6.jar!/:na]
at com.github.rholder.retry.Retryer.call(Retryer.java:110) ~[guava-retrying-1.0.6.jar!/:na]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.attemptMultipleJobExecutionsWithRetries(DatabasePipelineRunner.java:213) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.run(DatabasePipelineRunner.java:132) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.6.0_33]
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) ~[na:1.6.0_33]
at java.util.concurrent.FutureTask.run(FutureTask.java:166) ~[na:1.6.0_33]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) ~[na:1.6.0_33]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.6.0_33]
at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_33]
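
For reference, the exact command Director runs is the one shown in the stack trace; it can be re-run and timed manually on an affected instance to check whether it completes and how long it takes. A rough sketch, assuming you can still SSH to the node:

    # on the affected instance: find the device backing / and time the on-line resize
    time sudo resize2fs $(sudo mount | grep "on / type" | awk '{ print $1 }')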


12 REPLIES

Expert Contributor

Is there any way to recover instances that were 'suspended due to failure'?

Master Collaborator

Currently, Director considers this type of failure unrecoverable. We are planning to make retries possible in a future release, but that wouldn't solve this particular failure.

For your use case, what you probably need to do is increase the SSH read timeout via the lp.ssh.readTimeoutInSeconds configuration setting, either as a command-line argument or by editing the configuration files under /etc/cloudera-director-server or /etc/cloudera-director (used when running in standalone mode).

The read timeout should be larger than the amount of time it takes to perform that resize operation.
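
For example (the file name and value below are assumptions on my part, so please verify against your installation), in server mode you could add something like this to the properties file under /etc/cloudera-director-server:

    # assumed file: /etc/cloudera-director-server/application.properties
    # allow up to 30 minutes for the on-line resize of large root volumes
    lp.ssh.readTimeoutInSeconds=1800

When running the standalone client, the equivalent would be a command-line argument along the lines of --lp.ssh.readTimeoutInSeconds=1800 (again, syntax from memory).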

Expert Contributor

Do I need to restart the Director service after changing this?

What would happen if I tried adding the nodes back in via Cloudera Manager's 'add new hosts to cluster' action?

Master Collaborator

A restart is required after changing any configuration option from that file. 
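
With the server packages that is typically something along the lines of the following (service name assumed from a standard package install):

    sudo service cloudera-director-server restart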

Master Collaborator

What would happen if I tried adding the nodes back in via Cloudera Manager's 'add new hosts to cluster' action?

The behavior is undefined. Director needs to do that on its own; it can't handle external modifications to the cluster topology.

Master Collaborator

How are you using those 1TB volumes? By default Director will try to use ephemeral storage for HDFS. 
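
For context, the root volume that gets resized during bootstrap is the one sized in the instance template of the cluster configuration file; a minimal sketch (field names as I recall them from the sample AWS configuration, so treat them as approximate):

    instances {
        workers {
            type: m3.xlarge          # example instance type
            image: ami-XXXXXXXX      # placeholder AMI id
            rootVolumeSizeGB: 1000   # the 1TB root volume that triggers the long resize
        }
    }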

Expert Contributor

I was not aware that Director would only use ephemeral storage, and I don't believe that is mentioned in the documentation. This will be a long-lived cluster running ongoing jobs that require several terabytes of storage.

Master Collaborator

Even for long-running clusters, ephemeral storage is better from a performance perspective, and you already get replication from HDFS. Are you using EBS as a way to recover from instance failure?

Expert Contributor

EBS was chosen mostly due to the storage requirements; we found the hit to I/O performance acceptable for now, given the cost of the alternative of running 3-4 times as many instances to hold our working sets. All data is backed up to S3.