Created 11-17-2014 04:28 PM
Getting timeouts and a 'suspended due to failure' message while resizing the root file system for large (1 TB) volumes. This probably deserves a much higher timeout, especially considering that 16 TB volumes will be available soon.
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: resize2fs 1.41.12 (17-May-2010)
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: Filesystem at /dev/xvde is mounted on /; on-line resizing required
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: old desc_blocks = 59, new_desc_blocks = 64
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: Performing an on-line resize of /dev/xvde to 268435456 (4k) blocks.
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: resize2fs: Device or resource busy While trying to extend the last group
[2014-11-18 00:22:20] ERROR [pipeline-thread-23] - c.c.l.p.DatabasePipelineRunner: Attempt to execute job failed
com.cloudera.launchpad.common.ssh.SshException: Script execution failed with code 1. Script: sudo resize2fs $(sudo mount | grep "on / type" | awk '{ print $1 }')
at com.cloudera.launchpad.pipeline.ssh.SshJobFailFastWithOutputLogging.run(SshJobFailFastWithOutputLogging.java:47) ~[launchpad-pipeline-common-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.ssh.SshJobFailFastWithOutputLogging.run(SshJobFailFastWithOutputLogging.java:27) ~[launchpad-pipeline-common-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.job.Job3.runUnchecked(Job3.java:32) ~[launchpad-pipeline-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner$1.call(DatabasePipelineRunner.java:229) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) ~[guava-retrying-1.0.6.jar!/:na]
at com.github.rholder.retry.Retryer.call(Retryer.java:110) ~[guava-retrying-1.0.6.jar!/:na]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.attemptMultipleJobExecutionsWithRetries(DatabasePipelineRunner.java:213) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.run(DatabasePipelineRunner.java:132) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.6.0_33]
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) ~[na:1.6.0_33]
at java.util.concurrent.FutureTask.run(FutureTask.java:166) ~[na:1.6.0_33]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) ~[na:1.6.0_33]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.6.0_33]
at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_33]
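For reference, the command from the failed script can be re-run by hand over SSH to check whether the online resize eventually completed despite the error (a sketch only; the ec2-user login name is an assumption, and the IP comes from the log above):

ssh ec2-user@172.31.12.253                                             # log in to the affected instance (user name assumed)
sudo resize2fs $(sudo mount | grep "on / type" | awk '{ print $1 }')   # same command Director ran
df -h /                                                                # confirm the root file system shows the new size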
Created 11-17-2014 04:31 PM
Is there any way to recover instances that were 'suspended due to failure'?
Created 11-17-2014 04:37 PM
Currently Director considers this type of failure unrecoverable. We are planning to make it possible to retry in a future release, but that wouldn't solve this particular failure.
For your use case, what you probably need to do is increase the SSH read timeout by changing the lp.ssh.readTimeoutInSeconds configuration, either as a command-line argument or by editing the configuration files under /etc/cloudera-director-server or /etc/cloudera-director (the latter is used when running in standalone mode).
The read timeout should be larger than the amount of time it takes to perform the resize operation.
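For illustration, a minimal sketch of the file-based change, assuming the server reads a properties-style file under /etc/cloudera-director-server (the application.properties file name and the 1800-second value are assumptions; only the property name comes from this thread):

# /etc/cloudera-director-server/application.properties (assumed file name)
# Allow up to 30 minutes for long-running remote commands such as resize2fs
lp.ssh.readTimeoutInSeconds=1800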
Created 11-17-2014 04:40 PM
Do I need to restart the Director service after changing this?
What would happen if I tried adding the nodes back in via the manager's 'add new hosts to cluster' option?
Created 11-17-2014 04:43 PM
A restart is required after changing any configuration option from that file.
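For example (a sketch; the cloudera-director-server service name is an assumption based on the configuration path mentioned above):

sudo service cloudera-director-server restart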
Created 11-17-2014 05:11 PM
> What would happen if I tried adding the nodes back in via the manager's 'add new hosts to cluster' option?
The behavior is undefined. Director needs to do that on its own. It can't deal with external cluster topology modifications.
Created 11-17-2014 04:38 PM
How are you using those 1TB volumes? By default Director will try to use ephemeral storage for HDFS.
Created 11-17-2014 04:42 PM
I was not aware that Director would only use ephemeral storage, and I don't believe that is mentioned in the documentation. This will be a long-lived cluster for ongoing jobs that require several terabytes of storage.
Created 11-17-2014 04:47 PM
Even for long-running clusters, ephemeral storage is better from a performance perspective, and you already get replication from HDFS. Are you using EBS as a way to recover from instance failure?
Created 11-17-2014 04:53 PM
EBS was chosen mostly due to the storage requirements, and we found the hit to I/O performance acceptable for now given the cost versus running 3-4 times as many instances to store our working sets. All data is backed up to S3.