Created 11-17-2014 04:28 PM
Getting timeouts and a 'suspended due to failure' message while resizing the root file system for large (1 TB) volumes. This probably deserves a much higher timeout, especially considering that 16 TB volumes will be available soon.
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: resize2fs 1.41.12 (17-May-2010)
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: Filesystem at /dev/xvde is mounted on /; on-line resizing required
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: old desc_blocks = 59, new_desc_blocks = 64
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: Performing an on-line resize of /dev/xvde to 268435456 (4k) blocks.
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: resize2fs: Device or resource busy While trying to extend the last group
[2014-11-18 00:22:20] ERROR [pipeline-thread-23] - c.c.l.p.DatabasePipelineRunner: Attempt to execute job failed
com.cloudera.launchpad.common.ssh.SshException: Script execution failed with code 1. Script: sudo resize2fs $(sudo mount | grep "on / type" | awk '{ print $1 }')
at com.cloudera.launchpad.pipeline.ssh.SshJobFailFastWithOutputLogging.run(SshJobFailFastWithOutputLogging.java:47) ~[launchpad-pipeline-common-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.ssh.SshJobFailFastWithOutputLogging.run(SshJobFailFastWithOutputLogging.java:27) ~[launchpad-pipeline-common-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.job.Job3.runUnchecked(Job3.java:32) ~[launchpad-pipeline-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner$1.call(DatabasePipelineRunner.java:229) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) ~[guava-retrying-1.0.6.jar!/:na]
at com.github.rholder.retry.Retryer.call(Retryer.java:110) ~[guava-retrying-1.0.6.jar!/:na]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.attemptMultipleJobExecutionsWithRetries(DatabasePipelineRunner.java:213) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.run(DatabasePipelineRunner.java:132) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.6.0_33]
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) ~[na:1.6.0_33]
at java.util.concurrent.FutureTask.run(FutureTask.java:166) ~[na:1.6.0_33]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) ~[na:1.6.0_33]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.6.0_33]
at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_33]
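For reference, the command from the failed script can be re-run by hand over SSH to check whether the online resize eventually completed despite the error (a sketch only; the ec2-user login name is an assumption, and the IP comes from the log above):

ssh ec2-user@172.31.12.253                                             # log in to the affected instance (user name assumed)
sudo resize2fs $(sudo mount | grep "on / type" | awk '{ print $1 }')   # same command Director ran
df -h /                                                                # confirm the root file system shows the new size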
Created 11-17-2014 04:31 PM
Is there any way to recover instances that were 'suspended due to failure'?
Created 11-17-2014 04:37 PM
Currently Director considers this type of failure unrecoverable. We are planning to make it possible to retry in a future release, but that wouldn't solve this particular failure.
For your use case, what you probably need to do is increase the SSH read timeout by changing the lp.ssh.readTimeoutInSeconds configuration, either as a command-line argument or by editing the configuration files under /etc/cloudera-director-server or /etc/cloudera-director (the latter is used when running in standalone mode).
The read timeout should be larger than the amount of time it takes to perform the resize operation.
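For illustration, a minimal sketch of the file-based change, assuming the server reads a properties-style file under /etc/cloudera-director-server (the application.properties file name and the 1800-second value are assumptions; only the property name comes from this thread):

# /etc/cloudera-director-server/application.properties (assumed file name)
# Allow up to 30 minutes for long-running remote commands such as resize2fs
lp.ssh.readTimeoutInSeconds=1800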
Created 11-17-2014 04:40 PM
Do I need to restart the Director service after changing this?
What would happen if I tried adding the nodes back in via the manager's 'add new hosts to cluster' option?
Created 11-17-2014 04:43 PM
A restart is required after changing any configuration option from that file.
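For example (a sketch; the cloudera-director-server service name is an assumption based on the configuration path mentioned above):

sudo service cloudera-director-server restart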
Created 11-17-2014 05:11 PM
> What would happen if I tried adding the nodes back in via the manager's 'add new hosts to cluster' option?
The behavior is undefined. Director needs to do that on its own. It can't deal with external cluster topology modifications.
Created 11-17-2014 04:38 PM
How are you using those 1TB volumes? By default Director will try to use ephemeral storage for HDFS.
Created 11-17-2014 04:42 PM
I was not aware that Director would only use ephemeral storage, and I don't believe that is mentioned in the documentation. This will be a long-lived cluster for ongoing jobs that require several terabytes of storage.
Created 11-17-2014 04:47 PM
Even for long-running clusters, ephemeral storage is better from a performance perspective, and you already get replication from HDFS. Are you using EBS as a way to recover from instance failure?
Created 11-17-2014 04:53 PM
EBS was chosen mostly due to the storage requirements, and we found the hit to I/O performance acceptable for now given the cost versus running 3-4 times as many instances to store our working sets. All data is backed up to S3.