Bootstrap times out while resizing root partition for large disks

jwm — Tue, 18 Nov 2014 00:28:57 GMT

I'm getting timeouts and a 'suspended due to failure' message while resizing the root file system on large (1TB) volumes. This probably deserves a much higher timeout, especially considering that 16TB volumes will be available soon.

[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: resize2fs 1.41.12 (17-May-2010)
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: Filesystem at /dev/xvde is mounted on /; on-line resizing required
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: old desc_blocks = 59, new_desc_blocks = 64
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: Performing an on-line resize of /dev/xvde to 268435456 (4k) blocks.
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: resize2fs: Device or resource busy While trying to extend the last group
[2014-11-18 00:22:20] ERROR [pipeline-thread-23] - c.c.l.p.DatabasePipelineRunner: Attempt to execute job failed
com.cloudera.launchpad.common.ssh.SshException: Script execution failed with code 1. Script: sudo resize2fs $(sudo mount | grep "on / type" | awk '{ print $1 }')
at com.cloudera.launchpad.pipeline.ssh.SshJobFailFastWithOutputLogging.run(SshJobFailFastWithOutputLogging.java:47) ~[launchpad-pipeline-common-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.ssh.SshJobFailFastWithOutputLogging.run(SshJobFailFastWithOutputLogging.java:27) ~[launchpad-pipeline-common-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.job.Job3.runUnchecked(Job3.java:32) ~[launchpad-pipeline-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner$1.call(DatabasePipelineRunner.java:229) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) ~[guava-retrying-1.0.6.jar!/:na]
at com.github.rholder.retry.Retryer.call(Retryer.java:110) ~[guava-retrying-1.0.6.jar!/:na]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.attemptMultipleJobExecutionsWithRetries(DatabasePipelineRunner.java:213) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.run(DatabasePipelineRunner.java:132) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.6.0_33]
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) ~[na:1.6.0_33]
at java.util.concurrent.FutureTask.run(FutureTask.java:166) ~[na:1.6.0_33]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) ~[na:1.6.0_33]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.6.0_33]
at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_33]
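
As a sanity check on the figures in the log above, the following Python sketch works through the arithmetic: 268435456 blocks of 4 KiB come to exactly 1 TiB, and a file system of that size needs 64 group-descriptor blocks, which matches new_desc_blocks. The 32768 blocks per group and the 32-byte descriptor size are assumptions (common ext3/ext4 defaults for 4 KiB blocks without the 64bit feature); neither value is stated in the log.

```python
BLOCK_SIZE = 4096          # bytes; the log says "(4k) blocks"
TARGET_BLOCKS = 268435456  # target size from the resize2fs log line

# Assumptions, not taken from the log: common ext3/ext4 defaults.
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE  # one bitmap block covers 32768 blocks
DESC_SIZE = 32                     # bytes per group descriptor


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def filesystem_bytes(blocks, block_size=BLOCK_SIZE):
    """Total size in bytes of a file system with the given block count."""
    return blocks * block_size


def descriptor_blocks(blocks, block_size=BLOCK_SIZE,
                      blocks_per_group=BLOCKS_PER_GROUP,
                      desc_size=DESC_SIZE):
    """Blocks needed to hold one descriptor per block group."""
    groups = ceil_div(blocks, blocks_per_group)
    return ceil_div(groups * desc_size, block_size)


if __name__ == "__main__":
    print(filesystem_bytes(TARGET_BLOCKS) / 2**40, "TiB")
    print(descriptor_blocks(TARGET_BLOCKS), "descriptor blocks")
```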

Re: Bootstrap times out while resizing root partition for large disks

jwm — Tue, 18 Nov 2014 00:31:09 GMT

Is there any way to recover instances that were 'suspended due to failure'?

Re: Bootstrap times out while resizing root partition for large disks

Andrei Savu — Tue, 18 Nov 2014 00:37:03 GMT

Currently Director considers this type of failure unrecoverable. We are planning to make it possible to retry in a future release, but that wouldn't solve this failure.

For your use case, what you probably need to do is increase the SSH read timeout by changing the lp.ssh.readTimeoutInSeconds configuration setting, either as a command-line argument or by editing the configuration files under /etc/cloudera-director-server or /etc/cloudera-director (used when running in standalone mode).

The read timeout should be larger than the amount of time it takes to perform that resize operation.
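
One way to pick a value is to time the resize once on a comparable volume and then add headroom. The Python sketch below illustrates that idea only; the helper names, the 2x safety factor, the 60-second floor, and the 14-minute measurement are all hypothetical and are not part of Director.

```python
import math
import subprocess
import time


def time_command(argv):
    """Run a command and return how long it took, in seconds."""
    start = time.monotonic()
    subprocess.run(argv, check=True)
    return time.monotonic() - start


def suggest_read_timeout(measured_seconds, safety_factor=2.0, floor_seconds=60):
    """Return a whole-second timeout comfortably above a measured duration.

    safety_factor and floor_seconds are illustrative defaults, not
    values recommended by Director.
    """
    return max(floor_seconds, math.ceil(measured_seconds * safety_factor))


if __name__ == "__main__":
    # Hypothetical measurement: a resize that took 14 minutes.
    measured = 14 * 60
    print("suggested lp.ssh.readTimeoutInSeconds value:",
          suggest_read_timeout(measured))
```

In practice, time_command could wrap the same resize2fs invocation shown in the log, run by hand on a test instance, and its result would be passed to suggest_read_timeout.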

Re: Bootstrap times out while resizing root partition for large disks

Andrei Savu — Tue, 18 Nov 2014 00:38:57 GMT

How are you using those 1TB volumes? By default Director will try to use ephemeral storage for HDFS.

Re: Bootstrap times out while resizing root partition for large disks

jwm — Tue, 18 Nov 2014 00:40:18 GMT

Do I need to restart the director service after changing this?

What would happen if I tried adding the nodes back in via the manager 'add new hosts to cluster'?

Re: Bootstrap times out while resizing root partition for large disks

jwm — Tue, 18 Nov 2014 00:42:27 GMT

I was not aware that Director would only use ephemeral storage, and I don't believe that was ever mentioned in the documentation. This will be a long-lived cluster for ongoing jobs that require several terabytes of storage.

Re: Bootstrap times out while resizing root partition for large disks

Andrei Savu — Tue, 18 Nov 2014 00:43:38 GMT

A restart is required after changing any configuration option from that file.

Re: Bootstrap times out while resizing root partition for large disks

Andrei Savu — Tue, 18 Nov 2014 00:47:44 GMT

Even for long-running clusters, ephemeral storage is better from a performance perspective. You already get replication from HDFS. Are you using EBS as a way to recover from instance failure?

Re: Bootstrap times out while resizing root partition for large disks

jwm — Tue, 18 Nov 2014 00:53:01 GMT

EBS was chosen mostly due to the storage requirements, and we found that the hit to IO performance was acceptable for now, given the cost of needing 3-4 times as many instances to store our working sets. All data is backed up to S3.

Re: Bootstrap times out while resizing root partition for large disks

Andrei Savu — Tue, 18 Nov 2014 01:00:58 GMT

I understand. My suggestion would be to evaluate some of the first- and second-generation AWS instance types with magnetic drives for ephemeral storage. For example, m1.xlarge has 1.6TB of storage. My guess is that you will get better performance, but I don't know if that's acceptable from a cost perspective. With regard to backups, S3 is the way to go unless you have a very strict SLA that requires another online cluster.

Re: Bootstrap times out while resizing root partition for large disks

Andrei Savu — Tue, 18 Nov 2014 01:04:18 GMT

If you decide to use ephemeral storage (it's free - included in the instance price) then a 50-100GB root disk drive will probably work just fine.

Re: Bootstrap times out while resizing root partition for large disks

jwm — Tue, 18 Nov 2014 01:05:02 GMT

Instances with magnetic ephemeral storage are probably not a bad solution to this, but I would probably want more cores than they provide. I will investigate further after my team has inspected some of the nodes to ensure all the libraries and components we require are installed properly.

Re: Bootstrap times out while resizing root partition for large disks

Andrei Savu — Tue, 18 Nov 2014 01:11:08 GMT

> What would happen if I tried adding the nodes back in via the manager 'add new hosts to cluster'?

The behavior is undefined. Director needs to do that on its own. It can't deal with external cluster topology modifications.