Created 11-17-2014 04:28 PM
Getting timeouts and a 'suspended due to failure' message while resizing the root file system for large (1TB) volumes. This probably deserves a much higher timeout, especially considering that 16TB volumes will be available soon.
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: resize2fs 1.41.12 (17-May-2010)
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: Filesystem at /dev/xvde is mounted on /; on-line resizing required
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: old desc_blocks = 59, new_desc_blocks = 64
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: Performing an on-line resize of /dev/xvde to 268435456 (4k) blocks.
[2014-11-18 00:22:20] INFO [io-thread-9] - ssh:172.31.12.253: resize2fs: Device or resource busy While trying to extend the last group
[2014-11-18 00:22:20] ERROR [pipeline-thread-23] - c.c.l.p.DatabasePipelineRunner: Attempt to execute job failed
com.cloudera.launchpad.common.ssh.SshException: Script execution failed with code 1. Script: sudo resize2fs $(sudo mount | grep "on / type" | awk '{ print $1 }')
at com.cloudera.launchpad.pipeline.ssh.SshJobFailFastWithOutputLogging.run(SshJobFailFastWithOutputLogging.java:47) ~[launchpad-pipeline-common-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.ssh.SshJobFailFastWithOutputLogging.run(SshJobFailFastWithOutputLogging.java:27) ~[launchpad-pipeline-common-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.job.Job3.runUnchecked(Job3.java:32) ~[launchpad-pipeline-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner$1.call(DatabasePipelineRunner.java:229) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) ~[guava-retrying-1.0.6.jar!/:na]
at com.github.rholder.retry.Retryer.call(Retryer.java:110) ~[guava-retrying-1.0.6.jar!/:na]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.attemptMultipleJobExecutionsWithRetries(DatabasePipelineRunner.java:213) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.run(DatabasePipelineRunner.java:132) ~[launchpad-pipeline-database-1.0.1.jar!/:1.0.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.6.0_33]
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) ~[na:1.6.0_33]
at java.util.concurrent.FutureTask.run(FutureTask.java:166) ~[na:1.6.0_33]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) ~[na:1.6.0_33]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.6.0_33]
at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_33]
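For reference, the command that failed is the one-liner from the exception above. Assuming SSH access to the instance (the login user and key path below are placeholders that depend on the AMI), it can be retried by hand to see whether the resize completes when given enough time:

# Placeholders: key path and login user depend on your AMI
ssh -i /path/to/key.pem ec2-user@172.31.12.253
# Same command Director runs, taken from the exception above; resize2fs
# prints "Nothing to do!" if the resize has already finished
sudo resize2fs $(sudo mount | grep "on / type" | awk '{ print $1 }')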
Created 11-17-2014 04:37 PM
Currently Director considers this type of failure unrecoverable. We are planning to make retries possible in a future release, but that wouldn't address this particular failure.
For your use case, what you probably need to do is increase the SSH read timeout by changing the lp.ssh.readTimeoutInSeconds configuration, either as a command line argument or by editing the configuration files under /etc/cloudera-director-server (or /etc/cloudera-director when running in standalone mode).
The read timeout should be larger than the amount of time the resize operation takes.
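For example (a sketch; the exact file name under that directory may vary by version, and 7200 seconds is just an illustrative value, so pick something comfortably above your observed resize time):

# e.g. in /etc/cloudera-director-server/application.properties
lp.ssh.readTimeoutInSeconds: 7200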
Created 11-17-2014 05:00 PM
I understand. My suggestion would be to evaluate some of the first and second generation AWS instance types with magnetic drives for ephemeral storage; for example, m1.xlarge has 1.6TB of instance storage. My guess is that you will get better performance, but I don't know whether that's acceptable from a cost perspective. With regards to backups, S3 is the way to go unless you have a very strict SLA that requires another online cluster.
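For example, a backup to S3 can be as simple as a periodic DistCp job (a sketch; the bucket and paths are placeholders, and the s3n connector needs fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey set in core-site.xml):

# Placeholder bucket and paths; run from a cluster gateway node
hadoop distcp hdfs:///user/data s3n://my-backup-bucket/backups/user-data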
Created 11-17-2014 05:04 PM
If you decide to use ephemeral storage (it's free, included in the instance price), then a 50-100GB root disk will probably work just fine.
Created 11-17-2014 05:05 PM
Instances with magnetic ephemeral storage are probably not a bad solution to this, but I would want more cores than they provide. I will investigate further after my team has inspected some of the nodes to ensure all the libraries and components we require are installed properly.