
ZooKeeper fails with "Failed to initialize ZooKeeper" - looks to be a timeout issue

Explorer

I am using the cloudera-director-client-1.1.3 to start up a small cluster.  I am having issues with ZooKeeper.  I have tried using both a single ZooKeeper node and 3 ZooKeeper nodes.  Sometimes the script runs, but most of the time it fails.

 

If I try to access the logs I get:

 

[Errno 2] No such file or directory: '/var/log/zookeeper/zookeeper-cmf-CD-ZOOKEEPER-bsejIGrE-SERVER-ip-10-0-1-41.ec2.internal.log'

 

If I look on the node itself, the log folder is empty.

 

Here is the error that I think is causing the issue:

Command ZkInit with ID 56 failed.

resultMessage=Command aborted because of exception: Command timed-out after 90 seconds

 

Is there a way to increase the timeout?  Do I need to turn off the diagnostic data collection (could that be causing an issue)?

 

Here is a snippet of the log file:

 

[2015-07-30 14:40:09] INFO  [pipeline-thread-1] - c.c.launchpad.pipeline.AbstractJob: Creating cluster services
[2015-07-30 14:40:09] INFO  [pipeline-thread-1] - c.c.launchpad.pipeline.AbstractJob: Assigning roles to instances
[2015-07-30 14:40:09] INFO  [pipeline-thread-1] - c.c.l.bootstrap.cluster.AddServices: Creating 11 roles for service CD-HDFS-sxHsiZxW
[2015-07-30 14:40:09] INFO  [pipeline-thread-1] - c.c.l.bootstrap.cluster.AddServices: Creating 6 roles for service CD-YARN-GqpLWLyB
[2015-07-30 14:40:10] INFO  [pipeline-thread-1] - c.c.l.bootstrap.cluster.AddServices: Creating 1 roles for service CD-ZOOKEEPER-bsejIGrE
[2015-07-30 14:40:10] INFO  [pipeline-thread-1] - c.c.l.bootstrap.cluster.AddServices: Creating 5 roles for service CD-HIVE-PJiYBqej
[2015-07-30 14:40:10] INFO  [pipeline-thread-1] - c.c.l.bootstrap.cluster.AddServices: Creating 1 roles for service CD-HUE-GatHLWtp
[2015-07-30 14:40:10] INFO  [pipeline-thread-1] - c.c.l.bootstrap.cluster.AddServices: Creating 1 roles for service CD-OOZIE-nQaBSyxZ
[2015-07-30 14:40:11] INFO  [pipeline-thread-1] - c.c.launchpad.pipeline.AbstractJob: Automatically configuring services and roles
[2015-07-30 14:40:11] INFO  [pipeline-thread-1] - c.c.launchpad.pipeline.AbstractJob: Applying custom configurations of services
[2015-07-30 14:40:12] INFO  [pipeline-thread-1] - c.c.launchpad.pipeline.AbstractJob: Configuring Hive Metastore database
....
[2015-07-30 14:40:25] INFO  [pipeline-thread-1] - c.c.launchpad.pipeline.AbstractJob: Creating Hive Metastore Database
[2015-07-30 14:40:25] INFO  [pipeline-thread-1] - c.c.l.pipeline.util.PipelineRunner: << None{}
[2015-07-30 14:40:28] INFO  [pipeline-thread-1] - c.c.l.pipeline.util.PipelineRunner: >> UnboundedWaitForApiCommand/3 [53, Deployment{name='dev-3 Deployment', hostname='10.0.1.90', port=7180, username='admin', man ...
[2015-07-30 14:40:28] INFO  [pipeline-thread-1] - c.c.l.b.UnboundedWaitForApiCommand: Command CreateHiveDatabase with ID 53 completed successfully. Details: ApiCommand{id=53, name=CreateHiveDatabase, startTime=Thu Jul 30 14:40:19 EDT 2015, endTime=Thu Jul 30 14:40:25 EDT 2015, active=false, success=true, resultMessage=Created Hive Metastore Database., serviceRef=ApiServiceRef{peerName=null, clusterName=dev-3, serviceName=CD-HIVE-PJiYBqej}, roleRef=null, hostRef=null, parent=null}
[2015-07-30 14:40:28] INFO  [pipeline-thread-1] - c.c.l.pipeline.util.PipelineRunner: << None{}
[2015-07-30 14:40:30] INFO  [pipeline-thread-1] - c.c.l.pipeline.util.PipelineRunner: >> SetStatusJob/1 [Waiting on First Run command]
[2015-07-30 14:40:31] INFO  [pipeline-thread-1] - c.c.launchpad.pipeline.AbstractJob: Waiting on First Run command

 

I see an error about collecting diagnostic data:

....

[2015-07-30 14:43:20] INFO  [pipeline-thread-1] - c.c.l.b.UnboundedWaitForApiCommand: Collecting and downloading diagnostic data
[2015-07-30 14:43:21] ERROR [pipeline-thread-1] - c.c.l.b.ClouderaManagerLogRetriever: Got exception while collecting diagnostic data
javax.ws.rs.ServiceUnavailableException: null

 

And then it attempts to do the "first run" of the install:

[2015-07-30 14:43:21] WARN  [pipeline-thread-1] - c.c.l.b.UnboundedWaitForApiCommand: Failed to collect diagnostic data
[2015-07-30 14:43:21] ERROR [pipeline-thread-1] - c.c.l.b.UnboundedWaitForApiCommand: Command First Run with ID 54 failed. Details: ApiCommand{id=54, name=First Run, startTime=Thu Jul 30 14:40:20 EDT 2015, endTime=Thu Jul 30 14:43:20 EDT 2015, active=false, success=false, resultMessage=Failed to perform First Run of services., serviceRef=null, roleRef=null, hostRef=null, parent=null}
[2015-07-30 14:43:21] ERROR [pipeline-thread-1] - c.c.l.b.UnboundedWaitForApiCommand: Command ZkInit with ID 56 failed. Details: ApiCommand{id=56, name=ZkInit, startTime=Thu Jul 30 14:41:50 EDT 2015, endTime=Thu Jul 30 14:43:20 EDT 2015, active=false, success=false, resultMessage=Command aborted because of exception: Command timed-out after 90 seconds, serviceRef=ApiServiceRef{peerName=null, clusterName=dssh-dev-3, serviceName=CD-ZOOKEEPER-bsejIGrE}, roleRef=ApiRoleRef{clusterName=dssh-dev-3, serviceName=CD-ZOOKEEPER-bsejIGrE, roleName=CD-ZOOKEEPER-bsejIGrE-SERVER-95a522458bc9844f970bdffc8e1a5c6f}, hostRef=null, parent=null}
[2015-07-30 14:43:21] ERROR [pipeline-thread-1] - c.c.l.pipeline.util.PipelineRunner: Attempt to execute job failed
com.cloudera.launchpad.pipeline.UnrecoverablePipelineError: Cloudera Manager 'First Run' command execution failed: Failed to perform First Run of services.

 

Here is the cluster description:

 

# Cluster description
cluster {
    products {
      CDH: 5
    }
    parcelRepositories: ["http://archive.cloudera.com/cdh5/parcels/5.3.3/"]
    services: [HDFS, YARN, ZOOKEEPER, HIVE, HUE, OOZIE]

    configs {
      ZOOKEEPER {
        zookeeper_datadir_autocreate: true
      }
    }

    masters-1 {
      count: 1
      instance: ${instances.c32x} {
        tags {
          group: master
        }
      }

      roles {
        HDFS: [NAMENODE, GATEWAY]
      }

    }

    masters-2 {
      count: 1
      instance: ${instances.c32x} {
        tags {
          group: master
        }
      }

      roles {
        ZOOKEEPER: [SERVER]
        HDFS: [SECONDARYNAMENODE, GATEWAY]
        YARN: [RESOURCEMANAGER, JOBHISTORY]
      }
    }

    workers {
      count: 2
      minCount: 2
      instance: ${instances.d2x} {
        tags {
          group: worker
        }
      }

      roles {
        HDFS: [GATEWAY, DATANODE]
        HIVE: [GATEWAY]
        YARN: [NODEMANAGER, GATEWAY]
      }
    }

    gateways {
      count: 1
      instance: ${instances.c32x} {
        tags {
          group: gateway
        }
      }

      roles {
        HIVE: [GATEWAY, HIVEMETASTORE, HIVESERVER2]
        HDFS: [GATEWAY, BALANCER, HTTPFS]
        HUE: [HUE_SERVER]
        OOZIE: [OOZIE_SERVER]
      }


    }
}
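For reference, the config above carries a single ZooKeeper SERVER role (on masters-2). A minimal sketch of how the three-ZooKeeper attempt mentioned earlier could be expressed in the same format, assuming a dedicated instance group (the group name and the reuse of the c32x template here are illustrative, not taken from the actual config):

    zookeepers {
      count: 3
      instance: ${instances.c32x} {
        tags {
          group: zookeeper
        }
      }

      roles {
        # one ZooKeeper SERVER per instance, giving a 3-node ensemble
        ZOOKEEPER: [SERVER]
      }
    }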

 


5 REPLIES

Master Collaborator

Are you using c3.2xlarge for your cluster? (I see ${instances.c32x} as the reference.) What's the size of the root disk partition? Are there many instances under this AWS account?

Explorer

The root volumes are 200 GB:

rootVolumeSizeGB: 200 # defaults to 50 GB if not specified
rootVolumeType: standard # gp2 for SSD OR standard (for EBS magnetic)

 

For the manager/name/edge nodes I am using this type (4 nodes):

c32x {
    type: c3.2xlarge # requires an HVM AMI
    image: ami-00a11e68
    tags {

...

 

And for the data/yarn nodes I am using this type (2 nodes):

d2x {
    # data node (3 drives of 2 TB each)
    type: d2.xlarge # requires an HVM AMI
    image: ami-00a11e68

 

Yes, we have started up hundreds of instances before (lots of EMR jobs), so capacity is not an issue.  The nodes are running, and I can SSH into all of them. :-)

 

The launcher seems to fail when starting up ZooKeeper.  I can go to Cloudera Manager (CM) and start ZooKeeper, and it starts, but by that time the steps to configure the cluster have to be executed manually.  In CM I can go to each service/role and run its tasks, and the cluster starts OK.  I follow all the steps listed in the run list with no other errors (I do get the warning that I need to change the replication factor from 3, since I'm only starting up 2 HDFS nodes).  I had that override in my service config for HDFS but removed it, thinking maybe it was causing an issue.
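For reference, such an HDFS override would look something like the following in the cluster configs block, mirroring the ZOOKEEPER block in the cluster description above (assuming dfs_replication is the Cloudera Manager config name for the HDFS replication factor):

    configs {
      HDFS {
        # match the number of available DataNodes; illustrative value
        dfs_replication: 2
      }
    }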

 

Do I need to list the roles in a specific order? It looks like ZooKeeper is always the first service to start up, so I assume Director configures CM through the API and then lists the tasks to run in the correct order.

 

Master Collaborator

One suggestion is to try gp2 for the root disk volume. That may make a difference.

 

There is no explicit way of increasing that timeout. It could be done by increasing the agent heartbeat interval, but that has important implications for other core Cloudera Manager services and features. That timeout during First Run is usually a sign of poor AWS performance. We faced a similar challenge in the past in our own testing and had to find ways to use different AWS accounts to get better isolation.
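For reference, that change amounts to switching the volume type in the instance template quoted earlier; a minimal sketch with only the relevant lines shown (the values are taken from the snippets above):

c32x {
    type: c3.2xlarge # requires an HVM AMI
    image: ami-00a11e68
    rootVolumeSizeGB: 200 # defaults to 50 GB if not specified
    rootVolumeType: gp2 # SSD-backed root volume instead of "standard" (EBS magnetic)
}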

Explorer

Yes, I used gp2 instead of "standard" and it completed OK.  I ran the script Saturday morning, so I don't know if it was the change I made or the fact that I ran it on the weekend (less activity).

 

I will need to run the script several more times over the next week or two.  If I encounter issues then I'll be sure to report back.  But for now I'm hoping this change addressed my issue.  

 

Thanks! 

Explorer

Changing the disk type to gp2 appears to have solved the issue. I have recreated the cluster twice and it has not failed.

 

Thanks!