
Crash during installation/configuration using Cloudera Manager 6.1.1 + ansible setup


Short version

 

cloudera_scm_server is crashing during final configuration. I reach Add Cluster - Configuration - Command Details and fail on Formatting the name directories of the current NameNode and Creating Oozie Database Tables.

 

It's a hard failure - the cloudera_scm_server crashes.

 

I'm probably missing some little detail in my setup but haven't found anything yet in either this community or the usual online resources.

 

cloudera_scm_server.log shows:

 

2019-03-28 17:04:09,030 INFO SearchRepositoryManager-0:com.cloudera.server.web.cmf.search.components.SearchRepositoryManager: Finished constructing repo:2019-03-28T17:04:09.030Z
2019-03-28 17:04:09,734 WARN scm-web-92:com.cloudera.server.cmf.descriptor.components.DescriptorFactory: Could not generate client configs for service: YARN (MR2 Included)

Caused by: com.cloudera.cmf.service.config.ConfigGenException: Unable to generate config of 'mapreduce.application.framework.path'

 

and

 

2019-03-28 17:04:00,173 INFO WebServerImpl:com.cloudera.server.web.cmf.search.LuceneSearchRepository: Directory /var/lib/cloudera-scm-server/search/lucene.en..1553792502675 does not seem to be a Lucene index (no segments.gen).
2019-03-28 17:04:00,173 WARN WebServerImpl:com.cloudera.server.web.cmf.search.components.SearchRepositoryManager: Failed to initialize search dir, deleting it: /var/lib/cloudera-scm-server/search/lucene.en..1553792575457
org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.MMapDirectory@/var/lib/cloudera-scm-server/search/lucene.en..1553792575457 lockFactory=org.apache.lucene.store.NativeFSLockFactory@7d930261: files: [write.lock, _0.fdt, _0.fdx]

 

System Check Failures

 

There are a few system check failures. I think I included all required RPMs, but I'll double-check. For maintainability I've tried to focus on the top-level RPMs and rely on package dependencies to pull in everything they need; that avoids problems between releases if the names of those dependencies change.

 

Missing resources:

  • hue plugins
  • keytrustee_kp and keytrustee_server
  • mr1
  • sqoop2

The missing mr1 is especially suspicious, since one of the failure messages refers to YARN.

 

Background / Ansible-based Installation [CentOS 7]

 

We need ansible scripts that can quickly bring up and tear down specific configurations for testing purposes - the idea is that our "big" tests can spin up a dedicated instance, run integration tests against it, and then spin down that instance. We want a fresh instance every time so our tests will have good isolation, and our bean counters will be happy since we're not paying for clusters that we aren't using. Ansible fits into our framework nicely and there's been a lot of pressure to automate the process 100% instead of relying on manual installation/configuration of an AMI image that contains a pre-configured system.

 

I'm most of the way there - I have the ansible plays + roles to do the following (a rough consolidated sketch follows the list):

 

  1. create an EC2 instance
  2. add the Cloudera YUM repos
  3. install postgresql, java, CM packages, and CDH packages
  4. create the required databases and accounts
  5. tweak the system as required (/etc/hosts, etc.)
  6. launch cloudera_scm_server and cloudera_scm_agent

I can log into the manager, select the 'managed' node (my new EC2 instance), select 'packages', and work my way through the installation process to Add Cluster - Configuration - Command Details. Of the 7 steps, I successfully complete 5. The ones that fail are Formatting the name directories of the current NameNode and Creating Oozie Database Tables.

 

My standalone ansible playbook to create a standard single-node HDFS cluster works fine. I don't think I'm missing anything required to format the name directory although it is possible that I commented out a critical step when converting from the standalone playbook to this one.
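
For reference, the standalone format step is just a guarded command task, roughly like the sketch below - the creates: guard path assumes the packages-install default dfs.name.dir, which may not match what CM configures:

# Sketch of the standalone format step; CM's wizard runs its own equivalent.
- name: Format the NameNode metadata directory (first run only)
  hosts: namenode
  become: yes
  become_user: hdfs
  tasks:
    - name: Run hdfs namenode -format unless a VERSION file already exists
      command: hdfs namenode -format -nonInteractive
      args:
        creates: /var/lib/hadoop-hdfs/cache/hdfs/dfs/name/current/VERSION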

 

1 ACCEPTED SOLUTION

avatar
Super Collaborator

If the server crashed with OOM then it may not have sufficient resources (RAM). Are you installing to a single node? What are the hardware resources for this host?

 

Please make sure that the YARN service is started up and in a good health state before starting Oozie.

 

The HDFS startup error indicates a communication issue with the CM server. Please review the CM server logs for any issues during this time, including 'Detected pause in JVM' messages.


6 REPLIES

Just curious - is it on-prem or in the cloud?


AWS EC2

I would recommend deploying it via Cloudera Altus Director; you can find some templates here:
https://github.com/cloudera/director-scripts

You can also do a fast bootstrap with pre-baked images.

Thanks, I'll look at it.

Unfortunately we have a mandate to use ansible and full automation to the largest extent possible. That's because we need to be able to set up a large variety of configurations to match what our customers use.

A good model is my HDFS playbook. It

1. installs the required YUM packages
2. formats the HDFS filesystem
3. adds the standard test users
4. prepares the Kerberos keytab files (tbd)
5. prepares the SSL keystores (tbd)

and sets the flags for standard mode. We can then easily turn on Kerberos and/or RPC privacy via plays that modify just a few properties and restart the services - see the sketch after this paragraph.
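
Roughly like this, assuming a packages-based install and Ansible's xml module - the config path and daemon names are illustrative, not my actual play:

# Sketch - config path and daemon names assume a packages-based install;
# the two properties are the standard Hadoop security knobs.
- name: Flip HDFS into Kerberos mode with RPC privacy
  hosts: hdfs_nodes
  become: yes
  tasks:
    - name: Set the security properties in core-site.xml
      xml:
        path: /etc/hadoop/conf/core-site.xml
        xpath: "/configuration/property[name='{{ item.prop }}']/value"
        value: "{{ item.val }}"
      loop:
        - { prop: hadoop.security.authentication, val: kerberos }
        - { prop: hadoop.rpc.protection, val: privacy }

    - name: Restart the HDFS daemons
      service:
        name: "{{ item }}"
        state: restarted
      loop:
        - hadoop-hdfs-namenode
        - hadoop-hdfs-datanode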

There's an HBase playbook that sets up the HBase servers. It can use HDFS, but from the conf files it looks like we could also use the local filesystem and do many of our tests without also setting up a full HDFS node. That means it would require fewer resources and could run on a smaller instance or even a dev's laptop - see the sketch below.
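
The change itself would just be pointing hbase.rootdir at a file:// URL - a sketch, with the path and service name assumed for a packages-based install:

# Sketch: run HBase against the local filesystem instead of HDFS.
- name: Point HBase at a local directory instead of HDFS
  hosts: hbase_nodes
  become: yes
  tasks:
    - name: Set hbase.rootdir to a file:// URL
      xml:
        path: /etc/hbase/conf/hbase-site.xml
        xpath: "/configuration/property[name='hbase.rootdir']/value"
        value: file:///var/lib/hbase

    - name: Restart the HBase master
      service:
        name: hbase-master
        state: restarted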

Since it's all yum and ansible, anyone can modify the image without needing to learn new tools.

TPTB are fine with creating an AMI that only requires updating the crypto material, but they want to be able to rebuild the AMI image from the most basic resources.

Hmm, I might be able to sell this particular story as an exception. The two use cases are 1) creating new configurations that we don't have a playbook for yet, and 2) verifying the configuration files for an arbitrary configuration. This won't be used in the automated tests.

(tbd - I know how to do it. The blocker is reaching a consensus on the best way to manage the resources so our applications don't require tweaking the configuration every time. Do we use a standalone KDC, an integrated solution like FreeIPA, etc.?)


Summary

 

Improvement, but Oozie still fails to initialize, possibly due to OOM. Investigating that possibility.

 

Details

 

I compared the output of 'rpm -qa' with the contents of the yum repo and explicitly added a few missing packages. I don't think they're related - most were Impala-related, for example - but I wanted to eliminate all variables.
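
The comparison was essentially a set difference, something like this sketch (the --repoid value is whatever the Cloudera repo is called locally; repoquery comes from yum-utils):

# Sketch of the package diff - lists repo packages absent from the host.
- name: Compare installed RPMs against the Cloudera repo
  hosts: cm_host
  tasks:
    - name: Diff installed package names against the repo's package list
      shell: |
        comm -13 <(rpm -qa --qf '%{NAME}\n' | sort -u) \
                 <(repoquery --all --qf '%{name}' --repoid=cloudera-manager | sort -u)
      args:
        executable: /bin/bash
      register: missing
      changed_when: false

    - debug:
        var: missing.stdout_lines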

 

HDFS format worked.


No Lucene errors.

 

cloudera-scm-server did not crash - I'm able to log into the CM dashboard. The warnings seem to be mostly related to the size of the EC2 instance.

 

However, there are still a few problems.

 

OOZIE

 

Oozie initialization is still failing. The stack traces are:

 

Latest:


2019-03-29 12:29:17,583 ERROR WebServerImpl:com.cloudera.server.web.cmf.TsqueryAutoCompleter: Error getting predicates
org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused (Connection refused)
    at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:104)
    at com.sun.proxy.$Proxy179.getImpalaFilterMetadata(Unknown Source)
    at com.cloudera.cmf.protocol.firehose.nozzle.TimeoutNozzleIPC.getImpalaFilterMetadata(TimeoutNozzleIPC.java:370)
    at com.cloudera.server.web.cmf.impala.components.ImpalaDao.fetchFilterMetadata(ImpalaDao.java:837)

 

Before that:

 

2019-03-29 12:29:07,748 WARN ProcessStalenessDetector-0:com.cloudera.cmf.service.config.components.ProcessStalenessDetector: Encountered exception while performing staleness check
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to find commissioned ResourceManager in good health
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)

 

2019-03-29 12:28:28,011 INFO main:org.quartz.core.QuartzScheduler: Scheduler meta-data: Quartz Scheduler (v2.0.2) 'com.cloudera.cmf.scheduler-1' with instanceId 'NON_CLUSTERED'
Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
NOT STARTED.
Currently in standby mode.
Number of jobs executed: 0
Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 1 threads.
Using job-store 'org.quartz.simpl.RAMJobStore' - which does not support persistence. and is not clustered.

 

The /var/log/hadoop-yarn directory is empty. Perhaps Oozie is failing because YARN isn't coming up?
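
That's easy to probe directly - the ResourceManager's REST API reports cluster state. A quick sketch (8088 is the stock webapp port; CM may have assigned something else):

# Sketch: ask the RM's REST API whether it is up; expects state STARTED.
- name: Check ResourceManager health
  hosts: cm_host
  tasks:
    - name: Query the YARN cluster info endpoint
      uri:
        url: "http://{{ inventory_hostname }}:8088/ws/v1/cluster/info"
        return_content: yes
      register: rm_info
      ignore_errors: yes

    - name: Report the RM state
      debug:
        msg: "{{ rm_info.json.clusterInfo.state | default('RM not reachable') }}"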

 

HDFS

 

When I start HDFS via CM I get this error:

 

There was an error when communicating with the server. See the server log file, typically /var/log/cloudera-scm-server/cloudera-scm-server.log, for more information.

 

CRASH / RESTART


It looks like the server crashed at this point. The logs show that it's trying to restart but failing, and I don't see an explanation. I'm going to attribute that to OOM until I can rule it out.
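
To rule OOM in or out, the plan is to grep the kernel log for the OOM killer and the CM log for heap errors and JVM pauses - roughly this sketch (if the heap does turn out to be the problem, the -Xmx setting lives in /etc/default/cloudera-scm-server):

# Sketch: look for OOM evidence on the CM host (CentOS 7 paths assumed).
- name: Look for OOM evidence
  hosts: cm_host
  become: yes
  tasks:
    - name: Check the kernel log for the OOM killer
      shell: dmesg | grep -i 'out of memory' || true
      register: oom_kernel
      changed_when: false

    - name: Check the CM log for heap errors and JVM pauses
      shell: grep -Ei 'OutOfMemoryError|Detected pause in JVM' /var/log/cloudera-scm-server/cloudera-scm-server.log || true
      register: oom_cm
      changed_when: false

    - debug:
        msg: "{{ oom_kernel.stdout_lines + oom_cm.stdout_lines }}"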

 
