Created on 03-28-2019 11:09 AM - edited 09-16-2022 07:16 AM
Short version
cloudera-scm-server is crashing during the final configuration. I reach Add Cluster - Configuration - Command Details and fail on Formatting the name directories of the current NameNode and Creating Oozie Database Tables.
It's a hard failure - cloudera-scm-server crashes.
I'm probably missing some small detail in my setup, but I haven't found anything yet in either this community or the usual online resources.
cloudera-scm-server.log shows:
2019-03-28 17:04:09,030 INFO SearchRepositoryManager-0:com.cloudera.server.web.cmf.search.components.SearchRepositoryManager: Finished constructing repo:2019-03-28T17:04:09.030Z
2019-03-28 17:04:09,734 WARN scm-web-92:com.cloudera.server.cmf.descriptor.components.DescriptorFactory: Could not generate client configs for service: YARN (MR2 Included)
Caused by: com.cloudera.cmf.service.config.ConfigGenException: Unable to generate config of 'mapreduce.application.framework.path'
and
2019-03-28 17:04:00,173 INFO WebServerImpl:com.cloudera.server.web.cmf.search.LuceneSearchRepository: Directory /var/lib/cloudera-scm-server/search/lucene.en..1553792502675 does not seem to be a Lucene index (no segments.gen).
2019-03-28 17:04:00,173 WARN WebServerImpl:com.cloudera.server.web.cmf.search.components.SearchRepositoryManager: Failed to initialize search dir, deleting it: /var/lib/cloudera-scm-server/search/lucene.en..1553792575457
org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.MMapDirectory@/var/lib/cloudera-scm-server/search/lucene.en..1553792575457 lockFactory=org.apache.lucene.store.NativeFSLockFactory@7d930261: files: [write.lock, _0.fdt, _0.fdx]
System Check Failures
There are a few system check failures. I think I included all of the required RPMs, but I'll double-check. For maintainability I've tried to install only the top-level RPMs and rely on package dependencies to pull in everything they need; that avoids breakage between releases if the names of those dependencies change.
Missing resources:
The missing mr1 is especially suspicious, since one of the failure messages refers to YARN.
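For context, the top-level-only approach amounts to a yum task like the one below, with yum resolving the dependency chain. The package names here are examples of top-level RPMs, not my exact list.
- name: Install top-level Cloudera Manager packages (example names; yum pulls in their dependencies)
  yum:
    name:
      - cloudera-manager-daemons
      - cloudera-manager-server
      - cloudera-manager-agent
    state: present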
Background / Ansible-based Installation [CentOS 7]
We need Ansible scripts that can quickly bring up and tear down specific configurations for testing. The idea is that our "big" tests spin up a dedicated instance, run the integration tests against it, and then tear the instance down. We want a fresh instance every time so the tests are well isolated, and the bean counters are happy because we aren't paying for clusters we aren't using. Ansible fits into our framework nicely, and there's been a lot of pressure to automate the process 100% instead of relying on manual installation/configuration of an AMI containing a pre-configured system.
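A rough sketch of the spin-up/tear-down idea, assuming the ec2_instance module; the AMI, instance type, region, and tag below are placeholders rather than our actual values.
- name: Spin up a throwaway instance for the integration tests (placeholder values)
  ec2_instance:
    name: cdh-integration-test
    image_id: ami-xxxxxxxx
    instance_type: m5.2xlarge
    key_name: test-key
    region: us-east-1
    tags:
      purpose: integration-test
    state: running

- name: Tear the instance back down once the tests finish
  ec2_instance:
    region: us-east-1
    filters:
      tag:purpose: integration-test
    state: absent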
I'm most of the way there - I have the Ansible plays and roles to:
I can log into the manager, select the managed node (my new EC2 instance), select 'packages', and work my way through the installation process to Add Cluster - Configuration - Command Details. Of the 7 steps, 5 complete successfully. The two that fail are Formatting the name directories of the current NameNode and Creating Oozie Database Tables.
My standalone Ansible playbook that creates a standard single-node HDFS cluster works fine, so I don't think I'm missing anything required to format the name directory, although it's possible I commented out a critical step when converting from the standalone playbook to this one.
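For what it's worth, formatting the name directory outside of CM boils down to something like the task below. This is only a sketch: the metadata path /dfs/nn is the CM default and an assumption here, and formatting is only safe on a brand-new cluster.
- name: Format the NameNode metadata directory (destructive; new clusters only)
  command: hdfs namenode -format -nonInteractive
  become: yes
  become_user: hdfs
  args:
    creates: /dfs/nn/current/VERSION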
Created 03-29-2019 05:30 AM
AWS EC2
Created 03-29-2019 05:56 AM
Summary
Improvement, but Oozie still fails to initialize, possibly due to OOM. I'm investigating that possibility.
Details
I compared the output of 'rpm -qa' with the contents of the yum repo and explicitly added a few missing packages. I don't think they're related to the failures - most were Impala-related, for example - but I wanted to eliminate variables.
HDFS format worked.
No lucene errors.
Cloudera-scm-server did not crash - I'm able to log into the CM dashboard. The warnings seem to be mostly related to the size of the EC2 instance.
However, there are still a few problems.
OOZIE
However, Oozie initialization is still failing. The stack traces are:
Latest:
2019-03-29 12:29:17,583 ERROR WebServerImpl:com.cloudera.server.web.cmf.TsqueryAutoCompleter: Error getting predicates
org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:104)
at com.sun.proxy.$Proxy179.getImpalaFilterMetadata(Unknown Source)
at com.cloudera.cmf.protocol.firehose.nozzle.TimeoutNozzleIPC.getImpalaFilterMetadata(TimeoutNozzleIPC.java:370)
at com.cloudera.server.web.cmf.impala.components.ImpalaDao.fetchFilterMetadata(ImpalaDao.java:837)
Before that:
2019-03-29 12:29:07,748 WARN ProcessStalenessDetector-0:com.cloudera.cmf.service.config.components.ProcessStalenessDetector: Encountered exception while performing staleness check
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to find commissioned ResourceManager in good health
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
2019-03-29 12:28:28,011 INFO main:org.quartz.core.QuartzScheduler: Scheduler meta-data: Quartz Scheduler (v2.0.2) 'com.cloudera.cmf.scheduler-1' with instanceId 'NON_CLUSTERED'
Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
NOT STARTED.
Currently in standby mode.
Number of jobs executed: 0
Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 1 threads.
Using job-store 'org.quartz.simpl.RAMJobStore' - which does not support persistence. and is not clustered.
The /var/log/hadoop-yarn directory is empty. Perhaps Oozie is failing because YARN isn't coming up?
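One way to confirm whether YARN is actually up before Oozie initializes is to poll the ResourceManager web UI; a sketch, assuming the default port 8088 and a ResourceManager on the same host:
- name: Wait for the YARN ResourceManager REST API to respond
  uri:
    url: http://localhost:8088/ws/v1/cluster/info
    status_code: 200
  register: rm_check
  retries: 10
  delay: 15
  until: rm_check.status == 200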
HDFS
When I start HDFS via CM I get this error:
There was an error when communicating with the server. See the server log file, typically /var/log/cloudera-scm-server/cloudera-scm-server.log, for more information.
CRASH / RESTART
It looks like the server crashed at this point. The logs show it trying to restart and failing, but I don't see an explanation. I'm going to attribute that to OOM until I can rule it out.
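To rule OOM in or out, the kernel log is the quickest check; a sketch of the kind of task (or ad-hoc command on the host) I have in mind:
- name: Look for OOM-killer activity in the kernel log
  shell: dmesg -T | grep -iE 'out of memory|oom-killer' || true
  register: oom_check
  changed_when: false

- name: Show whatever the OOM killer logged
  debug:
    var: oom_check.stdout_lines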
Created 03-29-2019 06:22 AM
If the server crashed with OOM, it may not be provisioned with enough resources (RAM). Are you installing everything on a single node? What are the hardware resources for this host?
Please make sure the YARN service is started and in good health before starting Oozie.
The HDFS startup error indicates a communication problem with the CM server; please review the CM server logs for any issues during this time, including 'Detected pause in JVM' messages.
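For anyone following along, a quick way to scan for those pauses; a sketch using the default log path mentioned above:
- name: Scan the CM server log for JVM pause warnings
  shell: grep -i 'Detected pause in JVM' /var/log/cloudera-scm-server/cloudera-scm-server.log || true
  register: jvm_pauses
  changed_when: false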