Created on 05-25-2017 11:29 PM - edited 05-25-2017 11:41 PM
Hi,
I'm installing CDH 5.11 with Cloudera Manager on CentOS 7.2, using single user mode with the default account cloudera-scm. Everything went fine until I enabled HA, which failed.
My cluster is laid out like this:
roc-master: name node
roc-secondary: secondary name node
roc-5, roc-s1, roc-s2: journal node
roc-[1-6]: data node
When I enabled HA, I selected roc-secondary as the second name node.
The error message shown on the Enable HA page is as follows:
Failed to initialize Shared Edits Directory of NameNode NameNode (roc-master). Initialization can fail if the Shared Edits Directory is not empty. Check the stderr log for details.: Error found before invoking supervisord: Non-root agent cannot execute process as user 'hdfs'.
Then the progress hangs at starting the name node, with the following error message:
2017-05-26 02:27:25,996 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data/d1/dfs/nn/in_use.lock acquired by nodename 20238@roc-master
2017-05-26 02:27:26,607 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.0.47:8485, 192.168.0.69:8485, 192.168.0.51:8485]. Skipping.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.0.47:8485: Journal Storage Directory /data/d1/dfs/jn/scorpio not formatted
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkFormatted(Journal.java:472)
at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:655)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:186)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:236)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
192.168.0.69:8485: Journal Storage Directory /data/d1/dfs/jn/scorpio not formatted
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkFormatted(Journal.java:472)
at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:655)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:186)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:236)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
In cloudera-scm-agent.log, the corresponding entries are as follows:
[26/May/2017 02:27:08 +0000] 17557 MainThread process INFO Deactivating process 1764-hdfs-NAMENODE-format
[26/May/2017 02:27:08 +0000] 17557 MainThread util INFO Using generic audit plugin for process namenode-initialize-shared-edits
[26/May/2017 02:27:08 +0000] 17557 MainThread util INFO Creating metadata plugin for process namenode-initialize-shared-edits
[26/May/2017 02:27:08 +0000] 17557 MainThread util INFO Using specific metadata plugin for process namenode-initialize-shared-edits
[26/May/2017 02:27:08 +0000] 17557 MainThread util INFO Using generic metadata plugin for process namenode-initialize-shared-edits
[26/May/2017 02:27:08 +0000] 17557 MainThread agent INFO [1765-namenode-initialize-shared-edits] Instantiating process
[26/May/2017 02:27:08 +0000] 17557 MainThread process INFO [1765-namenode-initialize-shared-edits] Updating process: True {}
[26/May/2017 02:27:08 +0000] 17557 MainThread agent ERROR Failed to activate {u'refresh_files': [], u'config_generation': 0,u'auto_restart': False, u'running': True, u'required_tags': [u'cdh'], u'one_off': True, u'special_file_info': [], u'group': u'hdfs', u'id': 1765, u'status_links': {}, u'name': u'namenode-initialize-shared-edits', u'extra_groups': [], u'run_generation': 1, u'start_timeout_seconds': None, u'environment': {u'HADOOP_AUDIT_LOGGER': u'INFO,RFAAUDIT', u'CM_ADD_TO_CP_DIRS': u'navigator/cdh57', u'HADOOP_NAMENODE_OPTS':u'-Xms4294967296 -Xmx4294967296 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hdfs_hdfs-NAMENODE-0f2bd0a3cf1534ca0c41e5c9cb5266fa_pid{{PID}}.hprof -XX:OnOutOfMemoryError={{AGENT_COMMON_DIR}}/killparent.sh', u'HADOOP_SECURITY_LOGGER': u'INFO,RFAS', u'HADOOP_CREDSTORE_PASSWORD': u'ifzbuyq7pv4key60lzjxjszy', u'HADOOP_LOG_DIR': u'/var/log/hadoop-hdfs', u'HADOOP_ROOT_LOGGER': u'INFO,console', u'HADOOP_LOGFILE': u'hadoop-cmf-hdfs-NAMENODE-roc-master.log.out', u'CDH_VERSION': u'5'}, u'optional_tags': [u'cdh-plugin', u'hdfs-plugin'], u'program': u'hdfs/hdfs.sh', u'arguments': [u'initializeSharedEdits'], u'parcels': {u'SPARK2': u'2.1.0.cloudera1-1.cdh5.7.0.p0.120904', u'CDH': u'5.11.0-1.cdh5.11.0.p0.34', u'ACCUMULO': u'1.7.2-5.5.0.ACCUMULO5.5.0.p0.8', u'KAFKA': u'2.1.1-1.2.1.1.p0.18'}, u'resources': [], u'user': u'hdfs'}
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.11.0-py2.7.egg/cmf/agent.py", line 1702, in handle_heartbeat_p
rocesses
new_process.update_heartbeat(raw, True)
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.11.0-py2.7.egg/cmf/process.py", line 304, in update_heartbeat
self.fs_update()
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.11.0-py2.7.egg/cmf/process.py", line 426, in fs_update
raise Exception("Non-root agent cannot execute process as user '%s'" % user)
Exception: Non-root agent cannot execute process as user 'hdfs'
It seems like a user permission problem, so I checked the agent process:
clouder+ 17557 1 1 02:20 ? 00:07:14 python2.7 /usr/lib64/cmf/agent/build/env/bin/cmf-agent --package_dir /usr/lib64/cmf/service --agent_dir /var/run/cloudera-scm-agent --lib_dir /var/lib/cloudera-scm-agent --logfile /var/log/cloudera-scm-agent/cloudera-scm-agent.log --daemon --comm_name cmf-agent --pidfile /var/run/cloudera-scm-agent/cloudera-scm-agent.pid
The agent is running as the user cloudera-scm, and in /etc/sudoers I've specified the following line:
%cloudera-scm ALL=(ALL) NOPASSWD: ALL
But the log shows that when initializing the shared edits, it was trying to run as another user, 'hdfs'. I checked the code in /usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.11.0-py2.7.egg/cmf/process.py; it appears to compare the agent's own user (here cloudera-scm) with the user the shared edits initialization is supposed to run as, finds that the agent is neither 'hdfs' nor 'root', and raises the exception "Non-root agent cannot execute process as user 'hdfs'".
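To make the failing logic concrete, here is a small paraphrase of that check as a standalone function. The real code lives in fs_update() in cmf/process.py and works on self.raw, so the function name and structure below are my own illustration based on the traceback, not the actual Cloudera Manager source:

import os
import pwd

# Paraphrase of the permission check seen in the traceback above; names are
# illustrative, the real check lives in cmf/process.py (fs_update).
def check_process_user(requested_user):
    agent_user = pwd.getpwuid(os.geteuid()).pw_name  # 'cloudera-scm' in single user mode
    if agent_user != "root" and requested_user != agent_user:
        raise Exception("Non-root agent cannot execute process as user '%s'" % requested_user)

# The namenode-initialize-shared-edits process descriptor requests user 'hdfs'
# (see the u'user': u'hdfs' field in the agent ERROR line above), so the
# following call would raise exactly the exception from the log:
# check_process_user("hdfs")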
I also checked the shared edits folder /data/d1/dfs/jn; it is indeed empty.
What I don't understand is this: since I'm using single user mode (cloudera-scm), why does the agent switch to 'hdfs' to execute the initialization?
How can I fix this problem so that HA can be enabled in my case?
Thanks,
MH
Created 05-26-2017 09:58 PM
I found a way to bypass the user permission check:
Modify the file "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.11.0-py2.7.egg/cmf/process.py". Around lines 422-423 you can see:
user = self.raw["user"]
group = self.raw["group"]
Add the following two lines right after the group assignment (a consolidated view of the patched snippet is shown further below):
user = "<user_of_the_single_user_mode>" if user == "hdfs" else user
group = "<group_of_the_single_user_mode>" if group == "hdfs" else group
Then back up the process.pyc and process.pyo files in the same path and remove them.
Restart cloudera-scm-agent and enable HA again; this time it succeeded, and the shared edits directory on the journal nodes was formatted correctly.
Afterwards, restore the original process.py and restart the agent.
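For reference, the patched section of fs_update() ends up looking roughly like the sketch below. I've substituted cloudera-scm as the single user account because that's what my cluster uses; replace it with whatever account your agent actually runs as.

# Sketch of the patched lines around 422-423 in cmf/process.py.
# 'cloudera-scm' is my single user account; adjust it for your cluster.
user = self.raw["user"]
group = self.raw["group"]
# Workaround: in single user mode the agent cannot switch to 'hdfs', so map
# the requested user/group back to the account the agent itself runs as.
user = "cloudera-scm" if user == "hdfs" else user
group = "cloudera-scm" if group == "hdfs" else group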
But I'd still like to know why it uses 'hdfs' to execute the initialization here.
Thanks,
MH