
Exception "Journal Storage Directory /data/d1/dfs/jn/scorpio not formatted" when enabling HA

New Contributor

Hi,

I'm installing CDH 5.11 with Cloudera Manager on CentOS 7.2, using single user mode with the default account cloudera-scm. Everything was OK until I enabled HA, which failed.

My cluster looks like this:
roc-master: NameNode
roc-secondary: SecondaryNameNode
roc-5, roc-s1, roc-s2: JournalNodes
roc-[1-6]: DataNodes

When enabling HA, I selected roc-secondary as the other NameNode.

The error message on the Enable HA page is as follows:


Failed to initialize Shared Edits Directory of NameNode NameNode (roc-master). Initialization can fail if the Shared Edits Directory is not empty. Check the stderr log for details.: Error found before invoking supervisord: Non-root agent cannot execute process as user 'hdfs'.


Then the process hung at starting the NameNode, with the following error in the NameNode log:


2017-05-26 02:27:25,996 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data/d1/dfs/nn/in_use.lock acquired by nodename 20238@roc-master
2017-05-26 02:27:26,607 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.0.47:8485, 192.168.0.69:8485, 192.168.0.51:8485]. Skipping.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.0.47:8485: Journal Storage Directory /data/d1/dfs/jn/scorpio not formatted
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkFormatted(Journal.java:472)
at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:655)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:186)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:236)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)

192.168.0.69:8485: Journal Storage Directory /data/d1/dfs/jn/scorpio not formatted
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkFormatted(Journal.java:472)
at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:655)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:186)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:236)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)

In cloudera-scm-agent.log, the relevant entries are:

[26/May/2017 02:27:08 +0000] 17557 MainThread process INFO Deactivating process 1764-hdfs-NAMENODE-format
[26/May/2017 02:27:08 +0000] 17557 MainThread util INFO Using generic audit plugin for process namenode-initialize-shared-edits
[26/May/2017 02:27:08 +0000] 17557 MainThread util INFO Creating metadata plugin for process namenode-initialize-shared-edits
[26/May/2017 02:27:08 +0000] 17557 MainThread util INFO Using specific metadata plugin for process namenode-initialize-shared-edits
[26/May/2017 02:27:08 +0000] 17557 MainThread util INFO Using generic metadata plugin for process namenode-initialize-shared-edits
[26/May/2017 02:27:08 +0000] 17557 MainThread agent INFO [1765-namenode-initialize-shared-edits] Instantiating process
[26/May/2017 02:27:08 +0000] 17557 MainThread process INFO [1765-namenode-initialize-shared-edits] Updating process: True {}
[26/May/2017 02:27:08 +0000] 17557 MainThread agent ERROR Failed to activate {u'refresh_files': [], u'config_generation': 0,u'auto_restart': False, u'running': True, u'required_tags': [u'cdh'], u'one_off': True, u'special_file_info': [], u'group': u'hdfs', u'id': 1765, u'status_links': {}, u'name': u'namenode-initialize-shared-edits', u'extra_groups': [], u'run_generation': 1, u'start_timeout_seconds': None, u'environment': {u'HADOOP_AUDIT_LOGGER': u'INFO,RFAAUDIT', u'CM_ADD_TO_CP_DIRS': u'navigator/cdh57', u'HADOOP_NAMENODE_OPTS':u'-Xms4294967296 -Xmx4294967296 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hdfs_hdfs-NAMENODE-0f2bd0a3cf1534ca0c41e5c9cb5266fa_pid{{PID}}.hprof -XX:OnOutOfMemoryError={{AGENT_COMMON_DIR}}/killparent.sh', u'HADOOP_SECURITY_LOGGER': u'INFO,RFAS', u'HADOOP_CREDSTORE_PASSWORD': u'ifzbuyq7pv4key60lzjxjszy', u'HADOOP_LOG_DIR': u'/var/log/hadoop-hdfs', u'HADOOP_ROOT_LOGGER': u'INFO,console', u'HADOOP_LOGFILE': u'hadoop-cmf-hdfs-NAMENODE-roc-master.log.out', u'CDH_VERSION': u'5'}, u'optional_tags': [u'cdh-plugin', u'hdfs-plugin'], u'program': u'hdfs/hdfs.sh', u'arguments': [u'initializeSharedEdits'], u'parcels': {u'SPARK2': u'2.1.0.cloudera1-1.cdh5.7.0.p0.120904', u'CDH': u'5.11.0-1.cdh5.11.0.p0.34', u'ACCUMULO': u'1.7.2-5.5.0.ACCUMULO5.5.0.p0.8', u'KAFKA': u'2.1.1-1.2.1.1.p0.18'}, u'resources': [], u'user': u'hdfs'}
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.11.0-py2.7.egg/cmf/agent.py", line 1702, in handle_heartbeat_processes
new_process.update_heartbeat(raw, True)
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.11.0-py2.7.egg/cmf/process.py", line 304, in update_heartbeat
self.fs_update()
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.11.0-py2.7.egg/cmf/process.py", line 426, in fs_update
raise Exception("Non-root agent cannot execute process as user '%s'" % user)
Exception: Non-root agent cannot execute process as user 'hdfs'


It seems like a user permission problem, so I checked the agent process:

clouder+ 17557 1 1 02:20 ? 00:07:14 python2.7 /usr/lib64/cmf/agent/build/env/bin/cmf-agent --package_dir /usr/lib64/cmf/service --agent_dir /var/run/cloudera-scm-agent --lib_dir /var/lib/cloudera-scm-agent --logfile /var/log/cloudera-scm-agent/cloudera-scm-agent.log --daemon --comm_name cmf-agent --pidfile /var/run/cloudera-scm-agent/cloudera-scm-agent.pid

The agent is running as the user cloudera-scm, and in /etc/sudoers I've added the following line:
%cloudera-scm ALL=(ALL) NOPASSWD: ALL

But the log shows that when initializing the shared edits directory, the command was run as another user, 'hdfs'. I checked the code in /usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.11.0-py2.7.egg/cmf/process.py: it compares the agent's own user (here cloudera-scm) against the user the shared edits initialization should run as, and since cloudera-scm is neither 'hdfs' nor 'root', it raises "Exception: Non-root agent cannot execute process as user 'hdfs'".
I also checked the shared edits folder /data/d1/dfs/jn, and it is indeed empty.
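
To make the failing check concrete, here is a minimal sketch of the logic, reconstructed from the traceback above and the process.py lines quoted in the reply below. The helper name check_process_user and the exact condition are my assumptions, not the real cmf/process.py code:

import getpass

def check_process_user(raw):
    # raw is the process descriptor from the heartbeat; for the failing step
    # it carries u'user': u'hdfs' (see the agent log above).
    user = raw["user"]
    # The account the agent itself runs as: cloudera-scm in single user mode.
    agent_user = getpass.getuser()
    # A non-root agent may only launch processes as its own user.
    if agent_user != "root" and agent_user != user:
        raise Exception("Non-root agent cannot execute process as user '%s'" % user)

# Run as cloudera-scm, this reproduces the exception from the log:
check_process_user({"user": "hdfs", "group": "hdfs"})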


What I don't understand is: since I'm using single user mode (cloudera-scm), why does it switch to 'hdfs' to execute the initialization?
How can I fix this so that HA can be enabled in my case?


Thanks,

MH

2 REPLIES

New Contributor

Found a way to bypass the user permission check:

Modify the file "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.11.0-py2.7.egg/cmf/process.py". At lines 422-423 you can see:

user = self.raw["user"]
group = self.raw["group"]

Add the following two lines after the group assignment:

user = <user_of_the_single_user_mode> if user == "hdfs" else user
group = <group_of_the_single_user_mode> if group == "hdfs" else group
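
For clarity, with the single user mode account from the question substituted in (cloudera-scm here; use whatever account your single user mode runs as), the patched region would then read:

user = self.raw["user"]
group = self.raw["group"]
# Added: remap 'hdfs' to the single user mode account so the agent's user check passes.
user = "cloudera-scm" if user == "hdfs" else user
group = "cloudera-scm" if group == "hdfs" else group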

Then back up process.pyc and process.pyo in the same path and remove them, so the agent recompiles from the patched process.py.

Restart cloudera-scm-agent, then enable HA; it succeeded. The shared edits directory on the JournalNodes was formatted correctly.
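
If you want to double-check on a JournalNode that the directory really got formatted, a quick look at the storage directory works; this assumes the usual HDFS storage layout, where a formatted directory contains current/VERSION:

import os
# /data/d1/dfs/jn/scorpio is the JournalNode storage directory from the error above.
print(os.listdir("/data/d1/dfs/jn/scorpio"))  # expect a 'current' subdirectory with a VERSION file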

Then restore the original process.py and restart the agent.

But I still want to know why it uses "hdfs" to execute the initialization here.

Thanks,

MH

New Contributor

Dear mfeng, your advice saved my day!

Much appreciated.

Thanks a lot!