Member since: 02-05-2015
Posts: 22
Kudos Received: 2
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 9950 | 02-23-2015 01:18 PM
 | 4784 | 02-06-2015 09:57 AM
02-26-2015
02:38 PM
1 Kudo
After checking the individual web pages for the HDFS NameNodes (port 50070) and the HBase Masters (port 60010), I have discovered that the services are actually working fine. There is an active NameNode and a standby NameNode, and the same for HBase: one active Master and one standby Master. So why is Cloudera Manager not able to detect the state of these processes? What procedure does CM use: a REST call, a connection to a port, an entry in a database? I need to know how it is done so I can keep tracking down the problem. Thanks!
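In case it helps anyone reading along, here is a minimal sketch of checking the HA state straight from each NameNode's JMX servlet, independently of Cloudera Manager. The bean and tag names are what recent Hadoop 2.x releases expose and may differ on CDH4, and the host names are simply the ones from this thread, so treat it as an assumption-laden sketch rather than the mechanism CM itself uses.

```python
# Sketch: query each NameNode's JMX servlet for its HA state, bypassing CM.
# Assumptions: the FSNamesystem bean exposes a "tag.HAState" attribute on
# this Hadoop version, and the NameNode web UI listens on port 50070.
import json
import urllib2

NAMENODES = ["aws.us-west2a.ccs-nn-1.dev.cypher",
             "aws.us-west2a.ccs-nn-2.dev.cypher"]

for host in NAMENODES:
    url = ("http://%s:50070/jmx"
           "?qry=Hadoop:service=NameNode,name=FSNamesystem" % host)
    beans = json.load(urllib2.urlopen(url)).get("beans", [])
    state = beans[0].get("tag.HAState", "unknown") if beans else "no bean returned"
    print("%s -> %s" % (host, state))
```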
02-24-2015
04:02 PM
After doing more research, I have found that the NameNode that does not report a state might actually be the active one. The logs for its FailoverController show:

2015-02-24 22:50:06,386 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2015-02-24 22:50:06,391 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2015-02-24 22:50:06,398 INFO org.apache.hadoop.ha.ActiveStandbyElector: No old node to fence
2015-02-24 22:50:06,399 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/mynameservice/ActiveBreadCrumb to indicate that the local node is the most recent active...
2015-02-24 22:50:06,403 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at aws.us-west2a.ccs-nn-2.dev.cypher/10.2.3.22:8022 active...
2015-02-24 22:50:09,679 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at aws.us-west2a.ccs-nn-2.dev.cypher/10.2.3.22:8022 to active state

If that NameNode really is active but Cloudera Manager shows no state for it, then the problem may be in the Service Monitor. This is just a hypothesis, though. Does it make sense?
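One way to compare this against what Cloudera Manager itself believes is to read the haStatus that the CM API records for each NameNode role. The sketch below assumes the cm_api Python client I already use for deployment, plus placeholder CM host, cluster, and service names.

```python
# Sketch: ask the Cloudera Manager API what HA state it has recorded for
# each NameNode role, to compare with the ZKFC logs above.
# "cm-host", "cluster" and "hdfs" are placeholders for the real names.
from cm_api.api_client import ApiResource

api = ApiResource("cm-host", username="admin", password="admin")
hdfs = api.get_cluster("cluster").get_service("hdfs")

for role in hdfs.get_roles_by_type("NAMENODE"):
    print("%s on %s: haStatus=%s health=%s" %
          (role.name, role.hostRef.hostId, role.haStatus, role.healthSummary))
```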
02-24-2015
02:39 PM
1 Kudo
A simple question. When I deploy ZooKeeper programmatically, I change the dataDir using the Cloudera API with the variable:

conf = {"dataDir": "/mnt/mydir"}

which I pass to the ZooKeeper service with:

zook_service.update_config(conf)

This indeed works and I can see the value in Cloudera Manager. But when I check the /etc/zookeeper/conf/zoo.cfg file I see:

dataDir=/var/lib/zookeeper

Which one takes precedence?
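For what it is worth, one way to see which value the service is actually carrying is to read the configuration back through the same API. This is only a sketch on top of the zook_service handle from the snippet above; the exact view semantics and attribute names depend on the cm_api version, so take them as assumptions.

```python
# Sketch: read back the ZooKeeper configuration through cm_api to see where
# dataDir is set (service level vs. role config group overrides).
# Reuses the zook_service handle from the deployment script above.
svc_config, _ = zook_service.get_config(view="full")
cfg = svc_config.get("dataDir")
if cfg is not None:
    print("service-level dataDir: value=%s default=%s" % (cfg.value, cfg.default))

# Role config groups (e.g. the SERVER group) can override the service level;
# the summary view returns only values that were explicitly set.
for group in zook_service.get_all_role_config_groups():
    print("%s override: %s" % (group.name, group.get_config().get("dataDir")))
```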
Labels:
- Apache Zookeeper
- Cloudera Manager
02-24-2015
10:27 AM
I have another piece of information that may help: HBase does exactly the same thing. There are two Masters but neither is active:

2015-02-24 18:22:33,721 INFO org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14bb91dd33c03d7 type:create cxid:0x8 zxid:0xe00002e28 txntype:-1 reqpath:n/a Error Path:/hbase/online-snapshot/acquired Error:KeeperErrorCode = NodeExists for /hbase/online-snapshot/acquired

MapReduce, on the other hand, works fine: it has an active JobTracker and the other one is on standby.
02-24-2015
10:21 AM
As I mentioned in the previous post, the error still appears, but after 5 minutes I now have an active NameNode and the following error:

NameNode summary: aws.us-west2a.ccs-nn-1.dev.cypher (Availability: Active, Health: Good), aws.us-west2a.ccs-nn-2.dev.cypher (Availability: Unknown, Health: Good). This health test is concerning because the Service Monitor did not find a standby NameNode.

ZooKeeper does not log anything interesting. I am sure this is related to the ZooKeeper lock again: one machine gets its role first (either standby or active), and the other one does not know what to do.
02-24-2015
10:06 AM
Thanks Gautam, your suggestion was indeed very useful. I went to the ZooKeeper logs and found:

2015-02-24 18:01:05,616 INFO org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x34bb91dede303c9 type:create cxid:0x1 zxid:0xe00002ce3 txntype:-1 reqpath:n/a Error Path:/hadoop-ha/mynameservice/ActiveStandbyElectorLock Error:KeeperErrorCode = NodeExists for /hadoop-ha/mynameservice/ActiveStandbyElectorLock

I therefore tried this:
1) Stop HDFS
2) Delete the node /hadoop-ha/mynameservice/ActiveStandbyElectorLock from ZooKeeper
3) Restart HDFS

But I still got the same error. It seems that some other process, or an early stage of the HDFS start process, creates that node. Any input on this?
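If I understand the HA design correctly, ActiveStandbyElectorLock is an ephemeral znode that whichever ZKFailoverController wins the election recreates on startup, which would explain why deleting it while HDFS is down changes nothing. Below is a small sketch for watching what actually exists under the nameservice path, using the kazoo client; the ensemble string is a placeholder and the path is the one from the log above.

```python
# Sketch: inspect the HA znodes directly to see which ZKFC session holds the
# elector lock after a restart. The ZooKeeper ensemble string is a placeholder.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk-1:2181,zk-2:2181,zk-3:2181")
zk.start()

path = "/hadoop-ha/mynameservice"
for child in zk.get_children(path):
    data, stat = zk.get("%s/%s" % (path, child))
    # ephemeralOwner is non-zero for ephemeral nodes such as the elector lock
    print("%s/%s ephemeralOwner=0x%x data=%r"
          % (path, child, stat.ephemeralOwner, data))

zk.stop()
```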
02-23-2015
05:47 PM
Hi, I am trying to deploy CDH4 with CM 5.3. I managed to enable HA programmatically, but unfortunately neither of the NameNodes becomes the active one. What I see most frequently is a situation where one of them is a standby NameNode and the other one does not report any state (neither active nor standby). The health test reports:

NameNode summary: aws.us-west2a.ccs-nn-1.dev.cypher (Availability: Standby, Health: Good), aws.us-west2a.ccs-nn-2.dev.cypher (Availability: Unknown, Health: Good). This health test is bad because the Service Monitor did not find an active NameNode.

Digging into the logs I see:

2015-02-24 01:14:38,197 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode aws.us-west2a.ccs-nn-2.dev.cypher/10.2.3.22:8022
2015-02-24 01:14:38,282 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Log not rolled. Name node is in safe mode.

Manually, using Cloudera Manager, I force both of the NameNodes to leave safe mode and restart the service. Then I observe one of two behaviors:

1) Same thing as before: one NameNode is in standby and the other does not report a state.
2) One of the NameNodes is active and the other one does not report a state. The logs report:

2015-02-24 01:10:41,417 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2015-02-24 01:10:41,421 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
2015-02-24 01:10:41,437 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Successfully started new epoch 2
2015-02-24 01:10:41,437 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering unfinalized segments in /mnt/data1/dfs/nn/current
2015-02-24 01:10:41,441 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest edits from old active before taking over writer role in edits logs
2015-02-24 01:10:41,463 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@4feaefc5 expecting start txid #1177
2015-02-24 01:10:41,464 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/mnt/data1/dfs/nn/current/edits_0000000000000001198-0000000000000001198' to transaction ID 1177
2015-02-24 01:10:41,473 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: There appears to be a gap in the edit log. We expected txid 1177, but got txid 1198.
2015-02-24 01:10:41,473 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Error encountered requiring NN shutdown. Shutting down immediately.

Any help is appreciated.
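Since I keep clearing safe mode by hand in Cloudera Manager, here is the small snippet I could use to script that check from a cluster gateway instead. It assumes the hdfs client and a valid HADOOP_CONF_DIR are present on the machine (on CDH4 the command may be spelled hadoop dfsadmin), so it is just a sketch.

```python
# Sketch: check and clear HDFS safe mode from a gateway host instead of
# clicking through Cloudera Manager. Assumes the hdfs CLI is on the PATH and
# HADOOP_CONF_DIR points at this cluster's client configuration.
import subprocess

print(subprocess.check_output(["hdfs", "dfsadmin", "-safemode", "get"]))
subprocess.check_call(["hdfs", "dfsadmin", "-safemode", "leave"])
print(subprocess.check_output(["hdfs", "dfsadmin", "-safemode", "get"]))
```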
Labels:
- Apache Hadoop
- Cloudera Manager
- HDFS
- Security
02-23-2015
01:18 PM
After some research I found the solution: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_ports_cdh5.html

The Cloudera Manager machine must be able to connect to all of the ports of a ZooKeeper server.
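For anyone who hits the same thing, a quick way to confirm connectivity from the Cloudera Manager host is a plain socket check against every ZooKeeper port. The host names below are placeholders; the port list is the one from my setup (2181, 3181, 4181).

```python
# Sketch: verify from the Cloudera Manager host that every ZooKeeper port is
# reachable. Host names are placeholders for the real ZooKeeper servers.
import socket

ZK_HOSTS = ["zk-1", "zk-2", "zk-3"]
ZK_PORTS = [2181, 3181, 4181]

for host in ZK_HOSTS:
    for port in ZK_PORTS:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(2)
        try:
            s.connect((host, port))
            print("%s:%d open" % (host, port))
        except socket.error as e:
            print("%s:%d blocked (%s)" % (host, port, e))
        finally:
            s.close()
```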
02-23-2015
11:07 AM
Hi, I am setting up a Hadoop cluster and I am having trouble with the ZooKeeper health tests. Our devops team has closed all ports on the machines except those used by ZooKeeper, namely 2181, 3181, and 4181. That makes the ZooKeeper canary test fail. If we open all ports on the machines, the canary test passes. I have seen two types of problems in the logs:

WARN org.apache.zookeeper.server.quorum.Learner: Unexpected exception, tries=2, connecting to myhostname/10.2.3.23:3181 java.net.ConnectException: Connection refused.

which is odd, because port 3181 is open, and:

INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.2.2.11:50561
INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x14ba997bb9d0e82 with negotiated timeout 30000 for client /10.2.2.11:50561
INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.2.2.11:50561 which had sessionid 0x14ba997bb9d0e82
WARN com.cloudera.cmf.event.publish.EventStorePublisherWithRetry: Failed to publish event: SimpleEvent{attributes={ROLE_TYPE=[SERVER], CATEGORY=[LOG_MESSAGE], ROLE=[zookeeper-zk-1], SEVERITY=[IMPORTANT], SERVICE=[zookeeper], HOST_IDS=[a_hostname], SERVICE_TYPE=[ZOOKEEPER], LOG_LEVEL=[WARN], HOSTS=[a_hostname], EVENTCODE=[EV_LOG_EVENT]}, content=Got zxid 0x200000001 expected 0x1, timestamp=1424479536132}.

The client ports (for example 50561 above) change throughout the logs, as if ZooKeeper were trying to connect to different ports each time. I don't know which error is causing the problem, if any. Can you help me with this issue? Or, if you know which ports Cloudera Manager uses to perform the ZooKeeper canary test, would you mind sharing them?
02-06-2015
09:57 AM
Thanks Gautam and Darren for your suggestions. I made things work. I am guessing that my troubles appeared because I was debugging step by step, creating and destroying the cluster and services many times. Once I started the entire process from beginning to end (after debugging each step), everything seemed to work and I can see the CDH 4 version for each host. Cheers, Javier