Member since: 04-08-2016
Posts: 29
Kudos Received: 2
Solutions: 5

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1593 | 12-08-2017 02:19 PM |
| | 7706 | 01-04-2017 02:58 PM |
| | 4928 | 12-08-2016 07:14 AM |
| | 3525 | 12-08-2016 07:12 AM |
| | 3272 | 06-14-2016 07:38 AM |
05-21-2018
07:21 AM
Yes! There was a snapshot. Thank you!
05-18-2018
03:12 PM
1 Kudo
CDH 5.13.1, Red Hat 6.9

We wish to change the replication factor from the default of 3 copies to 2 on one particular folder in HDFS. After running this on one cluster:

$ hdfs dfs -setrep -R 2 /backups

and then doing a

$ hdfs dfs -du /

we saw that it freed the blocks very quickly, and the output of fsck shows no over-replicated blocks:

Status: HEALTHY
Total size: 149514016589 B
Total dirs: 27440
Total files: 128746
Total symlinks: 0
Total blocks (validated): 126355 (avg. block size 1183285 B)
Minimally replicated blocks: 126355 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.3367577
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1

However, on a bigger test system we ran the same command, and even a day later there is still no change. $ hdfs fsck / still shows over-replicated blocks:

Status: HEALTHY
Total size: 56614841380 B
Total dirs: 7222
Total files: 113731
Total symlinks: 0
Total blocks (validated): 110143 (avg. block size 514012 B)
Minimally replicated blocks: 110143 (100.0 %)
Over-replicated blocks: 37439 (33.991264 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.9921465
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 8
Number of racks: 1

The number of over-replicated blocks has dropped slightly but seems stuck at 37439. I've manually restarted each DataNode service, and later restarted the entire cluster. It is still stuck at 37439.

I found this comment from Harsh J:

| Then monitor the over-replicated blocks in Cloudera Manager via the below chart tsquery:
| SELECT excess_blocks WHERE roleType = NAMENODE
| This should show a spike and then begin a slow but steady drop back to zero over time, which you can monitor.

but when I run this query it reports "excess_blocks" is 0.

$ hdfs dfs -du /
22987202359 69376013863 /backups

shows 3 copies still. How do we get this space cleared? A rebalance did nothing. Thanks.
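A couple of hedged follow-up checks (standard HDFS CLI only; the fsck output format varies a little between releases) to confirm whether the namespace really records replication factor 2 under /backups and how many blocks fsck still counts with three replicas:

$ hdfs dfs -stat '%r %n' /backups/*                       # replication factor recorded per entry in the namespace
$ hdfs fsck /backups -files -blocks | grep -c ' repl=3'   # blocks under /backups that fsck still reports with 3 replicas
$ hdfs dfs -du -s /backups                                # logical size vs. space consumed across all replicas

If the namespace already shows 2 but the consumed-space column stays near 3x, the excess replicas are most likely still sitting on the DataNodes waiting for the NameNode to schedule their invalidation.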
12-08-2017
02:19 PM
This problem is HDFS-9530, which has a fix in CDH 5.9.0. Bouncing the DataNode instances clears the issue manually until we can upgrade.
12-07-2017
09:53 AM
Seeing this issue on all DataNodes. Example for one node:

hadoop has its own partition
bash 'du -h --max-depth=1' in the hadoop partition reports the 'dn' directory is consuming 207G
bash 'df -h' reports the hadoop partition as size 296G, used 208G, Avail 73G, Use% 75%

Configured Capacity: 314825441690 (293.20 GB) -- good
DFS Used: 221825508284 (206.59 GB) -- good
Non DFS Used: 55394479116 (51.59 GB) -- ??? bash says 1G used outside of the 'dn' directory in the partition
DFS Remaining: 37605454290 (35.02 GB) -- ??? bash says 73G free
DFS Used%: 70.46%
DFS Remaining%: 11.94%

fsck reports healthy
redhat 6.9
5.8.2-1.cdh5.8.2.p0.3
dfs.datanode.du.reserved == 1.96GiB

How do we troubleshoot this? Thanks.

hdfs dfsadmin -report
Configured Capacity: 1574127208450 (1.43 TB)
Present Capacity: 1277963063885 (1.16 TB)
DFS Remaining: 410632669242 (382.43 GB)
DFS Used: 867330394643 (807.76 GB)
DFS Used%: 67.87%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

hdfs fsck /
Total size: 281353009325 B
Total dirs: 5236
Total files: 501295
Total symlinks: 0 (Files currently being written: 37)
Total blocks (validated): 501272 (avg. block size 561278 B)
Minimally replicated blocks: 501272 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 5
Number of racks: 1
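A hedged way to compare what the DataNode reports against what the filesystem itself sees, assuming the block directory lives under a mount such as /hadoop/dn (adjust the paths to your layout):

$ du -s -B1 /hadoop/dn                           # bytes on disk under the DataNode block directory
$ df -B1 /hadoop                                 # raw partition numbers in bytes, for an apples-to-apples comparison
$ hdfs dfsadmin -report | grep -A 10 '^Name:'    # per-DataNode Configured Capacity / DFS Used / Non DFS Used / Remaining

Note that "Non DFS Used" is a derived figure (roughly configured capacity minus DFS used minus remaining), so large mismatches often trace back to dfs.datanode.du.reserved accounting rather than to real non-HDFS files on the partition.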
12-06-2017
01:47 PM
This slow cluster restart continues to be a problem for us, especially after a Red Hat reboot (to apply security updates). In the logs it initially says there are 0 DataNodes, then eventually 1, then eventually 2, and so on (we only have 5 DataNodes in this small cluster). In the meanwhile it goes crazy with 'under replicated blocks' -- in reality there were no under-replicated blocks prior to the cluster stop -- but it appears that since the NameNode does not immediately find them at startup, it goes into a big block-recovery activity, possibly hindering recovery/startup times.

1) Why is it slow to discover restarted DataNodes, and
2) is there a way to delay under-replicated block recovery after a restart (for, say, 15 minutes)?

This appears to have a follow-on problem: once the cluster has finally recovered and is considered green status, some Impala daemons are non-functional. In this case:

3) it does not appear there is a canary test for every Impala node, and
4) Impala daemons which are dead are marked green/healthy, but are unable to respond to queries.

Our work-around is, after every restart, to bounce Impala once more at the end. Then all is well.

All of this behavior happens with Parcels 5.8.2 as well as 5.12.1 (slow starts, and Impala health not accurate). Red Hat 6.9 with the latest security updates.
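For question 2, one hedged option for planned maintenance is to hold HDFS in safe mode manually so no replication work is scheduled while DataNode hosts are rebooted. These are standard hdfs dfsadmin commands, but note that a manually entered safe mode does not survive a NameNode restart, so for a full-cluster restart you would re-enter it right after the NameNodes come back:

$ hdfs dfsadmin -safemode enter                      # no block replication or invalidation is scheduled while in safe mode
# ... reboot / patch the DataNode hosts ...
$ hdfs dfsadmin -report | grep 'Live datanodes'      # wait here until all 5 DataNodes have re-registered
$ hdfs dfsadmin -safemode leave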
01-06-2017
02:37 PM
RedHat Linux 6.7, CDH 5.8.2 from parcels, with Cloudera Manager 5.8.2.

We have built a 3-node test cluster, including high availability for the NameNode, Hive Metastore, and HiveServer, as follows on the "Hosts > Roles" page:

Cluster 1 - CDH 5
| Hosts | Count | Roles |
|---|---|---|
| vm-54 | 1 | M RS B DN JN G ICS ID ISS AP ES HM SM JHS NM RM S |
| vm-[55-56] | 2 | RS DN FC JN NN G HMS HS2 ID NM S |

This table is grouped by hosts having the same roles assigned to them. vm-55 and vm-56 are essentially identical, with Cloudera Manager running on vm-54.

We wanted to know what happens if a node becomes corrupted and cannot be recovered. We powered down vm-55, cloned it to a new vm-57, and deleted all data directories for hadoop, zookeeper, etc. on vm-57. We started vm-57, and this node took on all roles from vm-55 right away. We were expecting to have to manually add roles to the new vm-57, but it had already taken on the roles magically. Was this because we missed some data files, or is this the default behavior for Cloudera Manager when a new agent is added to the cluster in place of a missing server? Where can we find documentation on how to deal with replacement of a corrupted critical server in an HA configuration (such as one running FC, JN, NN, S, etc.)? Thank you.
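One plausible explanation, worth verifying since the path below is just the usual Cloudera Manager 5 agent location: the agent identifies its host to Cloudera Manager with a UUID stored on disk, and a cloned VM carries the original host's UUID, so CM can treat the clone as the old host and hand its roles straight back.

$ cat /var/lib/cloudera-scm-agent/uuid    # the agent's host identity; a clone inherits this from the source VM
# sketch of making a clone register as a brand-new host instead (test on a lab host first):
$ service cloudera-scm-agent stop
$ rm /var/lib/cloudera-scm-agent/uuid     # a fresh uuid is generated on the next agent start
$ service cloudera-scm-agent start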
01-06-2017
09:35 AM
The root cause seems to be that there are two 'textarea' boxes for the parameter 'java configuration options for nodemanager', and if these do not contain the same value, then the NodeManager will not start. These are the two boxes:

NODEMANAGER Imported From: TaskTracker (1)
NODEMANAGER Imported From: TaskTracker Default Group

Shouldn't Cloudera Manager disallow this condition, or protect the user from it happening in the first place? Thanks. (Our guess is that the JVM ends up receiving an empty string for this parameter when the two values do not match.)
01-04-2017
02:58 PM
Found two values for the search "java configuration options for nodemanager". Copy/pasted to make them the same (we had added JMX parameters). This seems to have fixed it; needs verification.
01-04-2017
02:40 PM
Thanks for your reply. We have 5 nodes configured to run NodeManager; 1 works but 4 fail. If "Java Configuration Options for NodeManager" were an empty string, then none should start, correct? It's not empty. If you have other ideas we would appreciate them. Thanks.
12-14-2016
09:57 AM
Thanks for the quick response.

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.7 (Santiago)

# uname -a
Linux hostname 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Cloudera Manager version:
Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)

CDH Parcel version:
CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated

# cat /var/log/hadoop-yarn/hadoop-cmf-yarn-NODEMANAGER-hostname.log.out
http://pastebin.com/iu4hR03Q
==> we assume the SIGTERM is caused by the Cloudera Agent (perhaps giving up waiting on some indication that the NM is running properly)

# cat /var/log/cloudera-scm-agent/cloudera-scm-agent.out
http://pastebin.com/8StbBsj4
==> there are errors in here ('ValueError: dictionary update sequence element #25 has length 1; 2 is required' and 'MainThread agent ERROR Failed to activate')

There is no 'logs' directory within the process/*NODEMANAGER* directory (so no stderr to be found):

# find /var/run/cloudera-scm-agent/process | grep 'logs\|NODEMANAGER'
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs/stderr.log
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs/stdout.log
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs/stderr.log
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs/stdout.log
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/cloudera-monitor.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/cloudera-stack-monitor.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/container-executor.cfg
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/core-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/event-filter-rules.json
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hadoop-metrics2.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hadoop-policy.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hdfs-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/http-auth-signature-secret
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/log4j.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/mapred-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/redaction-rules.json
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/ssl-client.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/ssl-server.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/topology.map
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/topology.py
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/yarn.keytab
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/yarn-site.xml
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs/stderr.log
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs/stdout.log

So we are using Cloudera Manager. When restarting the cluster, Yarn fails to start but all other services start OK.

When we drill into the Yarn 'instances' we find:

JobHistory Server (running)
ResourceManager (Active) (running)
ResourceManager (Standby) (running)
NodeManager (running)
NodeManager (stopped)
NodeManager (stopped)
NodeManager (stopped)
NodeManager (stopped)

with Status: 4 "Down" and 4 "Good Health".

If we select one of the stopped NodeManager instances and attempt to manually start it, the above pastebin logs are what we see. There is no log directory created, and thus no stderr. Cloudera Manager waits for it to start but eventually marks it failed.

We are planning to deploy HA to production, and this is our test run on the QA lab system. This failure is now blocking us from proceeding with our production HA deployment. Frankly, we don't even use Yarn (or MapReduce); at this point we only use HDFS and Impala. Yarn seems to be a dependency for Hive and Impala. If we are not using Yarn/MR and we can decommission these 4 failed NM instances, can the system run with a single HA pair of ResourceManagers and just one instance of NodeManager? (It would at least make Cloudera Manager happy, with green status and no failures upon cluster restarts.) Thanks.
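A few hedged places to dig for why the agent never materializes a logs/ directory for the NodeManager; the paths below are the standard CM 5 locations and may differ on your install:

$ grep -B2 -A10 'Failed to activate' /var/log/cloudera-scm-agent/cloudera-scm-agent.log   # context around the agent-side activation failure
$ ls -ld /var/run/cloudera-scm-agent/process/*NODEMANAGER*                                # one numbered directory per start attempt
$ service cloudera-scm-agent status                                                       # confirm the agent itself is healthy on the failing hosts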
12-13-2016
06:49 PM
# uname -a
Linux hostname 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)
CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated

While reconfiguring for high availability, 4 of 5 NodeManagers now won't start. There is no stderr file.

http://pastebin.com/iu4hR03Q
http://pastebin.com/8StbBsj4

I've tried removing the roles, then re-adding the roles. Deleted all files in:

/var/lib/hadoop-yarn/yarn-nm-recovery/
/var/yarn/

Confirmed the owners matched the working node. No luck so far.
12-08-2016
07:14 AM
Increasing the Catalog Server heap resolved this problem. However, there should be a JIRA opened against the Impala daemon: if the Catalog Server misbehaves, the Impala daemon should not have queries stuck 'in flight' forever while consuming one CPU at 100%. (It consumes an entire CPU for every stuck query.)
12-08-2016
07:12 AM
Increasing the Catalog Server heap resolved this problem.
12-05-2016
12:22 PM
Found an out-of-memory heap error in catalogd, which is weird, as we've been restarting the individual Impala daemon with the stuck 'in flight' queries to recover from this problem, not the Catalog Server. Does the Cloudera Agent restart it automatically? The heap has been increased 4x, so we are in waiting mode to see if this resolves it. Thanks.
12-05-2016
09:49 AM
Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)
Parcel: CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated

$ uname -a
Linux hostname_redacted 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

We initially thought we were exceeding impala-shell resources with our insert-select statement moving external CSV data into an internal Parquet table; however, now a simple 'compute incremental stats tablename' has become stuck as well. This is causing us grief in our production environment, and we are having to constantly check port 25000 and manually restart the particular Impala daemon spinning the CPU. Luckily our insert scripts are fault tolerant and just repeat on failure (but once all CPUs are consumed spinning, we are dead in the water). We are not sure, but this seems to have started after we upgraded from 5.7.1 to 5.8.2.

In the logs, immediately after the 'stuck' query there is always this error:

I1204 03:30:03.958894 7150 Frontend.java:875] analyze query compute incremental stats tablename
I1204 03:30:03.959247 7150 Frontend.java:819] Requesting prioritized load of table(s): default.tablename
I1204 03:32:03.970648 7150 Frontend.java:894] Missing tables were not received in 120000ms. Load request will be retried.
I1204 03:32:03.970940 7150 Frontend.java:819] Requesting prioritized load of table(s): default.tablename
I1204 03:32:37.981461 7142 jni-util.cc:166] com.cloudera.impala.catalog.CatalogException: Detected catalog service ID change. Aborting updateCatalog()
    at com.cloudera.impala.catalog.ImpaladCatalog.updateCatalog(ImpaladCatalog.java:120)
    at com.cloudera.impala.service.Frontend.updateCatalogCache(Frontend.java:227)
    at com.cloudera.impala.service.JniFrontend.updateCatalogCache(JniFrontend.java:180)
I1204 03:32:37.983515 7142 status.cc:111] CatalogException: Detected catalog service ID change. Aborting updateCatalog()
    @ 0x80f2c9 (unknown) @ 0xb37c30 (unknown) @ 0xa4e5cf (unknown) @ 0xa68ea9 (unknown) @ 0xb00a02 (unknown) @ 0xb068f3 (unknown) @ 0xd2bed8 (unknown) @ 0xd2b114 (unknown) @ 0x7dc26c (unknown) @ 0x1b208bf (unknown) @ 0x9b0a39 (unknown) @ 0x9b1492 (unknown) @ 0xb89327 (unknown) @ 0xb89c64 (unknown) @ 0xdee99a (unknown) @ 0x3f37a07aa1 (unknown) @ 0x3f376e893d (unknown)
E1204 03:32:37.983541 7142 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
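Two hedged checks for spotting the stuck statements without clicking through each box; port 25000 is the impalad debug web UI already mentioned above, and the page/field names should be confirmed against your release (<impalad-host> is a placeholder):

$ curl -s http://<impalad-host>:25000/queries | grep -i 'in flight'   # in-flight count shown on the debug page
$ ps -o pid,etime,pcpu,cmd -C impalad                                 # which impalad has been burning a CPU, and for how long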
12-03-2016
01:51 PM
We are getting stuck Impala 'insert' operations, which immediately afterwards log this error:

I1203 03:44:12.598379 25140 impala-beeswax-server.cc:171] query(): query=insert into table AParquetTable (params) partition select (more params) from AnExternalCsvTable
I1203 03:44:12.599138 25140 impala-beeswax-server.cc:544] TClientRequest.queryOptions: TQueryOptions { params }
I1203 03:44:12.600991 25140 Frontend.java:875] analyze query insert into table AParquetTable (params) partition select (more params) from AnExternalCsvTable
I1203 03:44:12.607447 25140 Frontend.java:819] Requesting prioritized load of table(s): default.aparquettable
I1203 03:44:43.429272 25124 jni-util.cc:166] com.cloudera.impala.catalog.CatalogException: Detected catalog service ID change. Aborting updateCatalog()
    at com.cloudera.impala.catalog.ImpaladCatalog.updateCatalog(ImpaladCatalog.java:120)
    at com.cloudera.impala.service.Frontend.updateCatalogCache(Frontend.java:227)
    at com.cloudera.impala.service.JniFrontend.updateCatalogCache(JniFrontend.java:180)
I1203 03:44:43.431560 25124 status.cc:111] CatalogException: Detected catalog service ID change. Aborting updateCatalog()
    @ 0x80f2c9 (unknown) @ 0xb37c30 (unknown) @ 0xa4e5cf (unknown) @ 0xa68ea9 (unknown) @ 0xb00a02 (unknown) @ 0xb068f3 (unknown) @ 0xd2bed8 (unknown) @ 0xd2b114 (unknown) @ 0x7dc26c (unknown) @ 0x1b208bf (unknown) @ 0x9b0a39 (unknown) @ 0x9b1492 (unknown) @ 0xb89327 (unknown) @ 0xb89c64 (unknown) @ 0xdee99a (unknown) @ 0x3d05e07aa1 (unknown) @ 0x3d05ae893d (unknown)
E1203 03:44:43.431589 25124 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
I1203 04:47:33.657946 25141 impala-server.cc:1676] Connection from client 151.214.100.169:41748 closed, closing 1 associated session(s)
I1203 04:47:33.658761 25141 status.cc:111] Session closed
    @ 0x80f2c9 (unknown) @ 0xa727b9 (unknown) @ 0xa72cb8 (unknown) @ 0x9aac90 (unknown) @ 0x1b1ca93 (unknown) @ 0x1b03b09 (unknown) @ 0x9b0a39 (unknown) @ 0x9b1492 (unknown) @ 0xb89327 (unknown) @ 0xb89c64 (unknown) @ 0xdee99a (unknown) @ 0x3d05e07aa1 (unknown) @ 0x3d05ae893d (unknown)
11-15-2016
02:24 PM
Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)
Parcel: CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated

$ uname -a
Linux hostname_redacted 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

We are seeing this error 2 to 10 times per day. What is this? Thank you.

Log file created at: 2016/11/03 18:32:33
Running on machine: hostname_redacted
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E1103 18:32:33.125335 3940 logging.cc:118] stderr will be logged to this file.
E1104 03:46:23.122285 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1106 03:36:54.086156 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1108 08:16:41.835364 4764 fe-support.cc:308] RPC client failed to connect: Couldn't open transport for hostname_redacted:26000 (connect() failed: Connection refused)
E1108 08:16:42.003527 4764 fe-support.cc:308] RPC client failed to connect: Couldn't open transport for hostname_redacted:26000 (connect() failed: Connection refused)
E1108 08:16:59.126739 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1108 09:32:24.577546 4766 fe-support.cc:308] RPC client failed to connect: Couldn't open transport for hostname_redacted:26000 (connect() failed: Connection refused)
E1108 09:32:24.595510 4766 fe-support.cc:308] RPC client failed to connect: Couldn't open transport for hostname_redacted:26000 (connect() failed: Connection refused)
E1108 09:32:40.664857 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1109 03:21:00.335299 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1109 07:47:05.972446 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1112 01:14:54.006551 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1112 03:15:03.835086 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1113 06:14:55.153087 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
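The "catalog service ID change" message is generally what an impalad logs when it notices a different catalogd instance than before, so a hedged first check is simply whether catalogd (or statestored) keeps restarting around those timestamps:

$ ps -o pid,lstart,etime,cmd -C catalogd      # process start time vs. the error timestamps above
$ ps -o pid,lstart,etime,cmd -C statestored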
08-24-2016
03:27 PM
RedHat 6.7
Parcel: 5.7.0-1.cdh5.7.0.p0.45
RPMs:
cloudera-manager-server-5.7.0-1.cm570.p0.76.el6.x86_64
cloudera-manager-agent-5.7.0-1.cm570.p0.76.el6.x86_64
cloudera-manager-daemons-5.7.0-1.cm570.p0.76.el6.x86_64

The Host Monitor is running and the cluster is green, no issues. We are running parcels, so everything is the same version, but we see this error over and over in the NameNode log:

2016-08-24 16:06:51,497 WARN BlockStateChange: BLOCK* processReport: Report from the DataNode (dc43ee8e-42ea-4d17-afb0-dc4816c5e4ca) is unsorted. This will cause overhead on the NameNode which needs to sort the Full BR. Please update the DataNode to the same version of Hadoop HDFS as the NameNode (2.6.0-cdh5.7.0).
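A hedged sanity check that the DataNode named in the warning really is running the same parcel build as the NameNode; the parcel symlink below is the usual default location and worth confirming on your hosts:

$ hadoop version | head -1              # run on the NameNode host and on the DataNode host from the warning
$ readlink /opt/cloudera/parcels/CDH    # which parcel build this host is actually activated on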
08-24-2016
03:01 PM
1 Kudo
RedHat 6.7
Parcel: 5.7.0-1.cdh5.7.0.p0.45
RPMs:
cloudera-manager-server-5.7.0-1.cm570.p0.76.el6.x86_64
cloudera-manager-agent-5.7.0-1.cm570.p0.76.el6.x86_64
cloudera-manager-daemons-5.7.0-1.cm570.p0.76.el6.x86_64

The Host Monitor is running and the cluster is green, no issues. We saw other posts relating to this WARN message, but in our case the system was created recently and is basically doing nothing; it runs Impala and a shared filesystem, and so far we do not observe any failures of either of these features. Please advise on how to repair this issue and determine what failures it might be causing. The event log is full of chatter/noise, and this is one of the issues reported over and over.

2016-08-24 15:55:48,385 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.10.10.120:53028
2016-08-24 15:55:48,387 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x3567b06b3cc5dd5 with negotiated timeout 30000 for client /10.10.10.120:53028
2016-08-24 15:55:48,404 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.10.10.120:53032
2016-08-24 15:55:48,406 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.10.10.120:53032
2016-08-24 15:55:48,407 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x3567b06b3cc5dd6 with negotiated timeout 30000 for client /10.10.10.120:53032
2016-08-24 15:55:48,440 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1567b06b3ce5ea6
2016-08-24 15:55:48,441 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x3567b06b3cc5dd6
2016-08-24 15:55:48,442 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.10.10.120:53032 which had sessionid 0x3567b06b3cc5dd6
2016-08-24 15:55:48,443 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x2567b0773d75e32
2016-08-24 15:55:48,448 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x3567b06b3cc5dd5
2016-08-24 15:55:48,449 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x3567b06b3cc5dd5, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Unknown Source)
2016-08-24 15:55:48,449 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.10.10.120:53028 which had sessionid 0x3567b06b3cc5dd5
2016-08-24 15:56:48,413 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.10.10.120:53180
2016-08-24 15:56:48,413 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.10.10.120:53180
2016-08-24 15:56:48,414 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x3567b06b3cc5dd7 with negotiated timeout 30000 for client /10.10.10.120:53180
2016-08-24 15:56:48,450 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1567b06b3ce5ea8
2016-08-24 15:56:48,455 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x3567b06b3cc5dd7
2016-08-24 15:56:48,456 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.10.10.120:53180 which had sessionid 0x3567b06b3cc5dd7
2016-08-24 15:56:48,456 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x2567b0773d75e33
2016-08-24 15:56:48,459 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1567b06b3ce5ea7
2016-08-24 15:57:53,389 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.10.10.120:53342
2016-08-24 15:57:53,390 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.10.10.120:53342
2016-08-24 15:57:53,391 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x3567b06b3cc5dd8 with negotiated timeout 30000 for client /10.10.10.120:53342
2016-08-24 15:57:53,416 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.10.10.120:53348
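This connect-then-close-every-minute pattern from a single host is often just periodic monitoring rather than a fault, but a hedged way to confirm which process on 10.10.10.120 keeps opening these short-lived sessions (run on that host; 2181 is assumed to be the ZooKeeper client port):

$ netstat -tanp | grep ':2181 '    # map the ephemeral client ports seen in the log above to a PID/program name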
08-24-2016
02:43 PM
RedHat 6.7
Parcel: 5.7.0-1.cdh5.7.0.p0.45
RPMs:
cloudera-manager-server-5.7.0-1.cm570.p0.76.el6.x86_64
cloudera-manager-agent-5.7.0-1.cm570.p0.76.el6.x86_64
cloudera-manager-daemons-5.7.0-1.cm570.p0.76.el6.x86_64

The Host Monitor is running and the cluster is green, no issues. However, the event log is full of chatter/noise, and this is one of the issues reported over and over.

# tail -n 100 /var/log/cloudera-scm-firehose/mgmt-cmf-mgmt-HOSTMONITOR-hostname.log.out
2016-08-24 15:45:47,969 ERROR com.cloudera.cmf.BasicScmProxy: Failed request to SCM: 302
2016-08-24 15:45:48,969 INFO com.cloudera.cmf.BasicScmProxy: Authentication to SCM required.
2016-08-24 15:45:49,027 INFO com.cloudera.cmf.BasicScmProxy: Using encrypted credentials for SCM
2016-08-24 15:45:49,031 INFO com.cloudera.cmf.BasicScmProxy: Authenticated to SCM.
2016-08-24 15:47:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Running the LDBTimeSeriesRollupManager at 2016-08-24T20:47:55.591Z, forMigratedData=false
2016-08-24 15:49:56,764 INFO com.cloudera.cmon.tstore.leveldb.LDBResourceManager: Closed: 0 partitions
2016-08-24 15:52:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Running the LDBTimeSeriesRollupManager at 2016-08-24T20:52:55.591Z, forMigratedData=false
2016-08-24 15:52:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Starting rollup from raw to rollup=TEN_MINUTELY for rollupTimestamp=2016-08-24T20:50:00.000Z
2016-08-24 15:52:56,438 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Finished rollup: duration=PT0.847S, numStreamsChecked=41065, numStreamsRolledUp=3228
2016-08-24 15:57:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Running the LDBTimeSeriesRollupManager at 2016-08-24T20:57:55.591Z, forMigratedData=false
2016-08-24 15:59:56,767 INFO com.cloudera.cmon.tstore.leveldb.LDBResourceManager: Closed: 0 partitions
2016-08-24 16:02:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Running the LDBTimeSeriesRollupManager at 2016-08-24T21:02:55.591Z, forMigratedData=false
2016-08-24 16:02:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Starting rollup from raw to rollup=TEN_MINUTELY for rollupTimestamp=2016-08-24T21:00:00.000Z
2016-08-24 16:02:56,527 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Finished rollup: duration=PT0.936S, numStreamsChecked=41065, numStreamsRolledUp=3228
2016-08-24 16:02:56,527 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Starting rollup from ts_stream_rollup_PT600S to rollup=HOURLY for rollupTimestamp=2016-08-24T21:00:00.000Z
2016-08-24 16:02:57,381 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Finished rollup: duration=PT0.854S, numStreamsChecked=41065, numStreamsRolledUp=3228
2016-08-24 16:07:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Running the LDBTimeSeriesRollupManager at 2016-08-24T21:07:55.591Z, forMigratedData=false
2016-08-24 16:09:56,772 INFO com.cloudera.cmon.tstore.leveldb.LDBResourceManager: Closed: 0 partitions
2016-08-24 16:12:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Running the LDBTimeSeriesRollupManager at 2016-08-24T21:12:55.591Z, forMigratedData=false
2016-08-24 16:12:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Starting rollup from raw to rollup=TEN_MINUTELY for rollupTimestamp=2016-08-24T21:10:00.000Z
2016-08-24 16:12:56,282 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Finished rollup: duration=PT0.691S, numStreamsChecked=41065, numStreamsRolledUp=3228
2016-08-24 16:16:47,102 ERROR com.cloudera.cmf.BasicScmProxy: Failed request to SCM: 302
2016-08-24 16:16:48,103 INFO com.cloudera.cmf.BasicScmProxy: Authentication to SCM required.
2016-08-24 16:16:48,160 INFO com.cloudera.cmf.BasicScmProxy: Using encrypted credentials for SCM
2016-08-24 16:16:48,165 INFO com.cloudera.cmf.BasicScmProxy: Authenticated to SCM.
08-08-2016
12:03 PM
We are sure the NameNodes failed over, but it's difficult to find this in the logs as there is so much noise in there (that, and we were doing testing with starting/stopping the system). Are there keywords to search for when this happens? We also cannot find any kind of failover history in the SCM portal; the UI is really lacking in this area. Any suggestions on how to track down a proper instance of the failover event would be appreciated (a keyword in a log, etc.).
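A hedged set of greps to start from; the exact phrases differ slightly between releases, and the log file names below follow the hadoop-cmf naming shown elsewhere in this history:

$ grep -iE 'active state|standby state' /var/log/hadoop-hdfs/*NAMENODE*.log*              # NameNode state transitions
$ grep -iE 'failover|transition|elect' /var/log/hadoop-hdfs/*FAILOVERCONTROLLER*.log*     # ZKFC election / fencing activity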
08-08-2016
11:55 AM
http://pastebin.com/gRaPXLiN http://pastebin.com/3d7CSuSG http://pastebin.com/DCGr2EBB http://pastebin.com/N2FZ33nH
08-08-2016
08:04 AM
We have two NameNodes (inf-v1 & inf-v2), and occasionally we've observed the active NameNode become standby and the standby become active (with no changes by us). How do we determine the cause of this?

2016-06-30 10:48:35,376 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: -- Local NN thread dump --
Process Thread Dump:
15 active threads
Thread 23 (Timer-0):
State: TIMED_WAITING
Blocked count: 0
Waited count: 1
Stack:
java.lang.Object.wait(Native Method)
java.util.TimerThread.mainLoop(Unknown Source)
java.util.TimerThread.run(Unknown Source)
Thread 22 (229116977@qtp-1041125371-1 - Acceptor0 HttpServer2$SelectChannelConnectorWithSafeStartup@inf-v2.archive.com:50070):
State: RUNNABLE
Blocked count: 1
Waited count: 1
Stack:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(Unknown Source)
sun.nio.ch.EPollSelectorImpl.doSelect(Unknown Source)
sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
sun.nio.ch.SelectorImpl.select(Unknown Source)
org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:511)
org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:193)
org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Thread 21 (53050680@qtp-1041125371-0):
State: RUNNABLE
Blocked count: 2
Waited count: 1
Stack:
sun.management.ThreadImpl.getThreadInfo1(Native Method)
sun.management.ThreadImpl.getThreadInfo(Unknown Source)
sun.management.ThreadImpl.getThreadInfo(Unknown Source)
org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:165)
org.apache.hadoop.http.HttpServer2$StackServlet.doGet(HttpServer2.java:1155)
javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1286)
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
Thread 20 (pool-2-thread-1):
State: TIMED_WAITING
Blocked count: 0
Waited count: 1
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
Thread 18 (org.apache.hadoop.util.JvmPauseMonitor$Monitor@28badca2):
State: TIMED_WAITING
Blocked count: 1
Waited count: 7
Stack:
java.lang.Thread.sleep(Native Method)
org.apache.hadoop.util.JvmPauseMonitor$Monitor.run(JvmPauseMonitor.java:181)
java.lang.Thread.run(Unknown Source)
Thread 17 (Timer for 'NameNode' metrics system):
State: TIMED_WAITING
Blocked count: 0
Waited count: 1
Stack:
java.lang.Object.wait(Native Method)
java.util.TimerThread.mainLoop(Unknown Source)
java.util.TimerThread.run(Unknown Source)
Thread 14 (process reaper):
State: TIMED_WAITING
Blocked count: 1
Waited count: 2
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(Unknown Source)
java.util.concurrent.SynchronousQueue$TransferStack.transfer(Unknown Source)
java.util.concurrent.SynchronousQueue.poll(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
Thread 13 (event filterer):
State: TIMED_WAITING
Blocked count: 0
Waited count: 74
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
java.util.concurrent.LinkedBlockingQueue.poll(Unknown Source)
com.cloudera.cmf.eventcatcher.client.logs.LogEventProcessor.runFiltering(LogEventProcessor.java:132)
com.cloudera.cmf.eventcatcher.client.logs.LogEventProcessor.access$000(LogEventProcessor.java:28)
com.cloudera.cmf.eventcatcher.client.logs.LogEventProcessor$1.run(LogEventProcessor.java:81)
Thread 12 (RMI TCP Accept-0):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.net.PlainSocketImpl.socketAccept(Native Method)
java.net.AbstractPlainSocketImpl.accept(Unknown Source)
java.net.ServerSocket.implAccept(Unknown Source)
java.net.ServerSocket.accept(Unknown Source)
sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(Unknown Source)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(Unknown Source)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
Thread 11 (RMI TCP Accept-8064):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.net.PlainSocketImpl.socketAccept(Native Method)
java.net.AbstractPlainSocketImpl.accept(Unknown Source)
java.net.ServerSocket.implAccept(Unknown Source)
java.net.ServerSocket.accept(Unknown Source)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(Unknown Source)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
Thread 10 (RMI TCP Accept-0):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.net.PlainSocketImpl.socketAccept(Native Method)
java.net.AbstractPlainSocketImpl.accept(Unknown Source)
java.net.ServerSocket.implAccept(Unknown Source)
java.net.ServerSocket.accept(Unknown Source)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(Unknown Source)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
Thread 5 (Signal Dispatcher):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
Thread 3 (Finalizer):
State: WAITING
Blocked count: 5
Waited count: 2
Waiting on java.lang.ref.ReferenceQueue$Lock@d17ffd
Stack:
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(Unknown Source)
java.lang.ref.ReferenceQueue.remove(Unknown Source)
java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
Thread 2 (Reference Handler):
State: WAITING
Blocked count: 1
Waited count: 2
Waiting on java.lang.ref.Reference$Lock@3a29bb76
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)
Thread 1 (main):
State: RUNNABLE
Blocked count: 4
Waited count: 3
Stack:
java.util.zip.ZipFile.getEntry(Native Method)
java.util.zip.ZipFile.getEntry(Unknown Source)
java.util.jar.JarFile.getEntry(Unknown Source)
java.util.jar.JarFile.getJarEntry(Unknown Source)
sun.misc.URLClassPath$JarLoader.getResource(Unknown Source)
sun.misc.URLClassPath$JarLoader.findResource(Unknown Source)
sun.misc.URLClassPath.findResource(Unknown Source)
java.net.URLClassLoader$2.run(Unknown Source)
java.net.URLClassLoader$2.run(Unknown Source)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findResource(Unknown Source)
java.lang.ClassLoader.getResource(Unknown Source)
java.net.URLClassLoader.getResourceAsStream(Unknown Source)
org.apache.xerces.parsers.SecuritySupport$6.run(Unknown Source)
java.security.AccessController.doPrivileged(Native Method)
org.apache.xerces.parsers.SecuritySupport.getResourceAsStream(Unknown Source)
org.apache.xerces.parsers.ObjectFactory.findJarServiceProvider(Unknown Source)
org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
-- Local NN thread dump --
- Tags:
- HA
08-08-2016
07:56 AM
These errors are filling the event log but do not seem to cause any issues, and all health is reported green.

WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:impala (auth:SIMPLE) cause:org.apache.hadoop.hdfs.protocol.AclException: The ACL operation has been rejected. Support for ACLs has been disabled by setting dfs.namenode.acls.enabled to false.
org.apache.hadoop.ipc.Server: IPC Server handler 11 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getAclStatus from 10.10.10.80:46973 Call#8 Retry#0: org.apache.hadoop.hdfs.protocol.AclException: The ACL operation has been rejected. Support for ACLs has been disabled by setting dfs.namenode.acls.enabled to false.

We would like to fix the root cause to reduce noise in the logs. Thanks.
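The message means a client (here the impala user) is calling getAclStatus while ACLs are disabled on the NameNode, so the usual directions are either to enable ACLs (dfs.namenode.acls.enabled) in the HDFS configuration or to accept the WARN as harmless. A hedged check of what the running NameNode actually has deployed, reusing the agent process-directory layout that appears elsewhere in this history (the property may simply be absent when it is left at its default of false):

$ grep -A1 'dfs.namenode.acls.enabled' /var/run/cloudera-scm-agent/process/*NAMENODE*/hdfs-site.xml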
06-30-2016
07:18 AM
Increased the heap, and safe mode now lasts 17 minutes, which is an improvement. Now that we have switched from RPMs to parcel packages, we will implement high availability for the name service (NN x 2 instead of NN + SNN) and adopt a restart procedure with manual failovers, which hopefully will avoid this long restart going forward. Thanks.
06-14-2016
08:48 AM
These four example messages appear over and over in the NameNode log:

2016-06-12 00:15:06,486 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.10.10.84:50010 is added to blk_1074147634_1099546079234{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-6cb61d85-d3d4-4e98-ae2d-0313ab8941c4:NORMAL:10.10.10.83:50010|RBW], ReplicaUnderConstruction[[DISK]DS-d7077514-495d-42b2-a2ec-9f76fcc43db4:NORMAL:10.10.10.80:50010|RBW], ReplicaUnderConstruction[[DISK]DS-4b57fed3-06c0-444e-b217-e67f36edff97:NORMAL:10.10.10.84:50010|RBW]]} size 0
2016-06-12 15:23:56,496 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1074163953_1099546095553 10.10.10.85:50010 10.10.10.84:50010 10.10.10.83:50010
2016-06-12 15:24:27,731 INFO BlockStateChange: BLOCK* BlockManager: ask 10.10.10.80:50010 to delete [blk_1074163959_1099546095559]
2016-06-13 12:12:19,705 INFO BlockStateChange: BLOCK* ask 10.10.10.83:50010 to replicate blk_1073950158_1099545881758 to datanode(s) 10.10.10.85:50010

SCM 5.7.0 & CDH 5.7.0 on RedHat 6.7. Another test system, almost identical to the trouble system, can restart the cluster in about 5 minutes. $ hdfs fsck / reports a healthy system and 0 under-replicated blocks. This system always recovers, but every restart is taking an hour. We have redundant directories on different partitions for the NameNode fsimage, along with another host running the Secondary NameNode. Checkpoints complete once safe mode is off.
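A hedged way to watch what the NameNode is actually waiting on during the slow start; 50070 is the default NameNode web port (it also appears in the thread dump elsewhere in this history), and <namenode-host> is a placeholder:

$ hdfs dfsadmin -safemode get
$ curl -s http://<namenode-host>:50070/jmx | grep -E '"Safemode"|UnderReplicatedBlocks|NumLiveDataNodes'   # block-report and replication progress as the DataNodes check in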
06-14-2016
07:38 AM
CDH 5.7.0 is not having this issue.
04-11-2016
01:08 PM
Previously we were on CDH 4.7.1 with SCM 4.8.6 and did not experience this problem. We are currently testing CDH 5.5.1 with SCM 5.5.1. The Hive Metastore is running on the same host as Oracle, started via service cloudera-scm-agent on runlevels 345. After a reboot, the Hive Metastore gets stuck:

[main]: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:oracle:thin:@ //i11-db-01:1521/odpdd, username = hiveuser. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection

Once in this condition I can restart the 'hive1' service via the SCM portal; then Impala and the SCM portal return to functioning properly. As a test I disabled cloudera-scm-agent with:

# chkconfig cloudera-scm-agent off

then did a reboot, monitored Oracle connection availability from a separate workstation using 'SQuirreL SQL Client', and once it succeeded performed:

# service cloudera-scm-agent start

It came up fine. We guess that even though Oracle starts on runlevel 2, it is not ready for connections when the Hive Metastore begins, and the Hive Metastore does not gracefully handle this scenario. Any suggestions? What about this 'lazyInit' parameter? (One possible workaround is sketched after the stack trace below.) Thank you. The full log follows:

Apr 11, 2:12:30.623 PM INFO org.apache.hadoop.hive.metastore.HiveMetaStore
[main]: Starting hive metastore on port 9083
Apr 11, 2:12:34.235 PM INFO org.apache.hadoop.hive.metastore.HiveMetaStore
[main]: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
Apr 11, 2:12:35.866 PM INFO org.apache.hadoop.hive.metastore.ObjectStore
[main]: ObjectStore, initialize called
Apr 11, 2:12:37.265 PM INFO DataNucleus.Persistence
[main]: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
Apr 11, 2:12:37.265 PM INFO DataNucleus.Persistence
[main]: Property datanucleus.cache.level2 unknown - will be ignored
Apr 11, 2:12:40.032 PM ERROR DataNucleus.Datastore.Schema
[main]: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:oracle:thin:@ //i11-db-01:1521/odpdd, username = hiveuser. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------ java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:419)
at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:536)
at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:228)
at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)
at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:521)
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:416)
at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120)
at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:501)
at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:298)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:406)
at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:435)
at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:330)
at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:286)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:56)
at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:65)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:596)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:574)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:623)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:464)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5775)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5770)
at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:6022)
at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:5947)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: oracle.net.ns.NetException: The Network Adapter could not establish the connection
at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:375)
at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:422)
at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:678)
at oracle.net.ns.NSProtocol.connect(NSProtocol.java:238)
at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1054)
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:308)
... 56 more
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:209)
at oracle.net.nt.ConnOption.connect(ConnOption.java:123)
at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:353)
... 61 more
------
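One possible workaround, sketched under the assumption that bash is available at boot and that delaying the agent until Oracle's listener answers on i11-db-01:1521 is acceptable; this is not a supported Cloudera mechanism, just a boot-order shim to test the theory above (disable the agent's own init script first with: chkconfig cloudera-scm-agent off, then run the snippet as root at boot, e.g. from /etc/rc.local):

until (echo > /dev/tcp/i11-db-01/1521) 2>/dev/null; do
    sleep 5   # keep probing the Oracle listener until it accepts TCP connections
done
service cloudera-scm-agent start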