Member since: 04-08-2016
Posts: 29
Kudos Received: 2
Solutions: 5

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1593 | 12-08-2017 02:19 PM |
| | 7706 | 01-04-2017 02:58 PM |
| | 4928 | 12-08-2016 07:14 AM |
| | 3525 | 12-08-2016 07:12 AM |
| | 3272 | 06-14-2016 07:38 AM |
05-21-2018
07:21 AM
Yes! There was a snapshot. Thank you!
05-18-2018
03:12 PM
1 Kudo
CDH 5.13.1, Red Hat 6.9

We wish to change the replication factor from the default of 3 copies to 2 on one particular folder in HDFS. After running this on one cluster:

$ hdfs dfs -setrep -R 2 /backups

and then doing a

$ hdfs dfs -du /

we saw that it freed the blocks very quickly, and the output of fsck shows no over-replicated blocks:

Status: HEALTHY
Total size: 149514016589 B
Total dirs: 27440
Total files: 128746
Total symlinks: 0
Total blocks (validated): 126355 (avg. block size 1183285 B)
Minimally replicated blocks: 126355 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.3367577
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1

However, on a bigger test system we ran the same command, and even a day later there is still no change. $ hdfs fsck / still shows over-replicated blocks:

Status: HEALTHY
Total size: 56614841380 B
Total dirs: 7222
Total files: 113731
Total symlinks: 0
Total blocks (validated): 110143 (avg. block size 514012 B)
Minimally replicated blocks: 110143 (100.0 %)
Over-replicated blocks: 37439 (33.991264 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.9921465
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 8
Number of racks: 1

The number of over-replicated blocks has dropped slightly but seems stuck at 37439. I've manually restarted each DataNode service, and later restarted the entire cluster. It is still stuck at 37439.

I found this comment from Harsh J:

| Then monitor the over-replicated blocks in Cloudera Manager via the below chart tsquery:
| SELECT excess_blocks WHERE roleType = NAMENODE
| This should show a spike and then begin a slow but steady drop back to zero over time, which you can monitor.

but when I run this query it reports "excess_blocks" is 0.

$ hdfs dfs -du /
22987202359 69376013863 /backups

shows 3 copies still. How do we get this space cleared? A rebalance did nothing. Thanks.
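A couple of hedged follow-up checks (standard HDFS CLI only; the fsck output format varies a little between releases) to confirm whether the namespace really records replication factor 2 under /backups and how many blocks fsck still counts with three replicas:

$ hdfs dfs -stat '%r %n' /backups/*                       # replication factor recorded per entry in the namespace
$ hdfs fsck /backups -files -blocks | grep -c ' repl=3'   # blocks under /backups that fsck still reports with 3 replicas
$ hdfs dfs -du -s /backups                                # logical size vs. space consumed across all replicas

If the namespace already shows 2 but the consumed-space column stays near 3x, the excess replicas are most likely still sitting on the DataNodes waiting for the NameNode to schedule their invalidation.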
12-08-2017
02:19 PM
This problem is HDFS-9530, which has a fix in CDH 5.9.0. Bouncing the DataNode instances clears the issue manually until we can upgrade.
12-07-2017
09:53 AM
Seeing this issue on all DataNodes. Example for one node:

hadoop has its own partition
bash 'du -h --max-depth=1' in the hadoop partition reports the 'dn' directory is consuming 207G
bash 'df -h' reports the hadoop partition as size 296G, used 208G, Avail 73G, Use% 75%

Configured Capacity: 314825441690 (293.20 GB) -- good
DFS Used: 221825508284 (206.59 GB) -- good
Non DFS Used: 55394479116 (51.59 GB) -- ??? bash says 1G used outside of the 'dn' directory in the partition
DFS Remaining: 37605454290 (35.02 GB) -- ??? bash says 73G free
DFS Used%: 70.46%
DFS Remaining%: 11.94%

fsck reports healthy
redhat 6.9
5.8.2-1.cdh5.8.2.p0.3
dfs.datanode.du.reserved == 1.96GiB

How do we troubleshoot this? Thanks.

hdfs dfsadmin -report
Configured Capacity: 1574127208450 (1.43 TB)
Present Capacity: 1277963063885 (1.16 TB)
DFS Remaining: 410632669242 (382.43 GB)
DFS Used: 867330394643 (807.76 GB)
DFS Used%: 67.87%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

hdfs fsck /
Total size: 281353009325 B
Total dirs: 5236
Total files: 501295
Total symlinks: 0 (Files currently being written: 37)
Total blocks (validated): 501272 (avg. block size 561278 B)
Minimally replicated blocks: 501272 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 5
Number of racks: 1
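A hedged way to compare what the DataNode reports against what the filesystem itself sees, assuming the block directory lives under a mount such as /hadoop/dn (adjust the paths to your layout):

$ du -s -B1 /hadoop/dn                           # bytes on disk under the DataNode block directory
$ df -B1 /hadoop                                 # raw partition numbers in bytes, for an apples-to-apples comparison
$ hdfs dfsadmin -report | grep -A 10 '^Name:'    # per-DataNode Configured Capacity / DFS Used / Non DFS Used / Remaining

Note that "Non DFS Used" is a derived figure (roughly configured capacity minus DFS used minus remaining), so large mismatches often trace back to dfs.datanode.du.reserved accounting rather than to real non-HDFS files on the partition.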
12-06-2017
01:47 PM
This slow cluster restart continues to be a problem for us, especially after a Red Hat reboot (to apply security updates). In the logs it initially says there are 0 DataNodes, then eventually 1, then eventually 2, and so on (we only have 5 DataNodes in this small cluster). In the meanwhile it goes crazy with 'under replicated blocks' -- in reality there were no under-replicated blocks prior to the cluster stop -- but it appears that since the NameNode does not immediately find them at startup, it goes into a big block-recovery activity, possibly hindering recovery/startup times.

1) Why is it slow to discover restarted DataNodes, and
2) is there a way to delay under-replicated block recovery after a restart (for, say, 15 minutes)?

This appears to have a follow-on problem: once the cluster has finally recovered and is considered green status, some Impala daemons are non-functional. In this case:

3) it does not appear there is a canary test for every Impala node, and
4) Impala daemons which are dead are marked green/healthy, but are unable to respond to queries.

Our work-around is, after every restart, to bounce Impala once more at the end. Then all is well.

All of this behavior happens with Parcels 5.8.2 as well as 5.12.1 (slow starts, and Impala health not accurate). Red Hat 6.9 with the latest security updates.
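For question 2, one hedged option for planned maintenance is to hold HDFS in safe mode manually so no replication work is scheduled while DataNode hosts are rebooted. These are standard hdfs dfsadmin commands, but note that a manually entered safe mode does not survive a NameNode restart, so for a full-cluster restart you would re-enter it right after the NameNodes come back:

$ hdfs dfsadmin -safemode enter                      # no block replication or invalidation is scheduled while in safe mode
# ... reboot / patch the DataNode hosts ...
$ hdfs dfsadmin -report | grep 'Live datanodes'      # wait here until all 5 DataNodes have re-registered
$ hdfs dfsadmin -safemode leave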
01-06-2017
02:37 PM
RedHat Linux 6.7, CDH 5.8.2 from parcels, with Cloudera Manager 5.8.2.

We have built a 3-node test cluster, including high availability for the NameNode, Hive Metastore, and HiveServer, as follows on the "Hosts > Roles" page:

Cluster 1 - CDH 5
| Hosts | Count | Roles |
|---|---|---|
| vm-54 | 1 | M RS B DN JN G ICS ID ISS AP ES HM SM JHS NM RM S |
| vm-[55-56] | 2 | RS DN FC JN NN G HMS HS2 ID NM S |

This table is grouped by hosts having the same roles assigned to them. vm-55 and vm-56 are essentially identical, with Cloudera Manager running on vm-54.

We wanted to know what happens if a node becomes corrupted and cannot be recovered. We powered down vm-55, cloned it to a new vm-57, and deleted all data directories for hadoop, zookeeper, etc. on vm-57. We started vm-57, and this node took on all roles from vm-55 right away. We were expecting to have to manually add roles to the new vm-57, but it had already taken on the roles magically. Was this because we missed some data files, or is this the default behavior for Cloudera Manager when a new agent is added to the cluster in place of a missing server? Where can we find documentation on how to deal with replacement of a corrupted critical server in an HA configuration (such as one running FC, JN, NN, S, etc.)? Thank you.
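One plausible explanation, worth verifying since the path below is just the usual Cloudera Manager 5 agent location: the agent identifies its host to Cloudera Manager with a UUID stored on disk, and a cloned VM carries the original host's UUID, so CM can treat the clone as the old host and hand its roles straight back.

$ cat /var/lib/cloudera-scm-agent/uuid    # the agent's host identity; a clone inherits this from the source VM
# sketch of making a clone register as a brand-new host instead (test on a lab host first):
$ service cloudera-scm-agent stop
$ rm /var/lib/cloudera-scm-agent/uuid     # a fresh uuid is generated on the next agent start
$ service cloudera-scm-agent start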
01-06-2017
09:35 AM
The root cause seems to be that there are two 'textarea' boxes for the parameter 'java configuration options for nodemanager', and if these do not contain the same value, then the NodeManager will not start. These are the two boxes:

NODEMANAGER Imported From: TaskTracker (1)
NODEMANAGER Imported From: TaskTracker Default Group

Shouldn't Cloudera Manager disallow this condition, or protect the user from it happening in the first place? Thanks. (Our guess is that the JVM ends up receiving an empty string for this parameter when the two values do not match.)
01-04-2017
02:58 PM
Found two values for the search "java configuration options for nodemanager". Copy/pasted to make them the same (we had added JMX parameters). This seems to have fixed it; needs verification.
01-04-2017
02:40 PM
Thanks for your reply. We have 5 nodes configured to run NodeManager; 1 works but 4 fail. If "Java Configuration Options for NodeManager" were an empty string, then none should start, correct? It's not empty. If you have other ideas we would appreciate them. Thanks.
12-14-2016
09:57 AM
Thanks for the quick response.

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.7 (Santiago)

# uname -a
Linux hostname 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Cloudera Manager version:
Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)

CDH Parcel version:
CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated

# cat /var/log/hadoop-yarn/hadoop-cmf-yarn-NODEMANAGER-hostname.log.out
http://pastebin.com/iu4hR03Q
==> we assume the SIGTERM is caused by the Cloudera Agent (perhaps giving up waiting on some indication that the NM is running properly)

# cat /var/log/cloudera-scm-agent/cloudera-scm-agent.out
http://pastebin.com/8StbBsj4
==> there are errors in here ('ValueError: dictionary update sequence element #25 has length 1; 2 is required' and 'MainThread agent ERROR Failed to activate')

There is no 'logs' directory within the process/*NODEMANAGER* directory (so no stderr to be found):

# find /var/run/cloudera-scm-agent/process | grep 'logs\|NODEMANAGER'
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs/stderr.log
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs/stdout.log
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs/stderr.log
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs/stdout.log
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/cloudera-monitor.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/cloudera-stack-monitor.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/container-executor.cfg
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/core-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/event-filter-rules.json
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hadoop-metrics2.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hadoop-policy.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hdfs-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/http-auth-signature-secret
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/log4j.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/mapred-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/redaction-rules.json
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/ssl-client.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/ssl-server.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/topology.map
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/topology.py
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/yarn.keytab
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/yarn-site.xml
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs/stderr.log
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs/stdout.log

So we are using Cloudera Manager. When restarting the cluster, Yarn fails to start but all other services start OK.

When we drill into the Yarn 'instances' we find:

JobHistory Server (running)
ResourceManager (Active) (running)
ResourceManager (Standby) (running)
NodeManager (running)
NodeManager (stopped)
NodeManager (stopped)
NodeManager (stopped)
NodeManager (stopped)

with Status: 4 "Down" and 4 "Good Health".

If we select one of the stopped NodeManager instances and attempt to manually start it, the above pastebin logs are what we see. There is no log directory created, and thus no stderr. Cloudera Manager waits for it to start but eventually marks it failed.

We are planning to deploy HA to production, and this is our test run on the QA lab system. This failure is now blocking us from proceeding with our production HA deployment. Frankly, we don't even use Yarn (or MapReduce); at this point we only use HDFS and Impala. Yarn seems to be a dependency for Hive and Impala. If we are not using Yarn/MR and we can decommission these 4 failed NM instances, can the system run with a single HA pair of ResourceManagers and just one instance of NodeManager? (It would at least make Cloudera Manager happy, with green status and no failures upon cluster restarts.) Thanks.
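A few hedged places to dig for why the agent never materializes a logs/ directory for the NodeManager; the paths below are the standard CM 5 locations and may differ on your install:

$ grep -B2 -A10 'Failed to activate' /var/log/cloudera-scm-agent/cloudera-scm-agent.log   # context around the agent-side activation failure
$ ls -ld /var/run/cloudera-scm-agent/process/*NODEMANAGER*                                # one numbered directory per start attempt
$ service cloudera-scm-agent status                                                       # confirm the agent itself is healthy on the failing hosts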
12-13-2016
06:49 PM
# uname -a
Linux hostname 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)
CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated

While reconfiguring for high availability, 4 of 5 NodeManagers now won't start. There is no stderr file.

http://pastebin.com/iu4hR03Q
http://pastebin.com/8StbBsj4

I've tried removing the roles, then re-adding the roles. Deleted all files in:

/var/lib/hadoop-yarn/yarn-nm-recovery/
/var/yarn/

Confirmed the owners matched the working node. No luck so far.
12-08-2016
07:14 AM
Increasing the Catalog Server heap resolved this problem. However, there should be a JIRA opened against the Impala daemon: if the Catalog Server misbehaves, the Impala daemon should not have queries stuck 'in flight' forever while consuming one CPU at 100%. (It consumes an entire CPU for every stuck query.)
12-08-2016
07:12 AM
Increasing the Catalog Server heap resolved this problem.
12-05-2016
12:22 PM
Found an out-of-memory heap error in catalogd, which is weird, as we've been restarting the individual Impala daemon with the stuck 'in flight' queries to recover from this problem, not the Catalog Server. Does the Cloudera Agent restart it automatically? The heap has been increased 4x, so we are in waiting mode to see if this resolves it. Thanks.
12-05-2016
09:49 AM
Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)
Parcel: CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated

$ uname -a
Linux hostname_redacted 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

We initially thought we were exceeding impala-shell resources with our insert-select statement moving external CSV data into an internal Parquet table; however, now a simple 'compute incremental stats tablename' has become stuck as well. This is causing us grief in our production environment, and we are having to constantly check port 25000 and manually restart the particular Impala daemon spinning the CPU. Luckily our insert scripts are fault tolerant and just repeat on failure (but once all CPUs are consumed spinning, we are dead in the water). We are not sure, but this seems to have started after we upgraded from 5.7.1 to 5.8.2.

In the logs, immediately after the 'stuck' query there is always this error:

I1204 03:30:03.958894 7150 Frontend.java:875] analyze query compute incremental stats tablename
I1204 03:30:03.959247 7150 Frontend.java:819] Requesting prioritized load of table(s): default.tablename
I1204 03:32:03.970648 7150 Frontend.java:894] Missing tables were not received in 120000ms. Load request will be retried.
I1204 03:32:03.970940 7150 Frontend.java:819] Requesting prioritized load of table(s): default.tablename
I1204 03:32:37.981461 7142 jni-util.cc:166] com.cloudera.impala.catalog.CatalogException: Detected catalog service ID change. Aborting updateCatalog()
    at com.cloudera.impala.catalog.ImpaladCatalog.updateCatalog(ImpaladCatalog.java:120)
    at com.cloudera.impala.service.Frontend.updateCatalogCache(Frontend.java:227)
    at com.cloudera.impala.service.JniFrontend.updateCatalogCache(JniFrontend.java:180)
I1204 03:32:37.983515 7142 status.cc:111] CatalogException: Detected catalog service ID change. Aborting updateCatalog()
    @ 0x80f2c9 (unknown) @ 0xb37c30 (unknown) @ 0xa4e5cf (unknown) @ 0xa68ea9 (unknown) @ 0xb00a02 (unknown) @ 0xb068f3 (unknown) @ 0xd2bed8 (unknown) @ 0xd2b114 (unknown) @ 0x7dc26c (unknown) @ 0x1b208bf (unknown) @ 0x9b0a39 (unknown) @ 0x9b1492 (unknown) @ 0xb89327 (unknown) @ 0xb89c64 (unknown) @ 0xdee99a (unknown) @ 0x3f37a07aa1 (unknown) @ 0x3f376e893d (unknown)
E1204 03:32:37.983541 7142 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
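Two hedged checks for spotting the stuck statements without clicking through each box; port 25000 is the impalad debug web UI already mentioned above, and the page/field names should be confirmed against your release (<impalad-host> is a placeholder):

$ curl -s http://<impalad-host>:25000/queries | grep -i 'in flight'   # in-flight count shown on the debug page
$ ps -o pid,etime,pcpu,cmd -C impalad                                 # which impalad has been burning a CPU, and for how long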
12-03-2016
01:51 PM
We are getting stuck Impala 'insert' operations, which immediately afterwards log this error:

I1203 03:44:12.598379 25140 impala-beeswax-server.cc:171] query(): query=insert into table AParquetTable (params) partition select (more params) from AnExternalCsvTable
I1203 03:44:12.599138 25140 impala-beeswax-server.cc:544] TClientRequest.queryOptions: TQueryOptions { params }
I1203 03:44:12.600991 25140 Frontend.java:875] analyze query insert into table AParquetTable (params) partition select (more params) from AnExternalCsvTable
I1203 03:44:12.607447 25140 Frontend.java:819] Requesting prioritized load of table(s): default.aparquettable
I1203 03:44:43.429272 25124 jni-util.cc:166] com.cloudera.impala.catalog.CatalogException: Detected catalog service ID change. Aborting updateCatalog()
    at com.cloudera.impala.catalog.ImpaladCatalog.updateCatalog(ImpaladCatalog.java:120)
    at com.cloudera.impala.service.Frontend.updateCatalogCache(Frontend.java:227)
    at com.cloudera.impala.service.JniFrontend.updateCatalogCache(JniFrontend.java:180)
I1203 03:44:43.431560 25124 status.cc:111] CatalogException: Detected catalog service ID change. Aborting updateCatalog()
    @ 0x80f2c9 (unknown) @ 0xb37c30 (unknown) @ 0xa4e5cf (unknown) @ 0xa68ea9 (unknown) @ 0xb00a02 (unknown) @ 0xb068f3 (unknown) @ 0xd2bed8 (unknown) @ 0xd2b114 (unknown) @ 0x7dc26c (unknown) @ 0x1b208bf (unknown) @ 0x9b0a39 (unknown) @ 0x9b1492 (unknown) @ 0xb89327 (unknown) @ 0xb89c64 (unknown) @ 0xdee99a (unknown) @ 0x3d05e07aa1 (unknown) @ 0x3d05ae893d (unknown)
E1203 03:44:43.431589 25124 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
I1203 04:47:33.657946 25141 impala-server.cc:1676] Connection from client 151.214.100.169:41748 closed, closing 1 associated session(s)
I1203 04:47:33.658761 25141 status.cc:111] Session closed
    @ 0x80f2c9 (unknown) @ 0xa727b9 (unknown) @ 0xa72cb8 (unknown) @ 0x9aac90 (unknown) @ 0x1b1ca93 (unknown) @ 0x1b03b09 (unknown) @ 0x9b0a39 (unknown) @ 0x9b1492 (unknown) @ 0xb89327 (unknown) @ 0xb89c64 (unknown) @ 0xdee99a (unknown) @ 0x3d05e07aa1 (unknown) @ 0x3d05ae893d (unknown)
11-15-2016
02:24 PM
Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)
Parcel: CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated

$ uname -a
Linux hostname_redacted 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

We are seeing this error 2 to 10 times per day. What is this? Thank you.

Log file created at: 2016/11/03 18:32:33
Running on machine: hostname_redacted
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E1103 18:32:33.125335 3940 logging.cc:118] stderr will be logged to this file.
E1104 03:46:23.122285 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1106 03:36:54.086156 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1108 08:16:41.835364 4764 fe-support.cc:308] RPC client failed to connect: Couldn't open transport for hostname_redacted:26000 (connect() failed: Connection refused)
E1108 08:16:42.003527 4764 fe-support.cc:308] RPC client failed to connect: Couldn't open transport for hostname_redacted:26000 (connect() failed: Connection refused)
E1108 08:16:59.126739 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1108 09:32:24.577546 4766 fe-support.cc:308] RPC client failed to connect: Couldn't open transport for hostname_redacted:26000 (connect() failed: Connection refused)
E1108 09:32:24.595510 4766 fe-support.cc:308] RPC client failed to connect: Couldn't open transport for hostname_redacted:26000 (connect() failed: Connection refused)
E1108 09:32:40.664857 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1109 03:21:00.335299 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1109 07:47:05.972446 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1112 01:14:54.006551 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1112 03:15:03.835086 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
E1113 06:14:55.153087 4755 impala-server.cc:1339] There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID change. Aborting updateCatalog()
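The "catalog service ID change" message is generally what an impalad logs when it notices a different catalogd instance than before, so a hedged first check is simply whether catalogd (or statestored) keeps restarting around those timestamps:

$ ps -o pid,lstart,etime,cmd -C catalogd      # process start time vs. the error timestamps above
$ ps -o pid,lstart,etime,cmd -C statestored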
08-24-2016
03:27 PM
RedHat 6.7
Parcel: 5.7.0-1.cdh5.7.0.p0.45
RPMs:
cloudera-manager-server-5.7.0-1.cm570.p0.76.el6.x86_64
cloudera-manager-agent-5.7.0-1.cm570.p0.76.el6.x86_64
cloudera-manager-daemons-5.7.0-1.cm570.p0.76.el6.x86_64

The Host Monitor is running and the cluster is green, no issues. We are running parcels, so everything is the same version, but we see this error over and over in the NameNode log:

2016-08-24 16:06:51,497 WARN BlockStateChange: BLOCK* processReport: Report from the DataNode (dc43ee8e-42ea-4d17-afb0-dc4816c5e4ca) is unsorted. This will cause overhead on the NameNode which needs to sort the Full BR. Please update the DataNode to the same version of Hadoop HDFS as the NameNode (2.6.0-cdh5.7.0).
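A hedged sanity check that the DataNode named in the warning really is running the same parcel build as the NameNode; the parcel symlink below is the usual default location and worth confirming on your hosts:

$ hadoop version | head -1              # run on the NameNode host and on the DataNode host from the warning
$ readlink /opt/cloudera/parcels/CDH    # which parcel build this host is actually activated on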
08-24-2016
03:01 PM
1 Kudo
RedHat 6.7
Parcel: 5.7.0-1.cdh5.7.0.p0.45
RPMs:
cloudera-manager-server-5.7.0-1.cm570.p0.76.el6.x86_64
cloudera-manager-agent-5.7.0-1.cm570.p0.76.el6.x86_64
cloudera-manager-daemons-5.7.0-1.cm570.p0.76.el6.x86_64

The Host Monitor is running and the cluster is green, no issues. We saw other posts relating to this WARN message, but in our case the system was created recently and is basically doing nothing; it runs Impala and a shared filesystem, and so far we do not observe any failures of either of these features. Please advise on how to repair this issue and determine what failures it might be causing. The event log is full of chatter/noise, and this is one of the issues reported over and over.

2016-08-24 15:55:48,385 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.10.10.120:53028
2016-08-24 15:55:48,387 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x3567b06b3cc5dd5 with negotiated timeout 30000 for client /10.10.10.120:53028
2016-08-24 15:55:48,404 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.10.10.120:53032
2016-08-24 15:55:48,406 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.10.10.120:53032
2016-08-24 15:55:48,407 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x3567b06b3cc5dd6 with negotiated timeout 30000 for client /10.10.10.120:53032
2016-08-24 15:55:48,440 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1567b06b3ce5ea6
2016-08-24 15:55:48,441 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x3567b06b3cc5dd6
2016-08-24 15:55:48,442 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.10.10.120:53032 which had sessionid 0x3567b06b3cc5dd6
2016-08-24 15:55:48,443 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x2567b0773d75e32
2016-08-24 15:55:48,448 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x3567b06b3cc5dd5
2016-08-24 15:55:48,449 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x3567b06b3cc5dd5, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Unknown Source)
2016-08-24 15:55:48,449 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.10.10.120:53028 which had sessionid 0x3567b06b3cc5dd5
2016-08-24 15:56:48,413 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.10.10.120:53180
2016-08-24 15:56:48,413 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.10.10.120:53180
2016-08-24 15:56:48,414 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x3567b06b3cc5dd7 with negotiated timeout 30000 for client /10.10.10.120:53180
2016-08-24 15:56:48,450 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1567b06b3ce5ea8
2016-08-24 15:56:48,455 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x3567b06b3cc5dd7
2016-08-24 15:56:48,456 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.10.10.120:53180 which had sessionid 0x3567b06b3cc5dd7
2016-08-24 15:56:48,456 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x2567b0773d75e33
2016-08-24 15:56:48,459 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1567b06b3ce5ea7
2016-08-24 15:57:53,389 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.10.10.120:53342
2016-08-24 15:57:53,390 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.10.10.120:53342
2016-08-24 15:57:53,391 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x3567b06b3cc5dd8 with negotiated timeout 30000 for client /10.10.10.120:53342
2016-08-24 15:57:53,416 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.10.10.120:53348
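This connect-then-close-every-minute pattern from a single host is often just periodic monitoring rather than a fault, but a hedged way to confirm which process on 10.10.10.120 keeps opening these short-lived sessions (run on that host; 2181 is assumed to be the ZooKeeper client port):

$ netstat -tanp | grep ':2181 '    # map the ephemeral client ports seen in the log above to a PID/program name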
08-24-2016
02:43 PM
RedHat 6.7
Parcel: 5.7.0-1.cdh5.7.0.p0.45
RPMs:
cloudera-manager-server-5.7.0-1.cm570.p0.76.el6.x86_64
cloudera-manager-agent-5.7.0-1.cm570.p0.76.el6.x86_64
cloudera-manager-daemons-5.7.0-1.cm570.p0.76.el6.x86_64

The Host Monitor is running and the cluster is green, no issues. However, the event log is full of chatter/noise, and this is one of the issues reported over and over.

# tail -n 100 /var/log/cloudera-scm-firehose/mgmt-cmf-mgmt-HOSTMONITOR-hostname.log.out
2016-08-24 15:45:47,969 ERROR com.cloudera.cmf.BasicScmProxy: Failed request to SCM: 302
2016-08-24 15:45:48,969 INFO com.cloudera.cmf.BasicScmProxy: Authentication to SCM required.
2016-08-24 15:45:49,027 INFO com.cloudera.cmf.BasicScmProxy: Using encrypted credentials for SCM
2016-08-24 15:45:49,031 INFO com.cloudera.cmf.BasicScmProxy: Authenticated to SCM.
2016-08-24 15:47:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Running the LDBTimeSeriesRollupManager at 2016-08-24T20:47:55.591Z, forMigratedData=false
2016-08-24 15:49:56,764 INFO com.cloudera.cmon.tstore.leveldb.LDBResourceManager: Closed: 0 partitions
2016-08-24 15:52:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Running the LDBTimeSeriesRollupManager at 2016-08-24T20:52:55.591Z, forMigratedData=false
2016-08-24 15:52:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Starting rollup from raw to rollup=TEN_MINUTELY for rollupTimestamp=2016-08-24T20:50:00.000Z
2016-08-24 15:52:56,438 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Finished rollup: duration=PT0.847S, numStreamsChecked=41065, numStreamsRolledUp=3228
2016-08-24 15:57:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Running the LDBTimeSeriesRollupManager at 2016-08-24T20:57:55.591Z, forMigratedData=false
2016-08-24 15:59:56,767 INFO com.cloudera.cmon.tstore.leveldb.LDBResourceManager: Closed: 0 partitions
2016-08-24 16:02:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Running the LDBTimeSeriesRollupManager at 2016-08-24T21:02:55.591Z, forMigratedData=false
2016-08-24 16:02:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Starting rollup from raw to rollup=TEN_MINUTELY for rollupTimestamp=2016-08-24T21:00:00.000Z
2016-08-24 16:02:56,527 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Finished rollup: duration=PT0.936S, numStreamsChecked=41065, numStreamsRolledUp=3228
2016-08-24 16:02:56,527 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Starting rollup from ts_stream_rollup_PT600S to rollup=HOURLY for rollupTimestamp=2016-08-24T21:00:00.000Z
2016-08-24 16:02:57,381 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Finished rollup: duration=PT0.854S, numStreamsChecked=41065, numStreamsRolledUp=3228
2016-08-24 16:07:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Running the LDBTimeSeriesRollupManager at 2016-08-24T21:07:55.591Z, forMigratedData=false
2016-08-24 16:09:56,772 INFO com.cloudera.cmon.tstore.leveldb.LDBResourceManager: Closed: 0 partitions
2016-08-24 16:12:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Running the LDBTimeSeriesRollupManager at 2016-08-24T21:12:55.591Z, forMigratedData=false
2016-08-24 16:12:55,591 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Starting rollup from raw to rollup=TEN_MINUTELY for rollupTimestamp=2016-08-24T21:10:00.000Z
2016-08-24 16:12:56,282 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Finished rollup: duration=PT0.691S, numStreamsChecked=41065, numStreamsRolledUp=3228
2016-08-24 16:16:47,102 ERROR com.cloudera.cmf.BasicScmProxy: Failed request to SCM: 302
2016-08-24 16:16:48,103 INFO com.cloudera.cmf.BasicScmProxy: Authentication to SCM required.
2016-08-24 16:16:48,160 INFO com.cloudera.cmf.BasicScmProxy: Using encrypted credentials for SCM
2016-08-24 16:16:48,165 INFO com.cloudera.cmf.BasicScmProxy: Authenticated to SCM.
08-08-2016
12:03 PM
We are sure the NameNodes failed over, but it's difficult to find this in the logs as there is so much noise in there (that, and we were doing testing with starting/stopping the system). Are there keywords to search for when this happens? We also cannot find any kind of failover history in the SCM portal; the UI is really lacking in this area. Any suggestions on how to track down a proper instance of the failover event would be appreciated (a keyword in a log, etc.).
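A hedged set of greps to start from; the exact phrases differ slightly between releases, and the log file names below follow the hadoop-cmf naming shown elsewhere in this history:

$ grep -iE 'active state|standby state' /var/log/hadoop-hdfs/*NAMENODE*.log*              # NameNode state transitions
$ grep -iE 'failover|transition|elect' /var/log/hadoop-hdfs/*FAILOVERCONTROLLER*.log*     # ZKFC election / fencing activity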
08-08-2016
11:55 AM
http://pastebin.com/gRaPXLiN http://pastebin.com/3d7CSuSG http://pastebin.com/DCGr2EBB http://pastebin.com/N2FZ33nH
08-08-2016
08:04 AM
We have two NameNodes (inf-v1 & inf-v2), and occasionally we've observed the active NameNode become standby and the standby become active (with no changes by us). How do we determine the cause of this?

2016-06-30 10:48:35,376 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: -- Local NN thread dump --
Process Thread Dump:
15 active threads
Thread 23 (Timer-0):
State: TIMED_WAITING
Blocked count: 0
Waited count: 1
Stack:
java.lang.Object.wait(Native Method)
java.util.TimerThread.mainLoop(Unknown Source)
java.util.TimerThread.run(Unknown Source)
Thread 22 (229116977@qtp-1041125371-1 - Acceptor0 HttpServer2$SelectChannelConnectorWithSafeStartup@inf-v2.archive.com:50070):
State: RUNNABLE
Blocked count: 1
Waited count: 1
Stack:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(Unknown Source)
sun.nio.ch.EPollSelectorImpl.doSelect(Unknown Source)
sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
sun.nio.ch.SelectorImpl.select(Unknown Source)
org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:511)
org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:193)
org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Thread 21 (53050680@qtp-1041125371-0):
State: RUNNABLE
Blocked count: 2
Waited count: 1
Stack:
sun.management.ThreadImpl.getThreadInfo1(Native Method)
sun.management.ThreadImpl.getThreadInfo(Unknown Source)
sun.management.ThreadImpl.getThreadInfo(Unknown Source)
org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:165)
org.apache.hadoop.http.HttpServer2$StackServlet.doGet(HttpServer2.java:1155)
javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1286)
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
Thread 20 (pool-2-thread-1):
State: TIMED_WAITING
Blocked count: 0
Waited count: 1
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
Thread 18 (org.apache.hadoop.util.JvmPauseMonitor$Monitor@28badca2):
State: TIMED_WAITING
Blocked count: 1
Waited count: 7
Stack:
java.lang.Thread.sleep(Native Method)
org.apache.hadoop.util.JvmPauseMonitor$Monitor.run(JvmPauseMonitor.java:181)
java.lang.Thread.run(Unknown Source)
Thread 17 (Timer for 'NameNode' metrics system):
State: TIMED_WAITING
Blocked count: 0
Waited count: 1
Stack:
java.lang.Object.wait(Native Method)
java.util.TimerThread.mainLoop(Unknown Source)
java.util.TimerThread.run(Unknown Source)
Thread 14 (process reaper):
State: TIMED_WAITING
Blocked count: 1
Waited count: 2
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(Unknown Source)
java.util.concurrent.SynchronousQueue$TransferStack.transfer(Unknown Source)
java.util.concurrent.SynchronousQueue.poll(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
Thread 13 (event filterer):
State: TIMED_WAITING
Blocked count: 0
Waited count: 74
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
java.util.concurrent.LinkedBlockingQueue.poll(Unknown Source)
com.cloudera.cmf.eventcatcher.client.logs.LogEventProcessor.runFiltering(LogEventProcessor.java:132)
com.cloudera.cmf.eventcatcher.client.logs.LogEventProcessor.access$000(LogEventProcessor.java:28)
com.cloudera.cmf.eventcatcher.client.logs.LogEventProcessor$1.run(LogEventProcessor.java:81)
Thread 12 (RMI TCP Accept-0):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.net.PlainSocketImpl.socketAccept(Native Method)
java.net.AbstractPlainSocketImpl.accept(Unknown Source)
java.net.ServerSocket.implAccept(Unknown Source)
java.net.ServerSocket.accept(Unknown Source)
sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(Unknown Source)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(Unknown Source)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
Thread 11 (RMI TCP Accept-8064):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.net.PlainSocketImpl.socketAccept(Native Method)
java.net.AbstractPlainSocketImpl.accept(Unknown Source)
java.net.ServerSocket.implAccept(Unknown Source)
java.net.ServerSocket.accept(Unknown Source)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(Unknown Source)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
Thread 10 (RMI TCP Accept-0):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.net.PlainSocketImpl.socketAccept(Native Method)
java.net.AbstractPlainSocketImpl.accept(Unknown Source)
java.net.ServerSocket.implAccept(Unknown Source)
java.net.ServerSocket.accept(Unknown Source)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(Unknown Source)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
Thread 5 (Signal Dispatcher):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
Thread 3 (Finalizer):
State: WAITING
Blocked count: 5
Waited count: 2
Waiting on java.lang.ref.ReferenceQueue$Lock@d17ffd
Stack:
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(Unknown Source)
java.lang.ref.ReferenceQueue.remove(Unknown Source)
java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
Thread 2 (Reference Handler):
State: WAITING
Blocked count: 1
Waited count: 2
Waiting on java.lang.ref.Reference$Lock@3a29bb76
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)
Thread 1 (main):
State: RUNNABLE
Blocked count: 4
Waited count: 3
Stack:
java.util.zip.ZipFile.getEntry(Native Method)
java.util.zip.ZipFile.getEntry(Unknown Source)
java.util.jar.JarFile.getEntry(Unknown Source)
java.util.jar.JarFile.getJarEntry(Unknown Source)
sun.misc.URLClassPath$JarLoader.getResource(Unknown Source)
sun.misc.URLClassPath$JarLoader.findResource(Unknown Source)
sun.misc.URLClassPath.findResource(Unknown Source)
java.net.URLClassLoader$2.run(Unknown Source)
java.net.URLClassLoader$2.run(Unknown Source)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findResource(Unknown Source)
java.lang.ClassLoader.getResource(Unknown Source)
java.net.URLClassLoader.getResourceAsStream(Unknown Source)
org.apache.xerces.parsers.SecuritySupport$6.run(Unknown Source)
java.security.AccessController.doPrivileged(Native Method)
org.apache.xerces.parsers.SecuritySupport.getResourceAsStream(Unknown Source)
org.apache.xerces.parsers.ObjectFactory.findJarServiceProvider(Unknown Source)
org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
-- Local NN thread dump --
- Tags:
- HA
08-08-2016
07:56 AM
These errors are filling the event log but do not seem to cause any issues, and all health is reported green.

WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:impala (auth:SIMPLE) cause:org.apache.hadoop.hdfs.protocol.AclException: The ACL operation has been rejected. Support for ACLs has been disabled by setting dfs.namenode.acls.enabled to false.
org.apache.hadoop.ipc.Server: IPC Server handler 11 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getAclStatus from 10.10.10.80:46973 Call#8 Retry#0: org.apache.hadoop.hdfs.protocol.AclException: The ACL operation has been rejected. Support for ACLs has been disabled by setting dfs.namenode.acls.enabled to false.

We would like to fix the root cause to reduce noise in the logs. Thanks.
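The message means a client (here the impala user) is calling getAclStatus while ACLs are disabled on the NameNode, so the usual directions are either to enable ACLs (dfs.namenode.acls.enabled) in the HDFS configuration or to accept the WARN as harmless. A hedged check of what the running NameNode actually has deployed, reusing the agent process-directory layout that appears elsewhere in this history (the property may simply be absent when it is left at its default of false):

$ grep -A1 'dfs.namenode.acls.enabled' /var/run/cloudera-scm-agent/process/*NAMENODE*/hdfs-site.xml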
06-30-2016
07:18 AM
Increased the heap, and safe mode now lasts 17 minutes, which is an improvement. Now that we have switched from RPMs to parcel packages, we will implement high availability for the name service (NN x 2 instead of NN + SNN) and adopt a restart procedure with manual failovers, which hopefully will avoid this long restart going forward. Thanks.
06-14-2016
08:48 AM
These four example messages appear over and over in the NameNode log:

2016-06-12 00:15:06,486 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.10.10.84:50010 is added to blk_1074147634_1099546079234{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-6cb61d85-d3d4-4e98-ae2d-0313ab8941c4:NORMAL:10.10.10.83:50010|RBW], ReplicaUnderConstruction[[DISK]DS-d7077514-495d-42b2-a2ec-9f76fcc43db4:NORMAL:10.10.10.80:50010|RBW], ReplicaUnderConstruction[[DISK]DS-4b57fed3-06c0-444e-b217-e67f36edff97:NORMAL:10.10.10.84:50010|RBW]]} size 0
2016-06-12 15:23:56,496 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1074163953_1099546095553 10.10.10.85:50010 10.10.10.84:50010 10.10.10.83:50010
2016-06-12 15:24:27,731 INFO BlockStateChange: BLOCK* BlockManager: ask 10.10.10.80:50010 to delete [blk_1074163959_1099546095559]
2016-06-13 12:12:19,705 INFO BlockStateChange: BLOCK* ask 10.10.10.83:50010 to replicate blk_1073950158_1099545881758 to datanode(s) 10.10.10.85:50010

SCM 5.7.0 & CDH 5.7.0 on RedHat 6.7. Another test system, almost identical to the trouble system, can restart the cluster in about 5 minutes. $ hdfs fsck / reports a healthy system and 0 under-replicated blocks. This system always recovers, but every restart is taking an hour. We have redundant directories on different partitions for the NameNode fsimage, along with another host running the Secondary NameNode. Checkpoints complete once safe mode is off.
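A hedged way to watch what the NameNode is actually waiting on during the slow start; 50070 is the default NameNode web port (it also appears in the thread dump elsewhere in this history), and <namenode-host> is a placeholder:

$ hdfs dfsadmin -safemode get
$ curl -s http://<namenode-host>:50070/jmx | grep -E '"Safemode"|UnderReplicatedBlocks|NumLiveDataNodes'   # block-report and replication progress as the DataNodes check in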
06-14-2016
07:38 AM
CDH 5.7.0 is not having this issue.
04-11-2016
01:08 PM
Previously we were on CDH 4.7.1 with SCM 4.8.6 and did not experience this problem. We are currently testing CDH 5.5.1 with SCM 5.5.1. The Hive Metastore is running on the same host as Oracle, started via service cloudera-scm-agent on runlevels 345. After a reboot, the Hive Metastore gets stuck:

[main]: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:oracle:thin:@ //i11-db-01:1521/odpdd, username = hiveuser. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection

Once in this condition I can restart the 'hive1' service via the SCM portal; then Impala and the SCM portal return to functioning properly. As a test I disabled cloudera-scm-agent with:

# chkconfig cloudera-scm-agent off

then did a reboot, monitored Oracle connection availability from a separate workstation using 'SQuirreL SQL Client', and once it succeeded performed:

# service cloudera-scm-agent start

It came up fine. We guess that even though Oracle starts on runlevel 2, it is not ready for connections when the Hive Metastore begins, and the Hive Metastore does not gracefully handle this scenario. Any suggestions? What about this 'lazyInit' parameter? (One possible workaround is sketched after the stack trace below.) Thank you. The full log follows:

Apr 11, 2:12:30.623 PM INFO org.apache.hadoop.hive.metastore.HiveMetaStore
[main]: Starting hive metastore on port 9083
Apr 11, 2:12:34.235 PM INFO org.apache.hadoop.hive.metastore.HiveMetaStore
[main]: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
Apr 11, 2:12:35.866 PM INFO org.apache.hadoop.hive.metastore.ObjectStore
[main]: ObjectStore, initialize called
Apr 11, 2:12:37.265 PM INFO DataNucleus.Persistence
[main]: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
Apr 11, 2:12:37.265 PM INFO DataNucleus.Persistence
[main]: Property datanucleus.cache.level2 unknown - will be ignored
Apr 11, 2:12:40.032 PM ERROR DataNucleus.Datastore.Schema
[main]: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:oracle:thin:@ //i11-db-01:1521/odpdd, username = hiveuser. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------ java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:419)
at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:536)
at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:228)
at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)
at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:521)
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:416)
at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120)
at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:501)
at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:298)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:406)
at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:435)
at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:330)
at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:286)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:56)
at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:65)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:596)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:574)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:623)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:464)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5775)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5770)
at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:6022)
at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:5947)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: oracle.net.ns.NetException: The Network Adapter could not establish the connection
at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:375)
at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:422)
at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:678)
at oracle.net.ns.NSProtocol.connect(NSProtocol.java:238)
at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1054)
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:308)
... 56 more
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:209)
at oracle.net.nt.ConnOption.connect(ConnOption.java:123)
at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:353)
... 61 more
------
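One possible workaround, sketched under the assumption that bash is available at boot and that delaying the agent until Oracle's listener answers on i11-db-01:1521 is acceptable; this is not a supported Cloudera mechanism, just a boot-order shim to test the theory above (disable the agent's own init script first with: chkconfig cloudera-scm-agent off, then run the snippet as root at boot, e.g. from /etc/rc.local):

until (echo > /dev/tcp/i11-db-01/1521) 2>/dev/null; do
    sleep 5   # keep probing the Oracle listener until it accepts TCP connections
done
service cloudera-scm-agent start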