Member since: 12-01-2016
Posts: 25
Kudos Received: 1
Solutions: 1

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1703 | 03-26-2017 03:54 AM |
08-04-2017
12:38 PM
@Jay SenSharma Thanks for getting back on this. The Ambari Agent details are as follows:
]$ ambari-agent --version
2.2.1.0
]$ rpm -qa | grep ambari-agent
ambari-agent-2.2.1.0-161.x86_64
It does seem that the issue described in the JIRA is relevant to what occurred here. So far this issue has happened only once, but migrating looks like a good option to avoid it in the future. Also, I had mentioned that NameNode CPU WIO was N/A; after a few hours I am able to see the metric on the Dashboard again.
08-04-2017
10:58 AM
The issue started with an Alert on Hive Metastore Service: Metastore on dh01.int.belong.com.au failed (Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/alerts/alert_hive_metastore.py", line 183, in execute
timeout=int(check_command_timeout) )
File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
self.env.run()
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 158, in run
self.run_action(resource, action)
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 121, in run_action
provider_action()
File "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py", line 238, in action_run
tries=self.resource.tries, try_sleep=self.resource.try_sleep)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
result = function(command, **kwargs)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
tries=tries, try_sleep=try_sleep)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
raise Fail(err_msg)
Fail: Execution of 'export HIVE_CONF_DIR='/usr/hdp/current/hive-metastore/conf/conf.server' ; hive --hiveconf hive.metastore.uris=thrift://dh01.int.belong.com.au:9083 --hiveconf hive.metastore.client.connect.retry.delay=1 --hiveconf hive.metastore.failure.retries=1 --hiveconf hive.metastore.connect.retries=1 --hiveconf hive.metastore.client.socket.timeout=14 --hiveconf hive.execution.engine=mr -e 'show databases;'' returned 5. Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000002c0000000, 977797120, 0) failed; error='Cannot allocate memory' (errno=12)
Unable to determine Hadoop version information.
'hadoop version' returned:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000002c0000000, 977797120, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 977797120 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ambari-qa/hs_err_pid4858.log
)
I tried launching Hive from the command prompt with sudo hive; this errored out with a Java Runtime Environment exception. I then looked at memory utilization, which indicated that swap had run out:
]$ free -m
             total       used       free     shared    buffers     cached
Mem:         64560      63952        607          0         77        565
-/+ buffers/cache:      63309       1251
Swap:         1023       1023          0
I tried to restart the Hive Metastore service from Ambari, but that operation hung for over 30 minutes without printing anything to the stdout and stderr logs. At this point I involved a server administrator in the investigation, and it was revealed that the following process had reserved up to 40 GB, which seemed strange (I am not sure what the normal memory utilization pattern for the Ambari Agent/Monitor is):
root 3424 3404 14 2016 ? 52-22:05:00 /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/main.py start
At this point I tried to restart the Ambari Metrics service on the NameNode from Ambari; the operation timed out and then the heartbeat from the node stopped, as can be seen in the image. I was not able to restart the Ambari Metrics service on the NameNode from the Ambari console, as the option was disabled. I tried a rolling restart of all Ambari Metrics Monitor services, but the monitor service on the NameNode did not start. At this point we decided to do two things: add more swap space (the admin added 1 more GB) and then stop and start the Ambari Metrics services as follows:
#The stop operation did not succeed on the first attempt and I had to kill the PID
sudo su - ams -c '/usr/sbin/ambari-metrics-monitor --config /etc/ambari-metrics-monitor/conf stop'
sudo su - ams -c '/usr/sbin/ambari-metrics-monitor --config /etc/ambari-metrics-monitor/conf start'
#I looked at the agent status
sudo ambari-agent status
#The agent was not running, hence I started the agent
sudo ambari-agent start
After the agent start, the monitor on this node was up and reflected in Ambari. The only issue I have now is that NameNode CPU WIO is N/A on the Ambari Dashboard; it would be helpful to know how to get this metric back. Also, I intend to review the HiveServer2 and Metastore heap sizes, which currently stand at the values below. Could these settings cause this issue where swap runs out? This has not happened before.
HiveServer2 Heap Size = 20480 MB
Metastore Heap Size = 12288 MB
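For reference, a minimal sketch of how the extra 1 GB of swap could have been added on a Linux host (the file path /swapfile1 is an assumption for illustration; this is not necessarily the exact procedure the admin used):
#Sketch only: create and enable a 1 GB swap file
sudo fallocate -l 1G /swapfile1   #path is an assumption
sudo chmod 600 /swapfile1
sudo mkswap /swapfile1
sudo swapon /swapfile1
#Verify that the extra swap is visible
free -m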
Environment Information:
Hadoop 2.7.1.2.4.0.0-169
hive-meta-store - 2.4.0.0-169
hive-server2 - 2.4.0.0-169
hive-webhcat - 2.4.0.0-169
Ambari 2.2.1.0
RAM: 64 GB
Helpful links:
https://community.hortonworks.com/questions/15862/how-can-i-start-my-ambari-heartbeat.html
https://cwiki.apache.org/confluence/display/AMBARI/Metrics
Labels:
Apache Ambari
07-21-2017
07:32 AM
@mqureshi Thanks for getting back. I have reduced the HiveServer2 heap size to 20 GB and am observing the behavior; I intend to reduce it to 12 GB step-wise over the coming days.
07-19-2017
07:48 AM
I am facing Hive errors intermittently.
Garbage collection issues are indicated in the logs:
hiveserver2: @dh01 hive]$ cat hiveserver2.log | grep 'GC'
at org.apache.tez.dag.api.client.DAGClientHandler.submitDAG(DAGClientHandler.java:118)
at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.submitDAG(DAGClientAMProtocolBlockingPBServerImpl.java:163)
at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7471)
2017-07-17 14:00:22,815 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@59fc6d05]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1913ms
GC pool 'PS Scavenge' had collection(s): count=1 time=1961ms
2017-07-17 14:14:28,531 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@59fc6d05]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1452ms
GC pool 'PS Scavenge' had collection(s): count=1 time=1701ms
2017-07-17 15:04:32,309 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@59fc6d05]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1838ms
GC pool 'PS Scavenge' had collection(s): count=1 time=2195ms
2017-07-17 16:08:45,121 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@59fc6d05]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1568ms
GC pool 'PS Scavenge' had collection(s): count=1 time=1707ms
hivemetastore: @dh01 hive]$ cat hivemetastore.log | grep -i "GC pool"
GC pool 'PS Scavenge' had collection(s): count=1 time=3521ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=11097ms
GC pool 'PS Scavenge' had collection(s): count=1 time=37ms
@dh01 hive]$ cat hivemetastore.log | grep -i "JvmPauseMonitor"
2017-07-19 04:26:50,008 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@4f85aca0]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 3050ms
2017-07-19 11:01:32,392 WARN [org.apache.hadoop.util.JvmPauseMonitor$Monitor@4f85aca0]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(191)) - Detected pause in JVM or host machine (eg GC): pause of approximately 10915ms
HiveServer2 Heap Size = 24210 MB (had been set already)
Metastore Heap Size = 12288 MB (changed from 8 GB previously).
Client Heap Size = 2 GB (changed from 1 GB previously).
I did read the article below and the links it provides, which was helpful: https://community.hortonworks.com/articles/14170/namenode-garbage-collection-configuration-best-pra.html
However, after making the indicated heap size changes, I still had instances where the HiveServer2 or Metastore service would go into alert in Ambari for a few seconds and then come back healthy. The logs (hive.out, hive.log, hive-server2.out, hive-server2.log, hivemetastore.log, hiveserver2.log) did not show any errors for these instances. Am I missing something? Would setting the HiveServer2 and Metastore heap sizes to the same value help, i.e. HiveServer2 Heap Size = 12288 MB?
Environment: Hadoop 2.7.1.2.4.0.0-169
hive-meta-store - 2.4.0.0-169
hive-server2 - 2.4.0.0-169
hive-webhcat - 2.4.0.0-169
Ambari 2.2.1.0
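To confirm whether these pauses correspond to full GCs, one option (a sketch only, assuming the HiveServer2/Metastore JVM options are picked up from the hive-env template in Ambari, which may differ in your setup; the log path is also an assumption) is to enable standard HotSpot GC logging and restart the services:
#Sketch: append GC-logging flags to the options the Hive services inherit
export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/hive/hiveserver2-gc.log"
The resulting GC log would show whether the roughly 10-second Metastore pause was a full (PS MarkSweep) collection, which would point at heap sizing rather than host memory pressure.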
Labels:
Apache Hive
06-29-2017
11:21 PM
Hi @ssathish, I did look at the link you posted and decided to delete the file.
CAUTION:
For some reason, a few hours later there were inconsistencies in the cluster. The data node (D5) where the cleanup was done showed corruption in the way containers were processed: some jobs whose containers were launched on D5 executed to completion successfully, while other jobs failed with a Vertex failed error. We could not find any errors in the ResourceManager, DataNode, or NodeManager logs. We had to remove D5 from the cluster and reinstall the NodeManager to set things right.
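For reference, a minimal sketch of how a node can be drained from YARN before such a reinstall (the exclude-file path and hostname are assumptions for illustration; the actual path is whatever yarn.resourcemanager.nodes.exclude-path points to in yarn-site.xml):
#Sketch only: add the host to the YARN exclude file and refresh the ResourceManager's node list
echo "dh05.int.belong.com.au" | sudo tee -a /etc/hadoop/conf/yarn.exclude
sudo -u yarn yarn rmadmin -refreshNodes
#Once containers drain off the node, stop the NodeManager on that host before reinstalling it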
06-26-2017
02:29 AM
I have a disk running full on one of my data nodes:
[ayguha@dh03 hadoop]$ sudo du -h --max-depth=1
674G ./hdfs
243G ./yarn
916G .
[xx@dh03 local]$ sudo du -h --max-depth=1
1.4G ./filecache
3.2G ./usercache
68K ./nmPrivate
242G .
There are over 1k tmp files accumulating in /data/hadoop/yarn/local:
[ayguha@dh03 local]$ ls -l *.tmp | wc -l
1055
./optimized-preview-record-buffer-2808068b-4d54-492e-a31a-385065d25a408826610818023522318.tmp
./preview-record-buffer-24a7477f-01f0-427e-a032-54866df48b197825057363055390034.tmp
./preview-record-buffer-b22020bb-6ec2-4f73-9d65-65dbba50136e527236496621902098.tmp
[ayguha@dh03 local]$ find ./*preview-record-buffer* -type f -mtime +90 | wc -l
973
There are nearly 1k files that are older than 3 months. Is it safe to delete these files?
ENV:
Hadoop 2.7.1.2.4.0.0-169
HDP 2.4
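If deletion does turn out to be safe, a cautious first step would be to only enumerate the stale files before touching them (a sketch; it lists candidates and removes nothing):
#Sketch: list preview-record-buffer .tmp files older than 90 days without deleting anything
cd /data/hadoop/yarn/local
find . -maxdepth 1 -type f -name '*preview-record-buffer*.tmp' -mtime +90 -print
#Only after confirming that no running containers reference these files would a -delete (or a move to an archive directory) be considered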
Labels:
Apache YARN
05-29-2017
06:17 AM
@mqureshi
The cluster currently has only one active NameNode.
Is there a better way to find out the active node?
I used the following as well, but it does not distinguish between the two:
dh01 ~]$ curl --user admin:admin http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/host_components?HostRoles/component_name=NAMENODE&metrics/dfs/FSNamesystem/HAState=active
[1] 16533
-bash: metrics/dfs/FSNamesystem/HAState=active: No such file or directory
[ayguha@dh01 ~]$ {
"href" : "http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/host_components?HostRoles/component_name=NAMENODE",
"items" : [
{
"href" : "http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/hosts/dh01.int.belong.com.au/host_components/NAMENODE",
"HostRoles" : {
"cluster_name" : "belong1",
"component_name" : "NAMENODE",
"host_name" : "dh01.int.belong.com.au"
},
"host" : {
"href" : "http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/hosts/dh01.int.belong.com.au"
}
},
{
"href" : "http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/hosts/dh02.int.belong.com.au/host_components/NAMENODE",
"HostRoles" : {
"cluster_name" : "belong1",
"component_name" : "NAMENODE",
"host_name" : "dh02.int.belong.com.au"
},
"host" : {
"href" : "http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/hosts/dh02.int.belong.com.au"
}
}
]
}
Also hdfs-site.xml does not have the property dfs.namenode.rpc-address.
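Two hedged observations (not from the original thread): the [1] 16533 and "No such file or directory" lines above suggest that the unquoted & in the URL was interpreted by bash as a background operator, so the HAState filter never reached Ambari. Quoting the URL, or asking HDFS directly, should distinguish the active NameNode; a minimal sketch (the nn1/nn2 IDs are taken from dfs.ha.namenodes.belongcluster1 elsewhere in this thread):
#Quote the URL so the shell does not split it at &
curl --user admin:admin 'http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/host_components?HostRoles/component_name=NAMENODE&metrics/dfs/FSNamesystem/HAState=active'
#Or ask HDFS directly which NameNode is active
sudo -u hdfs hdfs haadmin -getServiceState nn1
sudo -u hdfs hdfs haadmin -getServiceState nn2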
05-29-2017
05:36 AM
@mqureshi Command: I tried it directly, without pushing it to the background.
sudo -u hdfs hdfs balancer -fs hdfs://belongcluster1:8020 -threshold 5
[ayguha@dh01 ~]$ sudo -u hdfs hdfs balancer -fs hdfs://belongcluster1:8020 -threshold 5
17/05/29 15:29:39 INFO balancer.Balancer: Using a threshold of 5.0
17/05/29 15:29:39 INFO balancer.Balancer: namenodes = [hdfs://belongcluster1, hdfs://belongcluster1:8020]
17/05/29 15:29:39 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 5.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false]
17/05/29 15:29:39 INFO balancer.Balancer: included nodes = []
17/05/29 15:29:39 INFO balancer.Balancer: excluded nodes = []
17/05/29 15:29:39 INFO balancer.Balancer: source nodes = []
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
17/05/29 15:29:41 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
17/05/29 15:29:41 INFO block.BlockTokenSecretManager: Setting block keys
17/05/29 15:29:41 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
17/05/29 15:29:42 INFO block.BlockTokenSecretManager: Setting block keys
17/05/29 15:29:42 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
17/05/29 15:29:42 INFO block.BlockTokenSecretManager: Setting block keys
17/05/29 15:29:42 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
java.io.IOException: Another Balancer is running.. Exiting ...
May 29, 2017 3:29:42 PM Balancing took 3.035 seconds
Error:
17/05/29 15:29:42 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
java.io.IOException: Another Balancer is running.. Exiting ...
I also checked whether a balancer process was stuck; from the output below, it does not look like anything is hanging from previous tries.
dh01 ~]$ ps -ef | grep "balancer"
ayguha 4611 2551 0 15:34 pts/0 00:00:00 grep balancer
dh01 ~]$ hdfs dfs -ls /system/balancer.id
ls: `/system/balancer.id': No such file or directory
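One possible reading of the log above (a suggestion, not something confirmed in this thread): the balancer reports namenodes = [hdfs://belongcluster1, hdfs://belongcluster1:8020], so passing -fs with the port in addition to the configured nameservice may make it treat the same HA namespace as two targets, and the second connection then finds the lock held by the first, hence "Another Balancer is running". A sketch of invoking it against the nameservice only:
#Sketch: let the balancer resolve the single HA nameservice from the client configuration
sudo -u hdfs hdfs balancer -threshold 5
#or, equivalently, pass the nameservice without the port
sudo -u hdfs hdfs balancer -fs hdfs://belongcluster1 -threshold 5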
05-29-2017
03:40 AM
@mqureshi
I found another thread with similar issue:
https://community.hortonworks.com/questions/22105/hdfs-balancer-is-getting-failed-after-30-mins-in-a.html
Here they indicate that if HA is enabled, one would need to remove dfs.namenode.rpc-address.
I ran a check on the Ambari Server using configs.sh:
/var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin -port 8080 get dh01.int.belong.com.au belong1 hdfs-site
The output does not contain the dfs.namenode.rpc-address property:
########## Performing 'GET' on (Site:hdfs-site, Tag:version1470359698835)
"properties" : {
"dfs.block.access.token.enable" : "true",
"dfs.blockreport.initialDelay" : "120",
"dfs.blocksize" : "134217728",
"dfs.client.block.write.replace-datanode-on-failure.enable" : "NEVER",
"dfs.client.failover.proxy.provider.belongcluster1" : "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
"dfs.client.read.shortcircuit" : "true",
"dfs.client.read.shortcircuit.streams.cache.size" : "4096",
"dfs.client.retry.policy.enabled" : "false",
"dfs.cluster.administrators" : " hdfs",
"dfs.content-summary.limit" : "5000",
"dfs.datanode.address" : "0.0.0.0:50010",
"dfs.datanode.balance.bandwidthPerSec" : "6250000",
"dfs.datanode.data.dir" : "/data/hadoop/hdfs/data",
"dfs.datanode.data.dir.perm" : "750",
"dfs.datanode.du.reserved" : "1073741824",
"dfs.datanode.failed.volumes.tolerated" : "0",
"dfs.datanode.http.address" : "0.0.0.0:50075",
"dfs.datanode.https.address" : "0.0.0.0:50475",
"dfs.datanode.ipc.address" : "0.0.0.0:8010",
"dfs.datanode.max.transfer.threads" : "16384",
"dfs.domain.socket.path" : "/var/lib/hadoop-hdfs/dn_socket",
"dfs.encrypt.data.transfer.cipher.suites" : "AES/CTR/NoPadding",
"dfs.encryption.key.provider.uri" : "",
"dfs.ha.automatic-failover.enabled" : "true",
"dfs.ha.fencing.methods" : "shell(/bin/true)",
"dfs.ha.namenodes.belongcluster1" : "nn1,nn2",
"dfs.heartbeat.interval" : "3",
"dfs.hosts.exclude" : "/etc/hadoop/conf/dfs.exclude",
"dfs.http.policy" : "HTTP_ONLY",
"dfs.https.port" : "50470",
"dfs.journalnode.edits.dir" : "/hadoop/hdfs/journal",
"dfs.journalnode.https-address" : "0.0.0.0:8481",
"dfs.namenode.accesstime.precision" : "0",
"dfs.namenode.acls.enabled" : "true",
"dfs.namenode.audit.log.async" : "true",
"dfs.namenode.avoid.read.stale.datanode" : "true",
"dfs.namenode.avoid.write.stale.datanode" : "true",
"dfs.namenode.checkpoint.dir" : "/tmp/hadoop/hdfs/namesecondary",
"dfs.namenode.checkpoint.edits.dir" : "${dfs.namenode.checkpoint.dir}",
"dfs.namenode.checkpoint.period" : "21600",
"dfs.namenode.checkpoint.txns" : "1000000",
"dfs.namenode.fslock.fair" : "false",
"dfs.namenode.handler.count" : "200",
"dfs.namenode.http-address" : "dh01.int.belong.com.au:50070",
"dfs.namenode.http-address.belongcluster1.nn1" : "dh01.int.belong.com.au:50070",
"dfs.namenode.http-address.belongcluster1.nn2" : "dh02.int.belong.com.au:50070",
"dfs.namenode.https-address" : "dh01.int.belong.com.au:50470",
"dfs.namenode.https-address.belongcluster1.nn1" : "dh01.int.belong.com.au:50470",
"dfs.namenode.https-address.belongcluster1.nn2" : "dh02.int.belong.com.au:50470",
"dfs.namenode.name.dir" : "/data/hadoop/hdfs/namenode",
"dfs.namenode.name.dir.restore" : "true",
"dfs.namenode.rpc-address.belongcluster1.nn1" : "dh01.int.belong.com.au:8020",
"dfs.namenode.rpc-address.belongcluster1.nn2" : "dh02.int.belong.com.au:8020",
"dfs.namenode.safemode.threshold-pct" : "0.99",
"dfs.namenode.shared.edits.dir" : "qjournal://dh03.int.belong.com.au:8485;dh02.int.belong.com.au:8485;dh01.int.belong.com.au:8485/belongcluster1",
"dfs.namenode.stale.datanode.interval" : "30000",
"dfs.namenode.startup.delay.block.deletion.sec" : "3600",
"dfs.namenode.write.stale.datanode.ratio" : "1.0f",
"dfs.nameservices" : "belongcluster1",
"dfs.permissions.enabled" : "true",
"dfs.permissions.superusergroup" : "hdfs",
"dfs.replication" : "3",
"dfs.replication.max" : "50",
"dfs.support.append" : "true",
"dfs.webhdfs.enabled" : "true",
"fs.permissions.umask-mode" : "022",
"nfs.exports.allowed.hosts" : "* rw",
"nfs.file.dump.dir" : "/tmp/.hdfs-nfs"
}
Are you suggesting that I keep just one NameNode service address and point it to the primary NameNode host:port? Something like the below:
<property>
  <name>dfs.namenode.rpc-address.belongcluster1</name>
  <value>dh01.int.belong.com.au:8020</value>
</property>
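As a quick cross-check of which rpc-address keys are actually in effect on a client (a sketch; hdfs getconf reads the local client configuration under /etc/hadoop/conf, not Ambari's desired config):
#Sketch: query the effective HA-related keys from the client-side configuration
hdfs getconf -confKey dfs.nameservices
hdfs getconf -confKey dfs.ha.namenodes.belongcluster1
hdfs getconf -confKey dfs.namenode.rpc-address.belongcluster1.nn1
hdfs getconf -confKey dfs.namenode.rpc-address.belongcluster1.nn2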
05-29-2017
02:39 AM
@mqureshi Regarding https://community.hortonworks.com/articles/4595/balancer-not-working-in-hdfs-ha.html: my hdfs-site.xml has two entries, and I am not sure whether I need to delete both or only the nn2 one.
<property>
  <name>dfs.namenode.rpc-address.belongcluster1.nn1</name>
  <value>dh01.int.belong.com.au:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.belongcluster1.nn2</name>
  <value>dh02.int.belong.com.au:8020</value>
</property>