Member since 12-01-2016 | 25 Posts | 1 Kudos Received | 1 Solution
08-04-2017
12:38 PM
@Jay SenSharma Thanks for getting back on this. The Ambari Agent details are below:
]$ ambari-agent --version
2.2.1.0
]$ rpm -qa|grep ambari-agent
ambari-agent-2.2.1.0-161.x86_64
It does seem that the issue described in the Jira is relevant to what occurred here. So far the issue has occurred only once, but migrating looks like a good option to avoid it in future. Also, I had indicated that Namenode CPU WIO was N/A; after a few hours I am able to see the metric on the Dashboard.
08-04-2017
10:58 AM
The issue started with an Alert on Hive Metastore Service: Metastore on dh01.int.belong.com.au failed (Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/alerts/alert_hive_metastore.py", line 183, in execute
timeout=int(check_command_timeout) )
File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
self.env.run()
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 158, in run
self.run_action(resource, action)
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 121, in run_action
provider_action()
File "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py", line 238, in action_run
tries=self.resource.tries, try_sleep=self.resource.try_sleep)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
result = function(command, **kwargs)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
tries=tries, try_sleep=try_sleep)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
raise Fail(err_msg)
Fail: Execution of 'export HIVE_CONF_DIR='/usr/hdp/current/hive-metastore/conf/conf.server' ; hive --hiveconf hive.metastore.uris=thrift://dh01.int.belong.com.au:9083 --hiveconf hive.metastore.client.connect.retry.delay=1 --hiveconf hive.metastore.failure.retries=1 --hiveconf hive.metastore.connect.retries=1 --hiveconf hive.metastore.client.socket.timeout=14 --hiveconf hive.execution.engine=mr -e 'show databases;'' returned 5.
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000002c0000000, 977797120, 0) failed; error='Cannot allocate memory' (errno=12)
Unable to determine Hadoop version information.
'hadoop version' returned:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000002c0000000, 977797120, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 977797120 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ambari-qa/hs_err_pid4858.log
)
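The alert above checks the Metastore by launching the full hive CLI, which itself needs a JVM and so fails when the host is out of memory. A lighter first check is a plain TCP connect to the Thrift port. The following is a minimal sketch, not what Ambari does; the function name `port_is_open` is hypothetical, and a successful connect only shows that something is listening, not that the Metastore is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

With the host and port from the alert text, the call would be `port_is_open("dh01.int.belong.com.au", 9083)`.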
I tried launching Hive from the command prompt (sudo hive); this errored out with a Java Runtime Environment exception. Then I looked at memory utilization, which indicated that swap had run out:
]$ free -m
             total       used       free     shared    buffers     cached
Mem:         64560      63952        607          0         77        565
-/+ buffers/cache:      63309       1251
Swap:         1023       1023          0
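To watch for this condition before it triggers alerts, the `free -m` output can be parsed and the swap usage computed. This is a rough sketch that assumes the older procps layout shown in this post (with the "-/+ buffers/cache" line); the function names are hypothetical.

```python
def parse_free_m(output: str) -> dict:
    """Parse `free -m` output into {"mem": {...}, "swap": {...}} (values in MB)."""
    stats = {}
    for line in output.splitlines():
        # Split on the first colon; tolerate optional whitespace before it.
        label, sep, rest = line.partition(":")
        key = label.strip().lower()
        if not sep or key not in ("mem", "swap"):
            continue  # header line or the "-/+ buffers/cache" line
        total, used, free = (int(v) for v in rest.split()[:3])
        stats[key] = {"total": total, "used": used, "free": free}
    return stats


def swap_used_pct(stats: dict) -> float:
    """Percentage of swap in use; 0.0 when no swap is configured."""
    swap = stats["swap"]
    return 100.0 * swap["used"] / swap["total"] if swap["total"] else 0.0


# Sample copied from the output in this post.
SAMPLE = """\
             total       used       free     shared    buffers     cached
Mem:         64560      63952        607          0         77        565
-/+ buffers/cache:      63309       1251
Swap:         1023       1023          0
"""

if __name__ == "__main__":
    print(swap_used_pct(parse_free_m(SAMPLE)))
```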
I tried to restart the Hive Metastore service from Ambari, but the operation hung for over 30 minutes without printing anything to the stdout and stderr logs. At this point I involved the server administrator in the investigation, and it was revealed that the following process had reserved up to 40 GB. This seemed strange (I am not sure what the optimal utilization pattern for the Ambari Agent/Monitor is):
root 3424 3404 14 2016 ? 52-22:05:00 /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/main.py start
I then tried to restart the Ambari Metrics service on the name node from Ambari; the operation timed out and the heartbeat from the node stopped, as can be seen in the image. I was not able to restart the Ambari Metrics service on the name node from the Ambari console, as the option was disabled. I tried to do a rolling restart of all Ambari Monitor services, but the Monitor service on the name node did not start.
At this point we decided to do two things: add more swap space (the admin added 1 GB more), and stop and start the Ambari services as follows:
#Stop operation did not succeed at the first go and I had to kill the PID
sudo su - ams -c '/usr/sbin/ambari-metrics-monitor --config /etc/ambari-metrics-monitor/conf stop'
sudo su - ams -c '/usr/sbin/ambari-metrics-monitor --config /etc/ambari-metrics-monitor/conf start'
#I looked at the agent status
sudo ambari-agent status
#The agent was not running, hence I started the agent
sudo ambari-agent start
After the agent started, the monitor on this node came up and was reflected in Ambari. The only issue I have now is that Namenode CPU WIO shows N/A on the Ambari Dashboard. It would be helpful to know how to get this back. I also intend to review the HiveServer2 and Metastore heap sizes, which currently stand at the values below. Could these settings cause swap to run out like this? This has not happened before.
HiveServer2 Heap Size = 20480 MB
Metastore Heap Size = 12288 MB
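As a back-of-the-envelope check, the configured heaps can be added to the agent's reported footprint and compared with RAM plus swap. The sketch below uses only figures from this post; the 40 GB for the agent is the approximate value the administrator reported, and a maximum heap size is an upper bound rather than memory actually in use, so this shows only the worst case.

```python
# Figures from the post, all in MB.
CONSUMERS_MB = {
    "hiveserver2_heap": 20480,
    "metastore_heap": 12288,
    "ambari_agent": 40 * 1024,  # approximate: "up to 40 GB" as reported
}
RAM_MB = 64560   # "total" column of the Mem row in free -m
SWAP_MB = 1023   # "total" column of the Swap row in free -m


def memory_budget(consumers: dict, ram_mb: int, swap_mb: int) -> dict:
    """Compare worst-case demand with RAM + swap; all values in MB."""
    demand = sum(consumers.values())
    capacity = ram_mb + swap_mb
    return {
        "demand_mb": demand,
        "capacity_mb": capacity,
        "shortfall_mb": max(0, demand - capacity),
    }


if __name__ == "__main__":
    print(memory_budget(CONSUMERS_MB, RAM_MB, SWAP_MB))
```

Under these assumptions the worst-case demand is 73728 MB against 65583 MB of RAM plus swap, a shortfall of 8145 MB, before counting any other process on the host.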
Environment Information:
Hadoop 2.7.1.2.4.0.0-169
hive-meta-store - 2.4.0.0-169
hive-server2 - 2.4.0.0-169
hive-webhcat - 2.4.0.0-169
Ambari 2.2.1.0
RAM: 64 GB
Helpful links:
https://community.hortonworks.com/questions/15862/how-can-i-start-my-ambari-heartbeat.html
https://cwiki.apache.org/confluence/display/AMBARI/Metrics
Labels: Apache Ambari
07-21-2017
07:32 AM
@mqureshi Thanks for getting back. I have reduced the HiveServer2 Heap Size to 20 GB and am observing the behavior. I intend to reduce it to 12 GB stepwise over the coming days.
07-19-2017
07:48 AM
I am facing Hive errors intermittently.
The logs indicate garbage collection issues:
hiveserver2: @dh01 hive]$ cat hiveserver2.log | grep 'GC'
at org.apache.tez.dag.api.client.DAGClientHandler.submitDAG(DAGClientHandler.java:118)
at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.submitDAG(DAGClientAMProtocolBlockingPBServerImpl.java:163)
at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7471)
2017-07-17 14:00:22,815 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@59fc6d05]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1913ms
GC pool 'PS Scavenge' had collection(s): count=1 time=1961ms
2017-07-17 14:14:28,531 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@59fc6d05]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1452ms
GC pool 'PS Scavenge' had collection(s): count=1 time=1701ms
2017-07-17 15:04:32,309 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@59fc6d05]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1838ms
GC pool 'PS Scavenge' had collection(s): count=1 time=2195ms
2017-07-17 16:08:45,121 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@59fc6d05]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1568ms
GC pool 'PS Scavenge' had collection(s): count=1 time=1707ms
hivemetastore: @dh01 hive]$ cat hivemetastore.log | grep -i "GC pool"
GC pool 'PS Scavenge' had collection(s): count=1 time=3521ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=11097ms
GC pool 'PS Scavenge' had collection(s): count=1 time=37ms
@dh01 hive]$ cat hivemetastore.log | grep -i "JvmPauseMonitor"
2017-07-19 04:26:50,008 INFO [org.apache.hadoop.util.JvmPauseMonitor$Monitor@4f85aca0]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 3050ms
2017-07-19 11:01:32,392 WARN [org.apache.hadoop.util.JvmPauseMonitor$Monitor@4f85aca0]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(191)) - Detected pause in JVM or host machine (eg GC): pause of approximately 10915ms
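Instead of grepping by hand, the JvmPauseMonitor lines can be extracted and filtered by pause length. This is a minimal sketch that assumes the log line layout shown in the excerpts above; the regular expression and the function name are illustrative only.

```python
import re

# Matches lines such as:
# 2017-07-19 11:01:32,392 WARN [...]: ... pause of approximately 10915ms
PAUSE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+(?P<level>INFO|WARN)\b.*"
    r"pause of approximately (?P<ms>\d+)ms"
)


def extract_pauses(log_text: str, min_ms: int = 0) -> list:
    """Return (timestamp, level, pause_ms) for pauses of at least min_ms."""
    pauses = []
    for line in log_text.splitlines():
        match = PAUSE_RE.search(line)
        if match and int(match.group("ms")) >= min_ms:
            pauses.append((match.group("ts"), match.group("level"), int(match.group("ms"))))
    return pauses


if __name__ == "__main__":
    # The file name is the one used in this post; adjust the path as needed.
    with open("hivemetastore.log", encoding="utf-8", errors="replace") as fh:
        for ts, level, ms in extract_pauses(fh.read(), min_ms=5000):
            print(ts, level, ms)
```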
HiveServer2 Heap Size = 24210 MB (had been set already)
Metastore Heap Size = 12288 MB (changed from 8 GB previously)
Client Heap Size = 2 GB (changed from 1 GB previously)
I read the article below and the links it provides, which were helpful:
https://community.hortonworks.com/articles/14170/namenode-garbage-collection-configuration-best-pra.html
However, after changing the heap sizes as indicated, I still had instances where the HiveServer2 or Metastore service would go on alert in Ambari for a few seconds and then come back healthy. The following logs did not contain any errors for these instances:
hive.out hive.log hive-server2.out hive-server2.log hivemetastore.log hiveserver2.log
Am I missing something? Would setting the HiveServer2 Heap Size and the Metastore Heap Size to the same value help, i.e. setting HiveServer2 Heap Size = 12288 MB?
Environment:
Hadoop 2.7.1.2.4.0.0-169
hive-meta-store - 2.4.0.0-169
hive-server2 - 2.4.0.0-169
hive-webhcat - 2.4.0.0-169
Ambari 2.2.1.0
Labels: Apache Hive
06-29-2017
11:21 PM
Hi @ssathish, I did look at the link you posted and decided to delete the file.
CAUTION:
For some reason, a few hours later there were inconsistencies in the cluster. One of the data nodes (D5), where the cleanup was done, showed corruption in the way containers were processed. Some jobs for which containers were launched on D5 ran to completion successfully, while other jobs failed with a "Vertex failed" error. We could not find any errors in the ResourceManager, DataNode or NodeManager logs. We had to remove D5 from the cluster and reinstall the NodeManager to set things right.
06-26-2017
02:29 AM
I have a disk running full on one of my data nodes:
[ayguha@dh03 hadoop]$ sudo du -h --max-depth=1
674G ./hdfs
243G ./yarn
916G .
[xx@dh03 local]$ sudo du -h --max-depth=1
1.4G ./filecache
3.2G ./usercache
68K ./nmPrivate
242G .
There are over 1k tmp files accumulating in /data/hadoop/yarn/local:
[ayguha@dh03 local]$ ls -l *.tmp | wc -l
1055
./optimized-preview-record-buffer-2808068b-4d54-492e-a31a-385065d25a408826610818023522318.tmp
./preview-record-buffer-24a7477f-01f0-427e-a032-54866df48b197825057363055390034.tmp
./preview-record-buffer-b22020bb-6ec2-4f73-9d65-65dbba50136e527236496621902098.tmp
[ayguha@dh03 local]$ find ./*preview-record-buffer* -type f -mtime +90 | wc -l
973
There are nearly 1k files that are older than 3 months. Is it safe to delete these files?
ENV:
Hadoop 2.7.1.2.4.0.0-169
HDP 2.4
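Before deciding anything, it may help to see how much space the old files actually hold. The sketch below only lists candidates and deletes nothing; the function name, the glob pattern and the 90-day threshold are assumptions modelled on the `find` command above. It treats any file older than 90 days as stale, which differs slightly from `find -mtime +90`, since that counts whole 24-hour periods.

```python
import time
from pathlib import Path


def stale_files(directory, pattern="*preview-record-buffer*.tmp", min_age_days=90, now=None):
    """Return sorted (path, size_bytes) for regular files older than min_age_days."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    found = []
    for path in Path(directory).glob(pattern):
        if path.is_file():
            st = path.stat()
            if st.st_mtime < cutoff:
                found.append((path, st.st_size))
    return sorted(found)


if __name__ == "__main__":
    # Directory taken from the post; this is a read-only listing.
    files = stale_files("/data/hadoop/yarn/local")
    total_mb = sum(size for _, size in files) / (1024 * 1024)
    print(f"{len(files)} files, {total_mb:.1f} MB")
```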
Labels: Apache YARN
05-29-2017
06:17 AM
@mqureshi
The cluster currently has only one active name node.
Is there a better way to find out which node is the active one?
I also tried the following, but the output does not distinguish between the two:
dh01 ~]$ curl --user admin:admin http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/host_components?HostRoles/component_name=NAMENODE&metrics/dfs/FSNamesystem/HAState=active
[1] 16533
-bash: metrics/dfs/FSNamesystem/HAState=active: No such file or directory
[ayguha@dh01 ~]$ {
"href" : "http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/host_components?HostRoles/component_name=NAMENODE",
"items" : [
{
"href" : "http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/hosts/dh01.int.belong.com.au/host_components/NAMENODE",
"HostRoles" : {
"cluster_name" : "belong1",
"component_name" : "NAMENODE",
"host_name" : "dh01.int.belong.com.au"
},
"host" : {
"href" : "http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/hosts/dh01.int.belong.com.au"
}
},
{
"href" : "http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/hosts/dh02.int.belong.com.au/host_components/NAMENODE",
"HostRoles" : {
"cluster_name" : "belong1",
"component_name" : "NAMENODE",
"host_name" : "dh02.int.belong.com.au"
},
"host" : {
"href" : "http://dh01.int.belong.com.au:8080/api/v1/clusters/belong1/hosts/dh02.int.belong.com.au"
}
}
]
}
Also hdfs-site.xml does not have the property dfs.namenode.rpc-address.
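In the shell output above, the `[1] 16533` job line and the "No such file or directory" message show that bash treated the unquoted `&` as a background operator, so the HAState predicate never reached Ambari; quoting the URL avoids that. As an alternative to the Ambari API, each NameNode can be asked for its own state over JMX. The sketch below assumes the NameNode web UI exposes the standard `/jmx` servlet with the `NameNodeStatus` bean and its `State` attribute; the function names are hypothetical, and the host:port values would come from the `dfs.namenode.http-address.*` properties.

```python
import json
from urllib.request import urlopen

# Assumption: standard Hadoop JMX servlet and NameNodeStatus bean.
JMX_QUERY = "/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"


def state_from_jmx(payload: str):
    """Extract the HA state (e.g. "active" or "standby") from a JMX reply."""
    beans = json.loads(payload).get("beans", [])
    return beans[0].get("State") if beans else None


def namenode_states(http_addresses, timeout=5):
    """Query each NameNode HTTP address and map it to its reported state."""
    states = {}
    for addr in http_addresses:
        try:
            with urlopen(f"http://{addr}{JMX_QUERY}", timeout=timeout) as resp:
                states[addr] = state_from_jmx(resp.read().decode("utf-8"))
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            states[addr] = f"unreachable ({exc})"
    return states
```

A call such as `namenode_states(["dh01.int.belong.com.au:50070", "dh02.int.belong.com.au:50070"])` would then return one state per host.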
05-29-2017
05:36 AM
@mqureshi I tried the command directly, without pushing it to the background:
sudo -u hdfs hdfs balancer -fs hdfs://belongcluster1:8020 -threshold 5
[ayguha@dh01 ~]$ sudo -u hdfs hdfs balancer -fs hdfs://belongcluster1:8020 -threshold 5
17/05/29 15:29:39 INFO balancer.Balancer: Using a threshold of 5.0
17/05/29 15:29:39 INFO balancer.Balancer: namenodes = [hdfs://belongcluster1, hdfs://belongcluster1:8020]
17/05/29 15:29:39 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 5.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false]
17/05/29 15:29:39 INFO balancer.Balancer: included nodes = []
17/05/29 15:29:39 INFO balancer.Balancer: excluded nodes = []
17/05/29 15:29:39 INFO balancer.Balancer: source nodes = []
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
17/05/29 15:29:41 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
17/05/29 15:29:41 INFO block.BlockTokenSecretManager: Setting block keys
17/05/29 15:29:41 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
17/05/29 15:29:42 INFO block.BlockTokenSecretManager: Setting block keys
17/05/29 15:29:42 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
17/05/29 15:29:42 INFO block.BlockTokenSecretManager: Setting block keys
17/05/29 15:29:42 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
java.io.IOException: Another Balancer is running.. Exiting ...
May 29, 2017 3:29:42 PM Balancing took 3.035 seconds
Error:
17/05/29 15:29:42 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
java.io.IOException: Another Balancer is running.. Exiting ...
I also checked whether a balancer process is stuck. From the output, it does not look like anything is hanging from previous tries:
dh01 ~]$ ps -ef | grep "balancer"
ayguha 4611 2551 0 15:34 pts/0 00:00:00 grep balancer
dh01 ~]$ hdfs dfs -ls /system/balancer.id
ls: `/system/balancer.id': No such file or directory
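One possible reading of the log above, consistent with the linked HA balancer article but not confirmed in this thread, is that the balancer sees two URIs for the same nameservice (`hdfs://belongcluster1` and `hdfs://belongcluster1:8020`) and treats them as two NameNodes, so the second one finds the lock already taken. The sketch below only detects that pattern in the `namenodes = [...]` log line; the function names are hypothetical.

```python
import re
from urllib.parse import urlparse


def balancer_namenodes(log_text: str) -> list:
    """Return the URIs from the balancer's 'namenodes = [...]' log line."""
    match = re.search(r"namenodes\s*=\s*\[(.*?)\]", log_text)
    return [uri.strip() for uri in match.group(1).split(",")] if match else []


def duplicate_nameservices(uris) -> dict:
    """Group URIs by host/nameservice and keep groups with more than one URI."""
    by_host = {}
    for uri in uris:
        by_host.setdefault(urlparse(uri).hostname, []).append(uri)
    return {host: group for host, group in by_host.items() if len(group) > 1}


if __name__ == "__main__":
    line = ("17/05/29 15:29:39 INFO balancer.Balancer: namenodes = "
            "[hdfs://belongcluster1, hdfs://belongcluster1:8020]")
    print(duplicate_nameservices(balancer_namenodes(line)))
```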
05-29-2017
03:40 AM
@mqureshi
I found another thread with a similar issue:
https://community.hortonworks.com/questions/22105/hdfs-balancer-is-getting-failed-after-30-mins-in-a.html
There they indicate that if HA is enabled, one needs to remove dfs.namenode.rpc-address.
I ran a check on the Ambari Server using configs.sh, and the output does not contain the dfs.namenode.rpc-address property:
/var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin -port 8080 get dh01.int.belong.com.au belong1 hdfs-site
########## Performing 'GET' on (Site:hdfs-site, Tag:version1470359698835)
"properties" : {
"dfs.block.access.token.enable" : "true",
"dfs.blockreport.initialDelay" : "120",
"dfs.blocksize" : "134217728",
"dfs.client.block.write.replace-datanode-on-failure.enable" : "NEVER",
"dfs.client.failover.proxy.provider.belongcluster1" : "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
"dfs.client.read.shortcircuit" : "true",
"dfs.client.read.shortcircuit.streams.cache.size" : "4096",
"dfs.client.retry.policy.enabled" : "false",
"dfs.cluster.administrators" : " hdfs",
"dfs.content-summary.limit" : "5000",
"dfs.datanode.address" : "0.0.0.0:50010",
"dfs.datanode.balance.bandwidthPerSec" : "6250000",
"dfs.datanode.data.dir" : "/data/hadoop/hdfs/data",
"dfs.datanode.data.dir.perm" : "750",
"dfs.datanode.du.reserved" : "1073741824",
"dfs.datanode.failed.volumes.tolerated" : "0",
"dfs.datanode.http.address" : "0.0.0.0:50075",
"dfs.datanode.https.address" : "0.0.0.0:50475",
"dfs.datanode.ipc.address" : "0.0.0.0:8010",
"dfs.datanode.max.transfer.threads" : "16384",
"dfs.domain.socket.path" : "/var/lib/hadoop-hdfs/dn_socket",
"dfs.encrypt.data.transfer.cipher.suites" : "AES/CTR/NoPadding",
"dfs.encryption.key.provider.uri" : "",
"dfs.ha.automatic-failover.enabled" : "true",
"dfs.ha.fencing.methods" : "shell(/bin/true)",
"dfs.ha.namenodes.belongcluster1" : "nn1,nn2",
"dfs.heartbeat.interval" : "3",
"dfs.hosts.exclude" : "/etc/hadoop/conf/dfs.exclude",
"dfs.http.policy" : "HTTP_ONLY",
"dfs.https.port" : "50470",
"dfs.journalnode.edits.dir" : "/hadoop/hdfs/journal",
"dfs.journalnode.https-address" : "0.0.0.0:8481",
"dfs.namenode.accesstime.precision" : "0",
"dfs.namenode.acls.enabled" : "true",
"dfs.namenode.audit.log.async" : "true",
"dfs.namenode.avoid.read.stale.datanode" : "true",
"dfs.namenode.avoid.write.stale.datanode" : "true",
"dfs.namenode.checkpoint.dir" : "/tmp/hadoop/hdfs/namesecondary",
"dfs.namenode.checkpoint.edits.dir" : "${dfs.namenode.checkpoint.dir}",
"dfs.namenode.checkpoint.period" : "21600",
"dfs.namenode.checkpoint.txns" : "1000000",
"dfs.namenode.fslock.fair" : "false",
"dfs.namenode.handler.count" : "200",
"dfs.namenode.http-address" : "dh01.int.belong.com.au:50070",
"dfs.namenode.http-address.belongcluster1.nn1" : "dh01.int.belong.com.au:50070",
"dfs.namenode.http-address.belongcluster1.nn2" : "dh02.int.belong.com.au:50070",
"dfs.namenode.https-address" : "dh01.int.belong.com.au:50470",
"dfs.namenode.https-address.belongcluster1.nn1" : "dh01.int.belong.com.au:50470",
"dfs.namenode.https-address.belongcluster1.nn2" : "dh02.int.belong.com.au:50470",
"dfs.namenode.name.dir" : "/data/hadoop/hdfs/namenode",
"dfs.namenode.name.dir.restore" : "true",
"dfs.namenode.rpc-address.belongcluster1.nn1" : "dh01.int.belong.com.au:8020",
"dfs.namenode.rpc-address.belongcluster1.nn2" : "dh02.int.belong.com.au:8020",
"dfs.namenode.safemode.threshold-pct" : "0.99",
"dfs.namenode.shared.edits.dir" : "qjournal://dh03.int.belong.com.au:8485;dh02.int.belong.com.au:8485;dh01.int.belong.com.au:8485/belongcluster1",
"dfs.namenode.stale.datanode.interval" : "30000",
"dfs.namenode.startup.delay.block.deletion.sec" : "3600",
"dfs.namenode.write.stale.datanode.ratio" : "1.0f",
"dfs.nameservices" : "belongcluster1",
"dfs.permissions.enabled" : "true",
"dfs.permissions.superusergroup" : "hdfs",
"dfs.replication" : "3",
"dfs.replication.max" : "50",
"dfs.support.append" : "true",
"dfs.webhdfs.enabled" : "true",
"fs.permissions.umask-mode" : "022",
"nfs.exports.allowed.hosts" : "* rw",
"nfs.file.dump.dir" : "/tmp/.hdfs-nfs"
}
Are you suggesting that I keep just one NameNode service address and point it to the primary name node's host:port? Something like the below:
<property>
<name>dfs.namenode.rpc-address.belongcluster1</name>
<value>dh01.int.belong.com.au:8020</value>
</property>
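To make the check against the configs.sh output less error-prone than reading it by eye, the properties can be loaded into a dictionary and inspected. This is a sketch under the assumption that the "properties" block has already been parsed into a Python dict; the function names are hypothetical. It lists the per-NameNode RPC addresses and reports any `dfs.namenode.rpc-address` key that lacks a NameNode ID suffix.

```python
def ha_rpc_addresses(props: dict) -> dict:
    """Map nameservice -> {namenode_id: rpc_address} from hdfs-site properties."""
    result = {}
    for ns in (s.strip() for s in props.get("dfs.nameservices", "").split(",")):
        if not ns:
            continue
        raw_ids = props.get(f"dfs.ha.namenodes.{ns}", "")
        ids = [i.strip() for i in raw_ids.split(",") if i.strip()]
        result[ns] = {i: props.get(f"dfs.namenode.rpc-address.{ns}.{i}") for i in ids}
    return result


def non_ha_rpc_keys(props: dict) -> list:
    """Return rpc-address keys that are not of the form <key>.<nameservice>.<nn-id>."""
    ha_keys = {
        f"dfs.namenode.rpc-address.{ns}.{nn}"
        for ns, nns in ha_rpc_addresses(props).items()
        for nn in nns
    }
    return sorted(
        k for k in props
        if k.startswith("dfs.namenode.rpc-address") and k not in ha_keys
    )


# Subset of the properties shown in this post.
SAMPLE_PROPS = {
    "dfs.nameservices": "belongcluster1",
    "dfs.ha.namenodes.belongcluster1": "nn1,nn2",
    "dfs.namenode.rpc-address.belongcluster1.nn1": "dh01.int.belong.com.au:8020",
    "dfs.namenode.rpc-address.belongcluster1.nn2": "dh02.int.belong.com.au:8020",
}
```

For the properties in this post, `non_ha_rpc_keys(SAMPLE_PROPS)` returns an empty list, which matches the observation that the plain property is absent.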
05-29-2017
02:39 AM
@mqureshi About https://community.hortonworks.com/articles/4595/balancer-not-working-in-hdfs-ha.html: my hdfs-site.xml has two entries, and I am not sure whether I need to delete both or only the nn2 one.
<property>
<name>dfs.namenode.rpc-address.belongcluster1.nn1</name>
<value>dh01.int.belong.com.au:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.belongcluster1.nn2</name>
<value>dh02.int.belong.com.au:8020</value>
</property>