Member since
06-01-2017
87
Posts
11
Kudos Received
2
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
985 | 11-24-2016 12:23 PM | |
869 | 10-13-2016 01:55 PM |
02-10-2017
07:45 AM
@Jay SenSharma no "MetaSpace" can't be found in my env. # ps -ef |grep ^ams|grep HMaster
ams 74758 74744 99 Feb09 ? 1-14:46:59 /usr/local/java/jdk1.8.0_101/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill -9 %p -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/ambari-metrics-collector/hs_err_pid%p.log -Djava.io.tmpdir=/usr/var/lib/ambari-metrics-collector/hbase-tmp -Djava.library.path=/usr/lib/ams-hbase/lib/hadoop-native/ -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ambari-metrics-collector/gc.log-201702090946 -Xms16384m -Xmx16384m -Xmn320m -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Dhbase.log.dir=/var/log/ambari-metrics-collector -Dhbase.log.file=hbase-ams-master-humepcomp117.log -Dhbase.home.dir=/usr/lib/ams-hbase/ -Dhbase.id.str=ams -Dhbase.root.logger=INFO,RFA -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.master.HMaster start
# jmap -heap 74758
Attaching to process ID 74758, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.101-b13
using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 17179869184 (16384.0MB)
NewSize = 335544320 (320.0MB)
MaxNewSize = 335544320 (320.0MB)
OldSize = 16844324864 (16064.0MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 301989888 (288.0MB)
used = 237059568 (226.07762145996094MB)
free = 64930320 (61.92237854003906MB)
78.49917411804199% used
Eden Space:
capacity = 268435456 (256.0MB)
used = 203505136 (194.07762145996094MB)
free = 64930320 (61.92237854003906MB)
75.81157088279724% used
From Space:
capacity = 33554432 (32.0MB)
used = 33554432 (32.0MB)
free = 0 (0.0MB)
100.0% used
To Space:
capacity = 33554432 (32.0MB)
used = 0 (0.0MB)
free = 33554432 (32.0MB)
0.0% used
concurrent mark-sweep generation:
capacity = 16844324864 (16064.0MB)
used = 10044724000 (9579.395294189453MB)
free = 6799600864 (6484.604705810547MB)
59.63268982936662% used
16850 interned Strings occupying 1609888 bytes.
... View more
02-10-2017
03:44 AM
@Jay SenSharma Thanks for your point. So post Ambari 2.4.0.1 , it is same MOVE AMS in Web UI or follow the steps in documentory?
... View more
02-10-2017
03:37 AM
currently, I had extend the datanodes on nodemanagers alreay by installing DISKs. so it is 40 datanodes and 19 Nodemanagers now. but still have the same issue. Is one datanode to one nodemanager as best practice?
... View more
02-10-2017
03:28 AM
I'd like to share a trap of MOVE AMS. In Ambari 2.4.0.1, There is "Move Metrics Collector" in the Web UI. And I had attempted, the trap is during the wizard, it not mentioned the whole cluster will restart, but it do restarted. ...... later, I found http://docs.hortonworks.com/HDPDocuments/Ambari-2.4.0.1/bk_ambari-user-guide/content/moving_the_metrics_collector.html But not try it, and don't know if this also will make cluster restart.
... View more
02-10-2017
03:19 AM
Hi all, my last change on following seems solved issue, the alter not happened for more than 24 hours. ams-hbase-env::hbase_master_heapsize
1152 MB ===>> 8192 MB
ams-hbase-env::hbase_master_maxperm_size
128 MB ===>> 8192 MB
ams-hbase-env::hbase_regionserver_heapsize
768 MB ===>> 8192 MB
ams-hbase-env::regionserver_xmn_size
128 MB ===>> 256 MB ams-site::phoenix.query.maxGlobalMemoryPercentage
25 ===>> 30
ams-site::timeline.metrics.cluster.aggregate.splitpoints
mapred.ShuffleMetrics.ShuffleOutputsFailed ===>> dfs.namenode.FilesInGetListingOps,kafka.network.RequestMetrics.RequestsPerSec.request.LeaderAndIsr.count,master.Master.numActiveHandler,regionserver.Server.checkMutatePassedCount
ams-site::timeline.metrics.host.aggregate.splitpoints
mapred.ShuffleMetrics.ShuffleOutputsFailed ===>> EventDrainSuccessCount,bytes_out,default.StartupProgress.LoadingEditsPercentComplete,dfs.datanode.BlockChecksumOpAvgTime,dfs.datanode.ReplaceBlockOpAvgTime,dfs.namenode.SyncsNumOps,ipc.IPC.QueueCallTime_num_ops,jvm.JvmMetrics.MemHeapMaxM,kafka.controller.ControllerStats.LeaderElectionRateAndTimeMs.95percentile,kafka.network.RequestMetrics.RequestsPerSec.request.ConsumerMetadata.5MinuteRate,kafka.network.RequestMetrics.RequestsPerSec.request.JoinGroup.5MinuteRate,kafka.network.RequestMetrics.RequestsPerSec.request.Produce.5MinuteRate,kafka.network.SocketServer.IdlePercent.networkProcessor.2.meanRate,load_one,master.Balancer.BalancerCluster_median,master.FileSystem.MetaHlogSplitTime_75th_percentile,master.Server.averageLoad,metricssystem.MetricsSystem.NumActiveSources,read_bps,regionserver.Replication.sink.ageOfLastAppliedOp,regionserver.Server.Increment_75th_percentile,regionserver.Server.SplitTime_max,regionserver.Server.splitSuccessCount,rpc.rpc.SentBytes,rpcdetailed.rpcdetailed.GetQueueUserAclsAvgTime,write_bytes
... View more
02-09-2017
01:50 AM
@icocio my cluster have 40 hosts, I now tuned parameters as >50 , I will monitor if this will help my problem.
... View more
02-09-2017
01:41 AM
40 hosts in my cluster @swagle
... View more
02-09-2017
01:37 AM
Would anyone have a look at this post? Actully, the problem from following post. https://community.hortonworks.com/questions/76973/hive-performance-bad-on-higher-configured-cluster.html#comment-76998
... View more
02-07-2017
02:02 PM
Apache Ambari
Version2.4.0.1 The alterts indicated CRITIAL because of Metrics Collector has been auto-started 5 times since <timestamp> This happened very very frequently. The mainly log ambari-metrics-collector.loglike below 2017-02-07 11:48:42,465 WARN org.apache.zookeeper.ClientCnxn: Session 0x15a1698abc40001 for server humepcomp117.huawei.com/10.106.134.117:61181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
.
.
2017-02-07 11:48:45,336 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=humepcomp117.huawei.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/table/SYSTEM.CATALOG
2017-02-07 11:48:45,336 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=humepcomp117.huawei.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure
.
.
2017-02-07 11:49:02,678 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 4 attempts
2017-02-07 11:49:02,678 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: hconnection-0x433defed-0x15a1698abc40001, quorum=humepcomp117.huawei.com:61181, baseZNode=/ams-hbase-unsecure Unable to get data of znode /ams-hbase-unsecure/table/METRIC_RECORD
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/table/METRIC_RECORD
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:622)
at org.apache.hadoop.hbase.zookeeper.ZKTableStateClientSideReader.getTableState(ZKTableStateClientSideReader.java:185)
at org.apache.hadoop.hbase.zookeeper.ZKTableStateClientSideReader.isDisabledTable(ZKTableStateClientSideReader.java:59)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.isTableOnlineState(ZooKeeperRegistry.java:127)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.isTableDisabled(ConnectionManager.java:960)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1129)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:298)
at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:150)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:376)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
and hbase-ams-master-humepcomp117.log 2017-02-07 11:48:43,542 INFO [main] zookeeper.ZooKeeper: Initiating client connection, connectString=humepcomp117.huawei.com:61181 sessionTimeout=120000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@4450d156
2017-02-07 11:48:43,583 INFO [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server humepcomp117.huawei.com/10.106.134.117:61181. Will not attempt to authenticate using SASL (unknown error)
2017-02-07 11:48:43,592 WARN [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2017-02-07 11:48:43,712 WARN [main] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=humepcomp117.huawei.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
2017-02-07 11:48:44,702 INFO [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server humepcomp117.huawei.com/10.106.134.117:61181. Will not attempt to authenticate using SASL (unknown error)
2017-02-07 11:48:44,704 WARN [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2017-02-07 11:48:44,805 WARN [main] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=humepcomp117.huawei.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
2017-02-07 11:48:44,805 ERROR [main] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 1 attempts
2017-02-07 11:48:44,805 WARN [main] zookeeper.ZKUtil: clean znode for master0x0, quorum=humepcomp117.huawei.com:61181, baseZNode=/ams-hbase-unsecure Unable to get data of znode /ams-hbase-unsecure/master
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:712)
at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.deleteIfEquals(MasterAddressTracker.java:267)
at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:149)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2529)
2017-02-07 11:48:44,807 ERROR [main] zookeeper.ZooKeeperWatcher: clean znode for master0x0, quorum=humepcomp117.huawei.com:61181, baseZNode=/ams-hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:712)
at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.deleteIfEquals(MasterAddressTracker.java:267)
at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:149)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2529)
2017-02-07 11:48:44,809 WARN [main] zookeeper.ZooKeeperNodeTracker: Can't get or delete the master znode
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:712)
at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.deleteIfEquals(MasterAddressTracker.java:267)
at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:149)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2529)
2017-02-07 11:48:45,805 INFO [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server humepcomp117.huawei.com/10.106.134.117:61181. Will not attempt to authenticate using SASL (unknown error)
2017-02-07 11:48:45,911 INFO [main] zookeeper.ZooKeeper: Session: 0x0 closed
2017-02-07 11:48:45,911 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down Due to networklimitations of the company, I can't paste or upload full log file, I will try to paste in Follow-up posts. Please kindly help to solve this problem.Thanks. I had tryied the best answer in https://community.hortonworks.com/questions/48107/ambari-metrics-collector.html , but no help. Thanks for your great support.
... View more
Labels:
- Labels:
-
Apache Ambari
02-07-2017
07:30 AM
I have two clusters, UAT and PROD. The UAT have more less resources than PROD. But I notice that there almost have no PENDING stage on UAT when run HIVE QL, while the PENDING containers for a little long time on PROD like below: hive> select count(1) from humep.ems_barcode_material_ption_h;
Query ID = root_20170111172857_3f3057c0-a819-4b2d-9881-9915f2e80216
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1483672680049_59226)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 ... RUNNING 1405 494 108 803 0 0
Reducer 2 INITED 1 0 0 1 0 0
--------------------------------------------------------------------------------
VERTICES: 00/02 [=========>>-----------------] 35% ELAPSED TIME: 36.68 s
--------------------------------------------------------------------------------
Is there any method that can promote the parallel RUNNING ? I tried
set hive.exec.parallel=true; set hive.exec.parallel.thread.number=8; but no effect. Thanks for your great help and support
... View more
Labels:
- Labels:
-
Apache Hive
-
Apache YARN