About JLo_Hernandez

JLo_Hernandez · ‎10-28-2019

Hello Shelton. I've followed entire given steps and service still is not coming up. Below attached outputs. From: /var/lib/ambari-agent/data/errors-31848.txt resource_management.core.exceptions.ExecuteTimeoutException: Execution of 'ambari-sudo.sh su yarnats -l -s /bin/bash -c 'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin:/var/lib/ambari-agent:/var/lib/ambari-agent'"'"' ; sleep 10;export HBASE_CLASSPATH_PREFIX=/usr/hdp/3.1.0.0-78/hadoop-yarn/timelineservice/*; /usr/hdp/3.1.0.0-78/hbase/bin/hbase --config /usr/hdp/3.1.0.0-78/hadoop/conf/embedded-yarn-ats-hbase org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -Dhbase.client.retries.number=35 -create -s'' was killed due timeout after 300 seconds From: /var/lib/ambari-agent/data/output-31848.txt 2019-10-22 16:41:17,992 WARN [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854744889 2019-10-22 16:41:17,992 WARN [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854735924 2019-10-22 16:41:17,992 WARN [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854772106 2019-10-22 16:41:17,992 WARN [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854768736 2019-10-22 16:41:17,992 WARN [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854749025 ==> /usr/logs/hadoop-yarn/embedded-yarn-ats-hbase/gc.log-201910110639 <== Java HotSpot(TM) 64-Bit Server VM (25.60-b23) for linux-amd64 JRE (1.8.0_60-b27), built on Aug 4 2015 12:19:40 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8) Memory: 4k page, physical 131732324k(11848752k free), swap 8388604k(8279292k free) CommandLine flags: -XX:ErrorFile=/usr/logs/hadoop-yarn/embedded-yarn-ats-hbase/hs_err_pid%p.log -XX:InitialHeapSize=2107717184 -XX:MaxHeapSize=3435134976 -XX:MaxNewSize=1145044992 -XX:MaxTenuringThreshold=6 -XX:OldPLABSize=16 -XX:OnOutOfMemoryError=kill -9 %p -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC Heap par new generation total 618048K, used 197776K [0x00000006f3400000, 0x000000071d2a0000, 0x0000000737800000) eden space 549376K, 36% used [0x00000006f3400000, 0x00000006ff5243a8, 0x0000000714c80000) from space 68672K, 0% used [0x0000000714c80000, 0x0000000714c80000, 0x0000000718f90000) to space 68672K, 0% used [0x0000000718f90000, 0x0000000718f90000, 0x000000071d2a0000) concurrent mark-sweep generation total 1373568K, used 0K [0x0000000737800000, 0x000000078b560000, 0x00000007c0000000) Metaspace used 11629K, capacity 11810K, committed 11904K, reserved 1060864K class space used 1251K, capacity 1316K, committed 1408K, reserved 1048576K ==> /usr/logs/hadoop-yarn/embedded-yarn-ats-hbase/gc.log-201910100851 <== Java HotSpot(TM) 64-Bit Server VM (25.60-b23) for linux-amd64 JRE (1.8.0_60-b27), built on Aug 4 2015 12:19:40 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8) Memory: 4k page, physical 131732324k(1591264k free), swap 8388604k(8280060k free) CommandLine flags: -XX:CMSInitiatingOccupancyFraction=70 -XX:ErrorFile=/usr/logs/hadoop-yarn/embedded-yarn-ats-hbase/hs_err_pid%p.log -XX:InitialHeapSize=3435134976 -XX:MaxHeapSize=3435134976 -XX:MaxNewSize=1145044992 -XX:MaxTenuringThreshold=6 -XX:NewSize=1145044992 -XX:OldPLABSize=16 -XX:OldSize=2290089984 -XX:OnOutOfMemoryError=kill -9 %p -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:ReservedCodeCacheSize=268435456 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 2019-10-10T08:51:22.325-0700: 2.213: [GC (CMS Initial Mark) [1 CMS-initial-mark: 0K(2236416K)] 715687K(3242816K), 0.1832180 secs] [Times: user=0.48 sys=0.07, real=0.19 secs] 2019-10-10T08:51:22.508-0700: 2.396: [CMS-concurrent-mark-start] 2019-10-10T08:51:22.509-0700: 2.397: [CMS-concurrent-mark: 0.001/0.001 secs] [Times: user=0.01 sys=0.01, real=0.00 secs] 2019-10-10T08:51:22.509-0700: 2.397: [CMS-concurrent-preclean-start] 2019-10-10T08:51:22.513-0700: 2.400: [CMS-concurrent-preclean: 0.003/0.003 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 2019-10-10T08:51:22.513-0700: 2.400: [CMS-concurrent-abortable-preclean-start] 2019-10-10T08:51:22.827-0700: 2.715: [GC (Allocation Failure) 2019-10-10T08:51:22.827-0700: 2.715: [ParNew: 894592K->37233K(1006400K), 0.0334809 secs] 894592K->37233K(3242816K), 0.0335760 secs] [Times: user=0.17 sys=0.03, real=0.03 secs] Heap par new generation total 1006400K, used 577717K [0x00000006f3400000, 0x0000000737800000, 0x0000000737800000) eden space 894592K, 60% used [0x00000006f3400000, 0x00000007143d0fc8, 0x0000000729da0000) from space 111808K, 33% used [0x0000000730ad0000, 0x0000000732f2c758, 0x0000000737800000) to space 111808K, 0% used [0x0000000729da0000, 0x0000000729da0000, 0x0000000730ad0000) concurrent mark-sweep generation total 2236416K, used 0K [0x0000000737800000, 0x00000007c0000000, 0x00000007c0000000) Metaspace used 52260K, capacity 52701K, committed 53168K, reserved 1095680K class space used 5905K, capacity 6041K, committed 6096K, reserved 1048576K 2019-10-10T08:51:24.359-0700: 4.247: [CMS-concurrent-abortable-preclean: 1.100/1.847 secs] [Times: user=4.75 sys=0.27, real=1.85 secs] Command failed after 1 tries From: yarn-timelineserver-gc.log Total 180427 20331968 , 0.0109813 secs] 25.966: [GC (Allocation Failure) [PSYoungGen: 766976K->26024K(1047040K)] 786831K->45895K(2136576K), 0.0205774 secs] [Times: user=0.18 sys=0.02, real=0.02 secs] 27.877: [GC (Allocation Failure) [PSYoungGen: 987560K->37814K(1176576K)] 1007431K->57702K(2266112K), 0.0452135 secs] [Times: user=0.31 sys=0.03, real=0.05 secs] 29.872: [GC (Allocation Failure) [PSYoungGen: 1128886K->40013K(1176576K)] 1148774K->59908K(2266112K), 0.0376384 secs] [Times: user=0.25 sys=0.02, real=0.04 secs] 31.621: [GC (Allocation Failure) [PSYoungGen: 1131085K->41607K(1708032K)] 1150980K->61510K(2797568K), 0.0426743 secs] [Times: user=0.19 sys=0.02, real=0.04 secs] 34.381: [GC (Allocation Failure) [PSYoungGen: 1702023K->52721K(1713152K)] 1721926K->75113K(2802688K), 0.0671733 secs] [Times: user=0.32 sys=0.06, real=0.07 secs] 544.663: [GC (Allocation Failure) [PSYoungGen: 1713137K->24633K(2550784K)] 1735529K->55561K(3640320K), 0.0502315 secs] [Times: user=0.37 sys=0.08, real=0.05 secs] 1744.725: [GC (Allocation Failure) [PSYoungGen: 2550329K->6583K(2657792K)] 2581257K->37803K(3747328K), 0.0109360 secs] [Times: user=0.06 sys=0.05, real=0.01 secs] 3364.582: [GC (Allocation Failure) [PSYoungGen: 2603959K->7333K(2513408K)] 2635179K->38561K(3602944K), 0.0106033 secs] [Times: user=0.05 sys=0.05, real=0.01 secs] 4564.508: [GC (Allocation Failure) [PSYoungGen: 2513061K->7397K(2425856K)] 2544289K->38633K(3515392K), 0.0098975 secs] [Times: user=0.05 sys=0.05, real=0.01 secs] 5944.468: [GC (Allocation Failure) [PSYoungGen: 2425573K->7432K(2342400K)] 2456809K->38676K(3431936K), 0.0100541 secs] [Times: user=0.05 sys=0.04, real=0.01 secs] 6904.427: [GC (Allocation Failure) [PSYoungGen: 2342152K->7814K(2263040K)] 2373396K->39065K(3352576K), 0.0100246 secs] [Times: user=0.06 sys=0.05, real=0.01 secs] 7624.583: [GC (Allocation Failure) [PSYoungGen: 2262662K->7335K(2186240K)] 2293913K->38595K(3275776K), 0.0126832 secs] [Times: user=0.07 sys=0.03, real=0.01 secs] 8524.740: [GC (Allocation Failure) [PSYoungGen: 2185895K->7238K(2113536K)] 2217155K->38505K(3203072K), 0.0110849 secs] [Times: user=0.06 sys=0.05, real=0.01 secs] 9604.461: [GC (Allocation Failure) [PSYoungGen: 2113094K->7415K(2043904K)] 2144361K->38690K(3133440K), 0.0187939 secs] [Times: user=0.11 sys=0.07, real=0.02 secs] 10864.545: [GC (Allocation Failure) [PSYoungGen: 2043639K->7287K(1977344K)] 2074914K->38570K(3066880K), 0.0131232 secs] [Times: user=0.12 sys=0.04, real=0.01 secs] From: yarn-timelineserver-gc.log 2019-10-28 14:35:06,753 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_331 (TEZ_DAG_ID): 6 2019-10-28 14:35:06,753 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_332 (TEZ_DAG_ID): 6 2019-10-28 14:35:06,754 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_332 (TEZ_DAG_ID): 6 2019-10-28 14:35:06,754 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_333 (TEZ_DAG_ID): 6 2019-10-28 14:35:06,754 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_333 (TEZ_DAG_ID): 6 2019-10-28 14:35:06,755 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_334 (TEZ_DAG_ID): 6 2019-10-28 14:35:06,755 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_334 (TEZ_DAG_ID): 6 2019-10-28 14:35:06,755 INFO timeline.LogInfo (LogInfo.java:parseForStore(116)) - Parsed 1338 entities from hdfs://hdpnndev/ats/active/application_1572286585508_0006/appattempt_1572286585508_0006_000001/summarylog-appattempt_1572286585508_0006_000001 in 314 msec From: hadoop-yarn-resourcemanager-server.log TARGET=ClientRMService RESULT=SUCCESS 2019-10-28 14:36:08,043 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1572286585508_0005_000001 container=null queue=batchq1 clusterResource=<memory:411648, vCores:128> type=RACK_LOCAL requestedPartition= 2019-10-28 14:36:08,043 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0005_01_000398 Container Transitioned from NEW to ALLOCATED 2019-10-28 14:36:08,043 INFO fica.FiCaSchedulerNode (FiCaSchedulerNode.java:allocateContainer(169)) - Assigned container container_e247_1572286585508_0005_01_000398 of capacity <memory:3072, vCores:1> on host server:45454, which has 6 containers, <memory:70656, vCores:6> used and <memory:32256, vCores:26> available after allocation 2019-10-28 14:36:08,043 INFO resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(200)) - USER=hive OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1572286585508_0005 CONTAINERID=container_e247_1572286585508_0005_01_000398 RESOURCE=<memory:3072, vCores:1> 2019-10-28 14:36:08,043 INFO capacity.ParentQueue (ParentQueue.java:apply(1336)) - assignedContainer queue=batch usedCapacity=0.14925392 absoluteUsedCapacity=0.11940298 used=<memory:49152, vCores:13> cluster=<memory:411648, vCores:128> 2019-10-28 14:36:08,043 INFO capacity.ParentQueue (ParentQueue.java:apply(1336)) - assignedContainer queue=root usedCapacity=0.40298507 absoluteUsedCapacity=0.40298507 used=<memory:165888, vCores:17> cluster=<memory:411648, vCores:128> 2019-10-28 14:36:08,043 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2900)) - Allocation proposal accepted 2019-10-28 14:36:08,103 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0005_01_000398 Container Transitioned from ALLOCATED to ACQUIRED 2019-10-28 14:36:08,300 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1572286585508_0006_000001 container=null queue=batchq1 clusterResource=<memory:411648, vCores:128> type=OFF_SWITCH requestedPartition= 2019-10-28 14:36:08,300 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0006_01_000647 Container Transitioned from NEW to ALLOCATED 2019-10-28 14:36:08,300 INFO fica.FiCaSchedulerNode (FiCaSchedulerNode.java:allocateContainer(169)) - Assigned container container_e247_1572286585508_0006_01_000647 of capacity <memory:3072, vCores:1> on host server:45454, which has 5 containers, <memory:18432, vCores:5> used and <memory:84480, vCores:27> available after allocation 2019-10-28 14:36:08,300 INFO resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(200)) - USER=hive OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1572286585508_0006 CONTAINERID=container_e247_1572286585508_0006_01_000647 RESOURCE=<memory:3072, vCores:1> 2019-10-28 14:36:08,300 INFO capacity.ParentQueue (ParentQueue.java:apply(1336)) - assignedContainer queue=batch usedCapacity=0.15858229 absoluteUsedCapacity=0.12686567 used=<memory:52224, vCores:14> cluster=<memory:411648, vCores:128> 2019-10-28 14:36:08,300 INFO capacity.ParentQueue (ParentQueue.java:apply(1336)) - assignedContainer queue=root usedCapacity=0.41044775 absoluteUsedCapacity=0.41044775 used=<memory:168960, vCores:18> cluster=<memory:411648, vCores:128> 2019-10-28 14:36:08,300 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2900)) - Allocation proposal accepted 2019-10-28 14:36:08,354 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0005_01_000398 Container Transitioned from ACQUIRED to RELEASED 2019-10-28 14:36:08,354 INFO resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(200)) - USER=hive IP=10.10.81.14 OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1572286585508_0005 CONTAINERID=container_e247_1572286585508_0005_01_000398 RESOURCE=<memory:3072, vCores:1> 2019-10-28 14:36:08,354 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(367)) - checking for deactivate of application :application_1572286585508_0005 2019-10-28 14:36:08,485 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0006_01_000647 Container Transitioned from ALLOCATED to ACQUIRED 2019-10-28 14:36:08,736 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(367)) - checking for deactivate of application :application_1572286585508_0006 2019-10-28 14:36:08,987 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(367)) - checking for deactivate of application :application_1572286585508_0006 This are not the complete logs, just a glimpse. I hope it helps to come up with any idea. It gives me the impression it's heap memory issue. But... AppTimelineServer Java heap size = 8G , therefore any thought is appreciated. Regards!

JLo_Hernandez · ‎10-25-2019

Ambari 2.7.3 HDP 3.1.0 Hello team. My service went down 15 days ago and I'm still not able to fix this issue. If someone has face it before, please let me know. From last logs: Caused by: java.net.ConnectException: Call to atsserver/ip:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: atsserver/ip:17020 at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:165) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406) at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103) at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118) at org.apache.hadoop.hbase.ipc.BufferCallBeforeInitHandler.userEventTriggered(BufferCallBeforeInitHandler.java:92) at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:329) at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:315) at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:307) at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1377) at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:329) at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:315) at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireUserEventTriggered(DefaultChannelPipeline.java:929) at org.apache.hadoop.hbase.ipc.NettyRpcConnection.failInit(NettyRpcConnection.java:179) at org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$500(NettyRpcConnection.java:71) at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:269) at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:263) at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:500) at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:479) at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122) at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327) at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343) at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633) at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) ... 1 more Caused by: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: atserver/ip:17020 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hbase.thirdparty.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323) at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ... 7 more Caused by: java.net.ConnectException: Connection refused ... 11 more 2019-10-10 10:03:05,448 ERROR reader.TimelineReaderServer (LogAdapter.java:error(75)) - RECEIVED SIGNAL 15: SIGTERM 2019-10-10 10:03:05,457 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.w.WebAppContext@7357a011{/,null,UNAVAILABLE}{/timeline} 2019-10-10 10:03:05,460 INFO server.AbstractConnector (AbstractConnector.java:doStop(318)) - Stopped ServerConnector@1bae316d{HTTP/1.1,[http/1.1]}{0.0.0.0:8198} 2019-10-10 10:03:05,460 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.s.ServletContextHandler@4bff7da0{/static,jar:file:/usr/hdp/3.1.0.0-78/hadoop-yarn/hadoop-yarn-common-3.1.1.3.1.0.0-78.jar!/webapps/static,UNAVAILABLE} 2019-10-10 10:03:05,460 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.s.ServletContextHandler@3e62d773{/logs,file:///usr/logs/hadoop-yarn/yarn/,UNAVAILABLE} 2019-10-10 10:03:05,462 INFO storage.HBaseTimelineReaderImpl (HBaseTimelineReaderImpl.java:serviceStop(108)) - closing the hbase Connection 2019-10-10 10:03:05,462 INFO zookeeper.ReadOnlyZKClient (ReadOnlyZKClient.java:close(342)) - Close zookeeper connection 0x783a467b to zookeeper:2181,zookeeper:2181,zookeeper:2181,zookeeper:2181,zookeeper:2181 2019-10-10 10:03:05,465 INFO zookeeper.ReadOnlyZKClient (ReadOnlyZKClient.java:close(342)) - Close zookeeper connection 0x0ba2f4ec to zookeeper quorum 2019-10-10 10:03:05,467 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x26daa4f60630022 closed 2019-10-10 10:03:05,467 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(524)) - EventThread shut down 2019-10-10 10:03:05,468 INFO reader.TimelineReaderServer (LogAdapter.java:info(51)) - SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down TimelineReaderServer at atsserver/ip ************************************************************/ I've followed the steps on this workaround and I am getting a new error message https://community.cloudera.com/t5/Support-Questions/ATS-hbase-does-not-seem-to-start/td-p/235155 - guillaume_roger We use to hace kerberized this cluster but it's not anymore a a few months ago. That's why this message is weird 2019-10-25 08:58:31,650 - HdfsResource['/atsv2/hbase/data'] {'security_enabled': False, 'hadoop_bin_dir': '/usr/hdp/3.1.0.0-78/hadoop/bin', 'keytab': '', 'dfs_type': 'HDFS', 'default_fs': 'hdfs://hdpnndev', 'hdfs_resource_ignore_file': '/var/lib/ambari-agent/data/.hdfs_resource_ignore', 'hdfs_site': ..., 'kinit_path_local': '/usr/bin/kinit', 'principal_name': '', 'user': 'hdfs', 'owner': 'yarnats', 'hadoop_conf_dir': '/usr/hdp/3.1.0.0-78/hadoop/conf', 'type': 'directory', 'action': ['create_on_execute'], 'immutable_paths': [u'/mr-history/done', u'/warehouse/tablespace/managed/hive', u'/warehouse/tablespace/external/hive', u'/app-logs', u'/tmp']} 2019-10-25 08:58:31,652 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://masternode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpZs_d1F 2>/tmp/tmpNf5VJV''] {'quiet': False} Then , it got stuck several minutes on this part 2019-10-25 08:58:31,791 - call returned (0, '') 2019-10-25 08:58:31,792 - get_user_call_output returned (0, u'{\n "beans" : [ {\n "name" : "Hadoop:service=NameNode,name=FSNamesystem",\n "modelerType" : "FSNamesystem",\n "tag.Context" : "dfs",\n "tag.HAState" : "active",\n "tag.TotalSyncTimes" : "103 16 ",\n "tag.Hostname" : "zookeeper",\n "MissingBlocks" : 0,\n "MissingReplOneBlocks" : 0,\n "ExpiredHeartbeats" : 0,\n "TransactionsSinceLastCheckpoint" : 162564,\n "TransactionsSinceLastLogRoll" : 97,\n "LastWrittenTransactionId" : 136802255,\n "LastCheckpointTime" : 1571986118000,\n "CapacityTotal" : 94452535787520,\n "CapacityTotalGB" : 87966.0,\n "CapacityUsed" : 17431687303220,\n "CapacityUsedGB" : 16235.0,\n "CapacityRemaining" : 72211133116924,\n "ProvidedCapacityTotal" : 0,\n "CapacityRemainingGB" : 67252.0,\n "CapacityUsedNonDFS" : 0,\n "TotalLoad" : 128,\n "SnapshottableDirectories" : 1,\n "Snapshots" : 1,\n "NumEncryptionZones" : 0,\n "LockQueueLength" : 0,\n "BlocksTotal" : 985264,\n "NumFilesUnderConstruction" : 32,\n "NumActiveClients" : 21,\n "FilesTotal" : 1560325,\n "PendingReplicationBlocks" : 0,\n "PendingReconstructionBlocks" : 0,\n "UnderReplicatedBlocks" : 20,\n "LowRedundancyBlocks" : 20,\n "CorruptBlocks" : 0,\n "ScheduledReplicationBlocks" : 0,\n "PendingDeletionBlocks" : 597,\n "LowRedundancyReplicatedBlocks" : 20,\n "CorruptReplicatedBlocks" : 0,\n "MissingReplicatedBlocks" : 0,\n "MissingReplicationOneBlocks" : 0,\n "HighestPriorityLowRedundancyReplicatedBlocks" : 0,\n "HighestPriorityLowRedundancyECBlocks" : 0,\n "BytesInFutureReplicatedBlocks" : 0,\n "PendingDeletionReplicatedBlocks" : 597,\n "TotalReplicatedBlocks" : 985264,\n "LowRedundancyECBlockGroups" : 0,\n "CorruptECBlockGroups" : 0,\n "MissingECBlockGroups" : 0,\n "BytesInFutureECBlockGroups" : 0,\n "PendingDeletionECBlocks" : 0,\n "TotalECBlockGroups" : 0,\n "ExcessBlocks" : 0,\n "NumTimedOutPendingReconstructions" : 0,\n "PostponedMisreplicatedBlocks" : 24,\n "PendingDataNodeMessageCount" : 0,\n "MillisSinceLastLoadedEdits" : 0,\n "BlockCapacity" : 2097152,\n "NumLiveDataNodes" : 4,\n "NumDeadDataNodes" : 0,\n "NumDecomLiveDataNodes" : 0,\n "NumDecomDeadDataNodes" : 0,\n "VolumeFailuresTotal" : 0,\n "EstimatedCapacityLostTotal" : 0,\n "NumDecommissioningDataNodes" : 0,\n "StaleDataNodes" : 0,\n "NumStaleStorages" : 48,\n "TotalSyncCount" : 76,\n "NumInMaintenanceLiveDataNodes" : 0,\n "NumInMaintenanceDeadDataNodes" : 0,\n "NumEnteringMaintenanceDataNodes" : 0\n } ]\n}', u'') 2019-10-25 08:58:31,793 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://yarnatsserver:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpmLa9e3 2>/tmp/tmpcGNRPR''] {'quiet': False} Finally it finishes with this message 2019-10-25 09:17:53,022 - call returned (143, "-bash: line 1: 28392 Terminated curl -s 'http://yarnatsserver:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' > /tmp/tmpmLa9e3 2> /tmp/tmpcGNRPR") 2019-10-25 09:17:53,022 - call['hdfs haadmin -ns hdpnndev -getServiceState nn2'] {'logoutput': True, 'user': 'hdfs'}

JLo_Hernandez · ‎10-21-2019

Hello dnavarro. Thank you for asking... Actually my question was because I wasn't getting complete size of hdfs that I have on the output command hdfs -du against the size I was watching on the ambari UI. There was a significant difference on it. At the end, it turned up that that difference was due to that hdp 2.6.5 version command hdfs dfs -du is not taking in count the snapshots sizes, therefore, there we have the missmatching I was looking since the beginning and why I needed that output. Just keep in mind that for version grater than 2.6.5 you need to take snapshots size separately if you want to check entire hdfs size 😉 😉

JLo_Hernandez · ‎09-13-2019

I am tyring to get this output. Nevertheless some articles points out this is due to a version matter, which I doubt, since I've seen this behavior on multiple environments with different versions is just that I don't know how to enable it. If someone can point me to the right path I would highly appreciate it. Regards!

JLo_Hernandez · ‎08-07-2019

Is there any sort of formula or how did you came up with this value for users's processes? is it a random value? what can I check within my cluster in order to get a proper value for me?

JLo_Hernandez · ‎03-29-2019

Any one who can help us to get an official site to get his jar to make it work with tibco? Regards!

JLo_Hernandez · ‎10-11-2018

This is a known behavior from Hortonworks, check the discussion https://issues.apache.org/jira/browse/HDFS-8986 Regards!

JLo_Hernandez · ‎10-11-2018

Hello team. Current lecture directly to the involved directory is giving us: hadoop fs -du -h -s </path/dir> t 1.4 T </path/dir> of information but this is not accurate because when getting each and one size from the files and directories within above directory, performing a simple addition operation from excel we are getting 386574657917 which translated in GB is 360GB which is the expected size within this directory. Has someone face this before? Regards!

JLo_Hernandez · ‎09-13-2018

Also check https://community.hortonworks.com/articles/14309/demystify-tez-tuning-step-by-step.html https://community.hortonworks.com/questions/5780/hive-on-tez-query-map-output-outofmemoryerror-java.html https://community.hortonworks.com/questions/12067/what-is-the-workaround-when-getting-hive-outofmemo.html

JLo_Hernandez · ‎09-13-2018

What I understand from documentation (I am applying same changes) and from previous experience is that happens that some jobs has too many inputs and outputs to be handle by the system itself (mappers, containers, reducers etc...) this affects from one side your cluster performance since resources are not enough for such jobs to complete or simply they start to present a considerable delay to what is expected. I've seen that's the moment when OOM error shows up. Saying this, I guess explanation from the smartsense is clear since it tells us "Enabling auto-scaling scales down requests, which may otherwise be too large." Hope this helps 😉

Online	Offline
Last Visited	‎09-22-2020 02:25 PM

Member Since	‎01-14-2016 10:55 PM
Last Visited	‎09-22-2020 02:25 PM
Posts	49
Kudos received	4

Cloudera Community

Re: How to make "hdfs dfs -du" command show 2 size...

Re: "hadoop fs -du" command giving incorrect value

Re: What's the problem on having dfs.namenode.hand...

Re: Getting "Exception in thread "main" java.lang....

Re: How to restart just spark client on Ambari

Re: Yarn Timeline Service V2 not starting

Yarn Timeline Service V2 not starting

Re: How to make "hdfs dfs -du" command show 2 size...

How to make "hdfs dfs -du" command show 2 size col...

Re: Increase open file limit of the user to scale ...

Where can I download hadoop-core-1.2.1.jar?

Re: "hadoop fs -du" command giving incorrect value

"hadoop fs -du" command giving incorrect value

Re: Need to understand impact for setting tez.task...

Re: Need to understand impact for setting tez.task...