Member since
03-23-2015
7
Posts
0
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
5629 | 03-24-2015 09:35 PM |
06-19-2015
10:15 AM
Thanks Ben. I did not get OOM in the logs of (RegionServers), (HMaster), (Zookeeper), Yarn Container logs. So I don't think it was starving from memory. However clouderaManager graphs for Regionserver reported frequent Garbage Collection, mean time for GC ~(23m(ms)) I went adhead with your recommendations on increasing regionserver memory. Since I had the liberty of a node with 122GB of memory, I bumped the (Java Heap Size of HBase RegionServer) from it default 4GB to 16GB. After the switch I no loger have ReagionServer crashes and applications are performing great. Later I checked the heap utilization of RegionServer on ClouderaManager Graphs, it uses (12-14 GB) of the acclocated heap. Also my GC graphs show (86s(ms)) which is 100 times lower that ~(23m(ms)). I wish I can share both these graphs.
... View more
06-16-2015
12:33 PM
CDH 5.3.3 Hbase Version : 0.98.6-cdh5.3.3 One-by-one all RegionServer Crashes while running MR jobs on Yarn, logs indicate Zookeeper related issue. Batch jobs runs for 30-45 mins before regionservers crashes. Eventually crashing all the regionservers. HbaseMaster (Active) & HbaseMaster(Standby) are unaffected. ------------Region Server Log Before crashing-------------- 2015-06-15 20:21:22,498 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1691 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010]
2015-06-15 20:21:22,498 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1691 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010]
2015-06-15 20:21:22,499 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1688 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010]
2015-06-15 20:21:22,499 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1690 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010]
2015-06-15 20:21:27,557 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1614ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=1655ms
2015-06-15 20:21:27,573 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 1655 ms, current pipeline: [10.168.32.216:50010, 10.168.32.217:50010, 10.168.32.214:50010]
2015-06-15 20:21:29,103 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1045ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=1276ms
2015-06-15 20:21:33,820 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-cdh5.3.3--1, built on 04/08/2015 21:51 GMT
2015-06-15 20:21:33,839 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=i-9f2ad949.prod-dis11.aws1
2015-06-15 20:21:33,839 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.7.0_67
2015-06-15 20:21:33,839 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
2015-06-15 20:21:33,839 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.7.0_67-cloudera/jre
2015-06-15 20:21:33,840 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=/var/run/cloudera-scm-agent/process/2052-hbase-REGIONSERVER:/usr/java/jdk1.7.0_67-cloudera/lib/tools.jar:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/....... (big CLASSPATH omitted for brevity)
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=2.6.32-358.14.1.el6.x86_64
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=hbase
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=/var/lib/hbase
2015-06-15 20:21:33,842 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/var/run/cloudera-scm-agent/process/2052-hbase-REGIONSERVER
2015-06-15 20:21:33,843 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=i-802bd856.prod-dis11.aws1:2181,i-0d2bd8db.prod-dis11.aws1:2181,i-942ad942.prod-dis11.aws1:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@24ac7456
2015-06-15 20:21:33,864 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server i-0d2bd8db.prod-dis11.aws1/10.168.32.11:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-15 20:21:33,870 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to i-0d2bd8db.prod-dis11.aws1/10.168.32.11:2181, initiating session
2015-06-15 20:21:33,877 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server i-0d2bd8db.prod-dis11.aws1/10.168.32.11:2181, sessionid = 0x14dfa17990303cd, negotiated timeout = 30000 ------------Corrosponding Zookeeper Log-------------- .....
2015-06-15 20:21:17,707 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.168.32.218:53815
2015-06-15 20:21:17,708 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x14dfa17990303cb with negotiated timeout 180000 for client /10.168.32.218:53815
2015-06-15 20:21:17,963 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.168.32.218:53825
2015-06-15 20:21:17,967 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.168.32.218:53825
2015-06-15 20:21:17,968 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x14dfa17990303cc with negotiated timeout 180000 for client /10.168.32.218:53825
2015-06-15 20:21:32,068 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14dfa1799030007, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2015-06-15 20:21:32,068 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.168.32.216:40742 which had sessionid 0x14dfa1799030007
2015-06-15 20:21:33,870 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /10.168.32.216:59976
2015-06-15 20:21:33,874 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /10.168.32.216:59976
2015-06-15 20:21:33,876 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x14dfa17990303cd with negotiated timeout 30000 for client /10.168.32.216:59976
2015-06-15 20:21:34,208 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14dfa17990303cd, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2015-06-15 20:21:34,209 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.168.32.216:59976 which had sessionid 0x14dfa17990303cd
2015-06-15 20:21:36,321 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.168.32.221:33684 which had sessionid 0x14dfa17990303c3
2015-06-15 20:21:38,409 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14dfa179903000b, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
.....
.....
... View more
03-24-2015
09:35 PM
Was able to fix this however had few hickups. Here are the details 1) Set oozie.launcher.mapreduce.task.classpath.user.precedence to true 2) Precedene was alterd. Howerever encountered: -------------------------------------------------------------------------------------------- Application application_1426549201832_0261 failed 2 times due to AM Container for appattempt_1426549201832_0261_000002 exited with exitCode: 1 due to: Exception from container-launch. Container id: container_1426549201832_0261_02_000001 Exit code: 1 Stack trace: ExitCodeException exitCode=1: at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Container exited with a non-zero exit code 1 .Failing this attempt.. Failing the application. 2015-03-24 15:43:30,899 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster java.lang.NoSuchMethodError: org.apache.hadoop.mapred.TaskLog.createLogSyncer()Ljava/util/concurrent/ScheduledExecutorService; at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.<init>(MRAppMaster.java:244) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.<init>(MRAppMaster.java:227) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1390) 2015-03-24 15:43:30,913 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting with status 1 -------------------------------------------------------------------------------------------- Suggesting... Jars found in the artifact lib for hadoop-code were in conflict with the expected jars by the CDH5.3.2 cluster we had. 3) Hadoop-core jars from Artifacts POM were excluded from the artifact package.... <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-core</artifactId> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-common</artifactId> <scope>provided</scope> </dependency> Things started to work as expected hereafter.
... View more
03-23-2015
04:04 PM
On CDH5.3.2, classpath for artifacts lib jar does not precedes over Cloudera Jars. Our artifact uses (aws-java-sdk-1.7.12.jar) which is part of artifact Lib dir. However Cloudera loads a previous version jar (aws-java-sdk-1.7.4.jar) Current Jar loading Precedence: 1) /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hadoop/lib/aws-java-sdk-1.7.4.jar 2) /data3/yarn/nm/usercache/oozie/appcache/application_.../container_.../aws-java-sdk-1.7.12.jar This throws throws java.lang.NoSuchMethodError Desired Jar Loading Precedence: 1) /data3/yarn/nm/usercache/oozie/appcache/application_.../container_.../aws-java-sdk-1.7.12.jar 2) /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hadoop/lib/aws-java-sdk-1.7.4.jar Tried vairous online suggested property file changes to 'true'. * oozie.launcher.mapreduce.task.classpath.user.precedence * mapreduce.task.classpath.user.precedence * mapreduce.task.classpath.first * mapreduce.job.user.classpath.first * mapreduce.user.classpath.first * mapreduce.task.classpath.user.precedence However all the above in the job.properties and workflow file did not change the precedence. I working with a Java action here. This feature was easily reverted in previous CHD4+ releases. Please help.
... View more
Labels: