Member since: 12-14-2015
Posts: 12
Kudos Received: 11
Solutions: 2

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 7247 | 03-02-2016 04:01 PM
 | 8586 | 02-26-2016 01:22 PM
05-30-2017
06:45 PM
Hi Matt, I tried exactly what you suggested while I was waiting for your reply. I was able to access the UI without the data flow running. I looked at the System Diagnostics you mentioned earlier. It was "4 times" without the data flow running and is still "8 times" now after I started the data flows. Our cluster has 2 nodes and each has 16 cores. The "max timer driven thread count" is set to 64 and the "max event driven thread count" is set to 12. I've been monitoring "top"; CPU usage at this time (busy hour) is about 700%. The good news is that after I started NiFi with the flow stopped and then manually restarted the data flows, the problem I had this morning has not recurred yet. Heartbeats are generated at reasonable intervals - about 7 seconds. What happened this morning is still a mystery to me, but I am happy that it's working now. Thank you so much for all the help!!! Xi
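A quick, hedged way to sanity-check those settings against the hardware from the shell (an often-cited rule of thumb is to keep "max timer driven thread count" at roughly 2-4x the core count per node, but verify against the admin guide for your NiFi version):

nproc                          # logical cores on this node (16 here, per the post)
top -b -n 1 | grep -i java     # one-shot CPU snapshot; ~700% means roughly 7 of 16 cores busy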
05-30-2017
04:13 PM
Hi Matt, We are using jdk1.8.0_31 and nifi.version=1.0.0.2.0.1.0-12. The following are the first few lines of the jstat output:
[root@be-bi-nifi-441 conf]# /usr/java/default/bin/jstat -gcutil 3248 250 1000
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00 100.00  36.84  33.25  95.26  89.80   5545 1731.464     4   3.563 1735.027
  0.00 100.00  77.89  33.25  95.26  89.80   5545 1731.464     4   3.563 1735.027
  0.00 100.00  93.68  33.25  95.26  89.80   5546 1731.464     4   3.563 1735.027
  0.00 100.00  93.68  33.25  95.26  89.80   5546 1731.464     4   3.563 1735.027
  0.00 100.00  26.32  33.90  95.26  89.80   5546 1731.930     4   3.563 1735.492
  0.00 100.00  64.21  33.90  95.26  89.80   5546 1731.930     4   3.563 1735.492
  0.00 100.00  93.68  33.90  95.26  89.80   5547 1731.930     4   3.563 1735.492
  0.00 100.00  93.68  33.90  95.26  89.80   5547 1731.930     4   3.563 1735.492
It looks like NiFi is busy with GC, just as you suspected, but I do not understand why. Can you please give me some advice on how to debug this without UI access? Thank you very much! Xi
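Since the UI is unreachable, the JVM itself can be inspected from the command line. A hedged sketch, reusing PID 3248 from the jstat run above; the bootstrap.conf path is the usual HDF location and may differ on your install:

# Class histogram of live objects (note: this forces a full GC). Run it a few
# times to see which classes keep growing.
/usr/java/default/bin/jmap -histo:live 3248 | head -n 30

# Thread dump, to see what the flow threads are doing while the node is unresponsive.
/usr/java/default/bin/jstack -l 3248 > /tmp/nifi-threads.txt

# Heap size NiFi was started with; the java.arg.* entries in bootstrap.conf
# typically carry -Xms/-Xmx.
grep -i 'java.arg' /usr/hdf/current/nifi/conf/bootstrap.conf | grep -i xm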
05-30-2017
03:15 PM
Hi Matt, Thank you so much for getting back to me so quickly. I cannot access the NiFi UI because the nodes connect and then disconnect from the cluster so quickly. I do see the following entry in the nifi node log about every minute:
2017-05-30 11:01:16,777 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@5af47414 checkpointed with 4 Records and 0 Swap Files in 10 milliseconds (Stop-the-world time = 2 milliseconds, Clear Edit Logs time = 3 millis), max Transaction ID 114
Is this normal, or an indication of a problem? The cluster was fine yesterday when I checked, and nothing has changed - I am the only person who can make changes, so I know for sure. Thanks again! Xi
05-30-2017
02:46 PM
Hi, We have a 2-node NiFi production cluster on the HDF-2.0.1.0 release. It has worked fine for over a year. This morning both nodes keep connecting, getting connected, and then being disconnected from the cluster due to lack of heartbeat. nifi.cluster.protocol.heartbeat.interval in nifi.properties is at the default 5 sec. From the nifi node log, I do not see heartbeats being created every 5 seconds - in my working dev cluster they are created roughly every 5 seconds, but in this production cluster they are created only every 1 or 2 minutes.
2017-05-30 10:30:07,838 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:30:07,653 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:30:07,838; send took 184 millis
2017-05-30 10:31:14,986 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:31:14,515 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:31:14,986; send took 471 millis
2017-05-30 10:33:44,971 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:33:44,404 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:33:44,971; send took 566 millis
2017-05-30 10:34:15,280 INFO [Clustering Tasks Thread-3] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:34:15,122 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:34:15,280; send took 157 millis
2017-05-30 10:36:21,204 INFO [Clustering Tasks Thread-3] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:36:20,673 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:36:21,204; send took 530 millis
This cluster worked fine yesterday and nothing changed on the system. Can anyone give me some insight into why the heartbeats are not created as configured? Thank you very much in advance! Xi Sanderson
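A hedged way to measure the actual heartbeat spacing directly from a node, using only the property and log already quoted above (the config and log paths below are the usual HDF defaults; adjust if yours differ):

# Confirm the configured interval (5 sec by default, as noted above).
grep 'nifi.cluster.protocol.heartbeat.interval' /usr/hdf/current/nifi/conf/nifi.properties

# Pull recent heartbeat timestamps to see how far apart they really are.
grep 'Heartbeat created at' /var/log/nifi/nifi-app.log | tail -n 20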
Labels:
- Apache NiFi
03-02-2016
04:01 PM
1 Kudo
The problem was solved by changing the ulimit on both the service accounts and the user accounts; 32k for open files and 64k for processes worked for me.
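For reference, a minimal sketch of making those limits persistent in /etc/security/limits.conf (the account names below are placeholders; apply the same lines to whichever service and user accounts run the Hadoop/Hive processes, then re-login or restart the services so they take effect):

# /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/)
# <account>  <type>  <item>   <value>
hive         -       nofile   32768    # ~32k open files
hive         -       nproc    65536    # ~64k processes/threads
yarn         -       nofile   32768
yarn         -       nproc    65536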
02-29-2016
02:39 PM
2 Kudos
Hi,
We have some queries that work fine with a small set of data, but when I pull a month's worth of data, I get the following error:
java.io.IOException: Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details : local host is: "be-bi-secondary-528.soleocommunications.com/10.10.11.6"; destination host is: "be-bi-secondary-528.soleocommunications.com":8020;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
    at org.apache.hadoop.ipc.Client.call(Client.java:1431)
    at org.apache.hadoop.ipc.Client.call(Client.java:1358)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy16.mkdirs(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:558)
    at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy17.mkdirs(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:3008)
    at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2978)
    at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1047)
    at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1043)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:1043)
    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:1036)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1877)
    at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:226)
    at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:137)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1655)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1414)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1195)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Couldn't set up IO streams
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:791)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:373)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1493)
    at org.apache.hadoop.ipc.Client.call(Client.java:1397)
    ... 39 more
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:713)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:784)
    ... 42 more
Error launching map-reduce job
These queries used to work with large data sets before. I started seeing this problem after I upgraded HDP from 2.2.4.2 to 2.3.2.
I tried a few things people suggested online, such as increasing the ulimit (from 1024 to 64000) and increasing the map/reduce java.opts (in my hive session before running the job, from the system setting of -Xmx2867m to -Xmx10240m), but they didn't help. I also saw people talking about tuning max data transfer threads, but my system is already set to a pretty high value suggested by SmartSense. Any help will be greatly appreciated! Xi
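A hedged checklist for pinning down which limit is actually hit when "unable to create new native thread" shows up (the PID and account below are placeholders):

# Limits as seen by the already-running process (ulimit changes only apply to new logins/restarted services).
cat /proc/<pid>/limits | egrep 'processes|files'

# Total threads currently running on the box.
ps -eLf | wc -l

# Per-user process/thread cap for the account that launches the job.
su - hive -c 'ulimit -u'

# System-wide ceilings that can also trigger this error.
sysctl kernel.pid_max kernel.threads-max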
Labels:
- Apache Hive
02-26-2016
01:22 PM
1 Kudo
Hi all, I opened a support ticket and got an answer back regarding the metastore alerts. It is a known bug in the Ambari release I have (2.1.2): https://issues.apache.org/jira/browse/AMBARI-14424 The suggested solution is to edit the script /var/lib/ambari-server/resources/common-services/HIVE/0.12.0.2.0/package/alerts/alert_hive_metastore.py, search for 30 and replace it with 120, then restart the Ambari server. I still have to monitor how the change works out. Thanks for all the help from you guys! Xi
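For anyone applying the same workaround, a hedged sketch (back up the stock script first, and review the grep matches rather than blindly replacing every "30", since the value may appear in more than one place):

ALERT=/var/lib/ambari-server/resources/common-services/HIVE/0.12.0.2.0/package/alerts/alert_hive_metastore.py
cp "$ALERT" "$ALERT.bak"     # keep the original for rollback
grep -n '30' "$ALERT"        # locate the hard-coded timeout before editing
# change the timeout value(s) from 30 to 120 in your editor, then restart Ambari:
ambari-server restart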
02-25-2016
04:25 PM
1 Kudo
Hi, Yes, we are using SmartSense. I will open a support ticket too. Here is one of the alerts:

Services Reporting Alerts
OK [HIVE]
CRITICAL [HIVE]

HIVE
OK - Hive Metastore Process: Metastore OK - Hive command took 9.718s
CRITICAL - Hive Metastore Process: Metastore on be-bi-secondary-528.soleocommunications.com failed (Execution of 'ambari-sudo.sh su ambari-qa -l -s /bin/bash -c 'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/bin:/usr/bin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/sbin/:/usr/hdp/current/hive-metastore/bin'"'"' ; export HIVE_CONF_DIR='"'"'/usr/hdp/current/hive-metastore/conf/conf.server'"'"' ; hive --hiveconf hive.metastore.uris=thrift://be-bi-secondary-528.soleocommunications.com:9083 --hiveconf hive.metastore.client.connect.retry.delay=1 --hiveconf hive.metastore.failure.retries=1 --hiveconf hive.metastore.connect.retries=1 --hiveconf hive.metastore.client.socket.timeout=14 --hiveconf hive.execution.engine=mr -e '"'"'show databases;'"'"''' was killed due timeout after 30 seconds)

This notification was sent to Ambari Alert From TheOracle
Apache Ambari 2.1.2

Thanks, Xi
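A hedged way to see how long the metastore check really takes under load is to time the same query by hand, with the URI and conf dir taken from the alert text above (run from the metastore host):

export HIVE_CONF_DIR=/usr/hdp/current/hive-metastore/conf/conf.server
time hive --hiveconf hive.metastore.uris=thrift://be-bi-secondary-528.soleocommunications.com:9083 \
          --hiveconf hive.execution.engine=mr -e 'show databases;'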
02-25-2016
03:49 PM
1 Kudo
Hi Artem, I implemented the suggestion in the thread Neeraj referred to, but still have the issue. On light days, I get 5 or 6; on heavy days, still over 10. I am also getting a lot of Hive Metastore check alerts (... '"'"'show databases;'"'"''' was killed due timeout after 30 seconds) with OK and CRITICAL in the same email. Last night I got hundreds of those. It has to do with the load on the cluster. Any help is appreciated! Xi
01-20-2016
07:19 PM
1 Kudo
Hi Neeraj, Thank you very much for the link. I will give it a try. Xi