Member since 02-18-2019
83 Posts
3 Kudos Received
0 Solutions
10-26-2020
11:32 PM
Hello,

On one of our data nodes, we are increasingly getting alerts related to Data Node Pause Duration. So far, this is happening on a single data node out of nine. The following message was captured from the DN logs:

2020-10-27 16:20:05,140 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1821ms GC pool 'ParNew' had collection(s): count=1 time=2075ms)

The Java heap size of the DataNode is currently 6GB.
CM / CDH - 5.16.x

Any help is appreciated.

Regards,
Amn
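To check whether the pauses line up with GC activity, it can help to pull the numbers out of the DataNode log and compare them over time. The following is only a minimal sketch, assuming the log lines keep the format quoted in the post above; `parse_jvm_pause` is a hypothetical helper written for illustration, not part of Hadoop or Cloudera Manager.

```python
import re

# Patterns are based only on the single log line quoted in the post.
PAUSE_RE = re.compile(r"pause of approximately (\d+)ms")
GC_POOL_RE = re.compile(
    r"GC pool '([^']+)' had collection\(s\): count=(\d+) time=(\d+)ms"
)


def parse_jvm_pause(line):
    """Return the pause length and per-pool GC stats, or None if no pause is reported."""
    pause = PAUSE_RE.search(line)
    if pause is None:
        return None
    pools = {
        name: {"count": int(count), "time_ms": int(time_ms)}
        for name, count, time_ms in GC_POOL_RE.findall(line)
    }
    return {"pause_ms": int(pause.group(1)), "gc_pools": pools}


sample = (
    "2020-10-27 16:20:05,140 INFO org.apache.hadoop.util.JvmPauseMonitor: "
    "Detected pause in JVM or host machine (eg GC): pause of approximately 1821ms "
    "GC pool 'ParNew' had collection(s): count=1 time=2075ms"
)
result = parse_jvm_pause(sample)
print(result)
```

When a line names a GC pool, as this one does, the pause was accompanied by a collection in that pool. A pause line with no GC pool listed would point more toward the host (for example swapping or CPU starvation) than toward the JVM heap.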
09-13-2020
07:33 PM
Hello,

How can we update the DNS part of a host name, for example from 192.168.0.1.test-1.pl to 192.168.0.1.test.co.pl, so that we can open the CM GUI via 192.168.0.1.test.co.pl:7180 and the other web GUIs (Hue, Impala, etc.) with the new DNS name test.co.pl?

Regards,
Amn
07-19-2020
03:07 AM
Hi @GangWar, thanks for your reply.
07-17-2020
12:44 AM
Hello,

I need to reboot one of my data nodes to fix an underlying network issue. The OS team needs approximately 3 hours to complete this activity. What would be the best approach?

- Decommission and recommission the host / DN.
- Move the DN to Offline Mode (a feature in CM 5.16.x).

Any help / guidance is appreciated.

Regards,
Amn
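For context on why a 3-hour window matters: the NameNode treats a DataNode as dead after a period without heartbeats, and then starts re-replicating its blocks. Below is a rough sketch of that arithmetic, assuming the stock HDFS defaults for `dfs.namenode.heartbeat.recheck-interval` (300000 ms) and `dfs.heartbeat.interval` (3 s); the cluster in question may override either value, and the function name is made up for illustration.

```python
def dead_node_timeout_seconds(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Seconds without heartbeats before the NameNode marks a DataNode dead.

    Uses the HDFS formula 2 * recheck interval + 10 * heartbeat interval.
    The default arguments are the stock HDFS defaults (an assumption here).
    """
    return 2 * (recheck_interval_ms / 1000) + 10 * heartbeat_interval_s


timeout_s = dead_node_timeout_seconds()
outage_s = 3 * 60 * 60  # the roughly 3-hour window mentioned in the post

print(f"dead-node timeout: {timeout_s / 60:.1f} minutes")
print(f"outage exceeds timeout: {outage_s > timeout_s}")
```

With those defaults the timeout is 10.5 minutes, so a plain 3-hour shutdown would trigger re-replication unless the node is decommissioned or placed in a maintenance state first.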
Labels: Cloudera Manager
05-14-2020
06:02 AM
Hi @Madhur,

I appreciate your assistance. I am using CM; where is this setting in CM for making the change at the cluster level? And to confirm, do these values have to be passed in seconds? Could you also provide steps or a document outlining how to change this when submitting Spark jobs?
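On the per-job part of the question, Spark properties can be passed to `spark-submit` with `--conf key=value`, and time-valued properties accept an explicit unit suffix such as `s` or `m`. The sketch below builds such a command line for `spark.rpc.askTimeout` (the property named in the error) and `spark.network.timeout`. The `600s` value, the helper name, and the `my_job.py` application path are illustrative assumptions, not recommendations for this cluster.

```python
import shlex


def spark_submit_command(app, timeout="600s", extra_conf=None):
    """Build a spark-submit argument list with raised RPC/network timeouts.

    `timeout` carries an explicit unit suffix so there is no ambiguity
    about seconds versus milliseconds. The default is illustrative only.
    """
    conf = {
        "spark.network.timeout": timeout,
        "spark.rpc.askTimeout": timeout,
    }
    conf.update(extra_conf or {})
    cmd = ["spark-submit", "--master", "yarn"]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# "my_job.py" is a placeholder application, not taken from the thread.
cmd = spark_submit_command("my_job.py")
print(" ".join(shlex.quote(part) for part in cmd))
```

Raising the timeout only addresses the symptom. If executors or the driver stop responding for two minutes, it is still worth checking for long GC pauses or network problems on the affected hosts.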
05-14-2020
01:08 AM
Hi @Madhur,

This is happening with all Spark jobs. There have been no changes in the code or the cluster, and the failures are random.
05-13-2020
09:42 PM
Hello All,

We are running Spark jobs via YARN and they are failing with the error below. Any help or pointers to fix this are much appreciated.

Shell output:
main : command provided 1
main : run as user is TEST1
main : requested yarn user is TEST1
Writing to tmp file /data/8/yarn/nm/nmPrivate/application_1587389136999_0013/container_e56_1587389136999_0013_01_000477/container_e56_1587389136999_0013_01_000477.pid.tmp
Writing to cgroup task files...
Container exited with a non-zero exit code 1
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Failure.recover(Try.scala:216)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
... 8 more

Regards,
Amn
Labels: Apache Spark, Cloudera Manager
04-01-2020
10:41 PM
Hi @venkatsambath,

I appreciate your help. I would like some assistance with configuring the NN heap size. Which would be the best way to move forward?

- Increase the NN heap via Cloudera Manager (HDFS > Config > NN > Heap; HA cluster), or
- Change it via /etc/hadoop/conf/hadoop-env.sh (HADOOP_NAMENODE_OPTS).

Kindly advise.

Regards,
Amn
03-25-2020
12:55 AM
Hi @StevenOD,

I tried to run the rebalance tool, but I get the error below:

Failed RPC negotiation. Trace:
0325 20:44:12.092074 (+ 0us) reactor.cc:577] Submitting negotiation task for server connection from XXX.XX.XXX.XXX:52183
0325 20:44:12.092167 (+ 93us) server_negotiation.cc:176] Beginning negotiation
0325 20:44:12.092170 (+ 3us) server_negotiation.cc:365] Waiting for connection header
0325 20:44:12.096890 (+ 4720us) server_negotiation.cc:373] Connection header received
0325 20:44:12.098104 (+ 1214us) server_negotiation.cc:329] Received NEGOTIATE NegotiatePB request
0325 20:44:12.098105 (+ 1us) server_negotiation.cc:412] Received NEGOTIATE request from client
0325 20:44:12.098128 (+ 23us) server_negotiation.cc:341] Sending NEGOTIATE NegotiatePB response
0325 20:44:12.098177 (+ 49us) server_negotiation.cc:197] Negotiated authn=SASL
0325 20:44:12.104531 (+ 6354us) server_negotiation.cc:329] Received TLS_HANDSHAKE NegotiatePB request
0325 20:44:12.106114 (+ 1583us) server_negotiation.cc:341] Sending TLS_HANDSHAKE NegotiatePB response
0325 20:44:12.115849 (+ 9735us) server_negotiation.cc:329] Received TLS_HANDSHAKE NegotiatePB request
0325 20:44:12.116299 (+ 450us) server_negotiation.cc:341] Sending TLS_HANDSHAKE NegotiatePB response
0325 20:44:12.116346 (+ 47us) server_negotiation.cc:581] Negotiated TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 TLSv1.2 Kx=ECDH Au=RSA Enc=AES(256) Mac=SHA384
0325 20:44:12.123359 (+ 7013us) negotiation.cc:304] Negotiation complete: Network error: Server connection negotiation failed: server connection from XXX.XX.XXX.XXX:52183: BlockingRecv error: failed to read from TLS socket: Cannot send after transport endpoint shutdown (error 108)
Metrics: {"server-negotiator.queue_time_us":53}

Thanks,
Amn
03-22-2020
07:19 PM
Hi,

We are repeatedly getting alerts for NN pause duration:

The health test result for NAME_NODE_PAUSE_DURATION has become bad: Average time spent paused was 44.5 second(s) (74.25%) per minute over the previous 5 minute(s). Critical threshold: 60.00%.

CM / CDH - 5.16.2
Current NN Heap Size - 4GB
Block Used - 5TB

I would appreciate some assistance in fixing this.

Thanks,
Amn
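A quick way to sanity-check the 4GB figure is the commonly cited Cloudera sizing guideline of roughly 1 GB of NameNode heap per million blocks. The sketch below applies that rule of thumb. The post gives 5TB of usage but no block count, so the block counts here are hypothetical placeholders, and the helper name is made up for illustration; the real count is shown on the NameNode web UI.

```python
import math


def recommended_nn_heap_gb(block_count, gb_per_million_blocks=1.0, minimum_gb=1):
    """Estimate NameNode heap (GB) from a block count using a rule of thumb.

    The 1 GB per million blocks ratio is a general guideline, not a
    measurement of this cluster.
    """
    estimate = math.ceil(block_count / 1_000_000 * gb_per_million_blocks)
    return max(minimum_gb, estimate)


# Hypothetical block counts; substitute the value reported by the NameNode.
for blocks in (500_000, 4_000_000, 12_500_000):
    print(f"{blocks:>12,} blocks -> ~{recommended_nn_heap_gb(blocks)} GB heap")
```

If the actual block count puts the estimate well above the current 4GB, an undersized heap becomes a plausible cause of the long pauses. A large number of small files can inflate the block count far beyond what 5TB of data would suggest.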
Labels: Apache Hadoop, Cloudera Manager, HDFS