Member since: 08-08-2017
Posts: 1652
Kudos Received: 30
Solutions: 11

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1513 | 06-15-2020 05:23 AM |
|  | 10229 | 01-30-2020 08:04 PM |
|  | 1666 | 07-07-2019 09:06 PM |
|  | 6944 | 01-27-2018 10:17 PM |
|  | 3888 | 12-31-2017 10:12 PM |
02-26-2023 10:29 PM
Hello Mike, your existing 3 ZooKeeper nodes can likely serve your expansion requirements. Monitor the CPU and network of the ZooKeeper nodes as your Kafka cluster grows; when you reach the throughput limit, you can expand ZooKeeper to 5 nodes. Remember that the ZooKeeper nodes need to stay in sync at all times, so more ZooKeeper nodes means more traffic to keep them in sync while those same nodes are handling Kafka requests; more is not necessarily better. I would suggest staying with 3 ZooKeeper nodes while expanding your Kafka cluster under close monitoring, and growing to 5 when CPU/network throughput reaches its limit. You can also consider tuning the ZooKeeper nodes, e.g. dedicated disks, better network throughput, isolating the ZooKeeper process, and disabling swap.
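A hedged illustration of that monitoring (assuming ZooKeeper's four-letter-word commands are enabled; on newer releases they must be whitelisted via `4lw.commands.whitelist`, and the hostname below is a placeholder):

```
# Pull latency and connection stats from a ZooKeeper node
echo mntr | nc zk01.example.com 2181
# Fields worth watching: zk_avg_latency, zk_max_latency,
# zk_outstanding_requests, zk_num_alive_connections
```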
02-22-2023 08:39 AM
We have an HDP cluster, version 2.6.5. When we look at the NameNode logs we can see the following warnings:

```
2023-02-20 15:58:31,377 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction
2023-02-20 16:00:39,037 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction
2023-02-20 16:01:43,962 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193594954980-193594954980 took 1329ms
2023-02-20 16:02:47,129 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595018764-193595018764 took 1321ms
2023-02-20 16:03:52,763 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595106645-193595106646 took 1344ms
2023-02-20 16:04:56,276 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595175233-193595175233 took 1678ms
2023-02-20 16:06:01,067 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595252052-193595252052 took 1265ms
2023-02-20 16:07:06,447 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595320796-193595320796 took 1273ms
```

In our HDP cluster the HDFS service includes 2 NameNode services and 3 JournalNodes; the cluster includes 736 DataNode machines, and the HDFS service manages all the DataNodes. We want to understand the reason for the following warning, and how to avoid these messages with a proactive solution:

```
server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595018764-193595018764 took 1321ms
```
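For context, an assumption based on the message rather than a confirmed diagnosis: this warning is typically emitted when the JournalNode's sync of the edit log to disk is slow, which usually points at I/O latency on the disk backing the JournalNode edits directory. A hedged way to watch that latency on a JournalNode host (the device name is a placeholder):

```
# Extended I/O stats every second; sustained high await/w_await on the
# device backing the JournalNode edits dir correlates with slow syncs
iostat -x 1 /dev/sda
```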
Labels:
- Ambari Blueprints
01-24-2023 09:23 AM
We have an HDP cluster, version 2.6.5, with the Ambari platform. Here is an example from our Ambari lab cluster with 5 NodeManager machines. Regarding the YARN service: is it possible to add a widget in Ambari that shows CPU core consumption? If not, what are the other ways to find the cores consumed by YARN from the CLI? Another way we found is via `resource_manager:8088/cluster`, as follows. So is it possible to find some API / CLI that can capture the VCores Used?
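For reference, a sketch using the ResourceManager REST API, which in Hadoop 2.x exposes cluster-wide vcore counters (the hostname is a placeholder for your active ResourceManager):

```
# Cluster-wide scheduler metrics, including vcore usage
curl -s http://resource_manager:8088/ws/v1/cluster/metrics
# The JSON response contains fields such as
# "allocatedVirtualCores", "availableVirtualCores" and "totalVirtualCores"
```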
Labels:
- Ambari Blueprints
01-24-2023 08:41 AM
We have a Spark production cluster with the YARN service (based on HDP version 2.6.5). The total number of NodeManager services is 745 (actually 745 Linux machines), and the active and standby YARN ResourceManagers are installed on different master machines. We found that the following parameters are not defined in our YARN configuration (yarn-site.xml):

```
yarn.scheduler.increment-allocation-vcores
yarn.scheduler.increment-allocation-mb
```

These parameters are defined neither in Ambari nor in the YARN XML configuration files. I want to know the meaning of the parameter yarn.scheduler.increment-allocation-vcores, and what the effect is when these parameters are not defined in our configuration. From YARN best-practice guides we understand that both parameters are part of the YARN configuration, but we are not sure whether we must add them to the YARN custom configuration. From the documentation we found:

"Minimum and maximum allocation unit in YARN: two resources (memory and CPU), as of Hadoop 2.5.1, have minimum and maximum allocation units in YARN, as set by the configurations in yarn-site.xml. Basically, it means the RM can only allocate memory to containers in increments of `yarn.scheduler.minimum-allocation-mb` without exceeding `yarn.scheduler.maximum-allocation-mb`, and it can only allocate CPU vcores to containers in increments of `yarn.scheduler.minimum-allocation-vcores` without exceeding `yarn.scheduler.maximum-allocation-vcores`. If changes are required, set the above configurations in yarn-site.xml on the RM nodes, and restart the RM."

References:
https://docs.trifacta.com/display/r076/Tune+Cluster+Performance
https://stackoverflow.com/questions/58522138/how-to-control-yarn-container-allocation-increment-properly
https://pratikbarjatya.github.io/learning/best-practices-for-yarn-resource-management/
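For illustration, a hedged sketch of how these settings would appear if added to yarn-site.xml. Note the assumption here: `yarn.scheduler.increment-allocation-mb` and `yarn.scheduler.increment-allocation-vcores` are Fair Scheduler settings; with the Capacity Scheduler, container sizes are instead rounded up to multiples of `yarn.scheduler.minimum-allocation-mb`/`-vcores`, so leaving the increment parameters undefined simply means the scheduler falls back to its defaults. The values below are examples, not recommendations:

```
<!-- Example values only; honored by the Fair Scheduler -->
<property>
  <name>yarn.scheduler.increment-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.scheduler.increment-allocation-vcores</name>
  <value>1</value>
</property>
```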
Labels:
- Ambari Blueprints
01-12-2023 12:09 AM
I also want to say that restarting the NodeManager, or fully restarting the YARN service, fixed the problem. But as you know, that isn't the right solution to have to apply every time one of the NodeManagers dies.
12-23-2022 11:10 AM
I have the same problem, but in my case the JVM pause detections happen every 15 minutes, detecting pauses between 28337ms and 853466ms.
12-09-2022 05:24 AM
2 Kudos
The following areas normally cause this problem:
1) The connection from the Ambari agent host to the Ambari server was lost.
2) A firewall issue blocked connections.
3) The hostname and IP address are not set correctly in /etc/hosts.

You can compare the output using this API:

```
curl -u user:password http://AmbariHost:8080/api/v1/hosts
```
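As a hedged follow-up on the agent side (a standard Ambari installation is assumed), you can verify the agent process and the name resolution before comparing the API output:

```
# Confirm the Ambari agent is running on the affected host
ambari-agent status
# Confirm the fully-qualified hostname matches what /etc/hosts declares
hostname -f
```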
11-29-2022 08:07 AM
We have 3 Kafka broker machines on RHEL 7.9 Linux (each machine is strong physical DELL hardware: 512G memory and 96 CPU cores). The Kafka cluster is in production mode. The Kafka version is 2.7.x, and the Kafka disks are in a JBOD configuration; each Kafka broker has 8 JBOD disks, as we can see from the following `df -h` details:

```
df -h
/dev/sdc  1.7T  929G  748G   56%  /kafka/kafka_logs2
/dev/sdd  1.7T  950G  727G   57%  /kafka/kafka_logs3
/dev/sde  1.7T  999G  678G   60%  /kafka/kafka_logs4
/dev/sdf  1.7T  971G  706G   58%  /kafka/kafka_logs5
/dev/sdg  1.7T  1.7T   20K  100%  /kafka/kafka-logs6  <-----------------
/dev/sdh  1.7T  962G  714G   58%  /kafka/kafka_logs7
/dev/sdi  1.7T  1.1T  621G   63%  /kafka/kafka_logs8
```

As we can see above, the disk `/kafka/kafka-logs6` reached `100%` used. After a short investigation we found that the Kafka broker is not tolerant when one disk fails or reaches 100%; as a result, this Kafka broker is now down. Here is the Kafka `server.log`:

```
[2022-11-29 15:43:59,723] ERROR Error while writing to checkpoint file /kafka/kafka-logs6 .............
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:326)
        at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
        at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
        at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
        at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
        at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
        at java.io.BufferedWriter.flush(BufferedWriter.java:254)
        at kafka.server.checkpoints.CheckpointFile.liftedTree1$1(CheckpointFile.scala:108)
        at kafka.server.checkpoints.CheckpointFile.write(CheckpointFile.scala:92)
        at kafka.server.checkpoints.LeaderEpochCheckpointFile.write(LeaderEpochCheckpointFile.scala:70)
        at kafka.server.epoch.LeaderEpochFileCache.flush(LeaderEpochFileCache.scala:292)
        at kafka.server.epoch.LeaderEpochFileCache.$anonfun$truncateFromEnd$1(LeaderEpochFileCache.scala:238)
        at kafka.server.epoch.LeaderEpochFileCache.truncateFromEnd(LeaderEpochFileCache.scala:235)
        at kafka.log.Log.$anonfun$new$1(Log.scala:305)
        at kafka.log.Log.<init>(Log.scala:305)
        at kafka.log.Log$.apply(Log.scala:2549)
        at kafka.log.LogManager.loadLog(LogManager.scala:273)
        at kafka.log.LogManager.$anonfun$loadLogs$12(LogManager.scala:352)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
```

**From my perspective**, when we have **8 disks** on each broker and one disk fails (e.g. reaches 100%), we expect the Kafka broker to stay alive even with one failed disk. Following the above scenario we searched Kafka's `server.properties` for the parameter that would configure the broker to be tolerant when one disk fails, but we did not find it, or maybe we do not know what to set in order to make the broker tolerate a single disk failure. The full parameters are:

```
more server.properties
auto.create.topics.enable=false
auto.leader.rebalance.enable=true
background.threads=10
log.retention.bytes=-1
log.retention.hours=48
delete.topic.enable=true
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10
log.dir=/kafka/kafka-logs2,/kafka/kafka-logs3 ...............
log.flush.interval.messages=9223372036854775807
log.flush.interval.ms=1000
log.flush.offset.checkpoint.interval.ms=60000
log.flush.scheduler.interval.ms=9223372036854775807
log.flush.start.offset.checkpoint.interval.ms=60000
compression.type=producer
log.roll.jitter.hours=0
log.segment.bytes=1073741824
log.segment.delete.delay.ms=60000
message.max.bytes=1000012
min.insync.replicas=1
num.io.threads=10
num.network.threads=48
num.recovery.threads.per.data.dir=1
num.replica.fetchers=1
offset.metadata.max.bytes=4096
offsets.commit.required.acks=-1
offsets.commit.timeout.ms=5000
offsets.load.buffer.size=5242880
offsets.retention.check.interval.ms=600000
offsets.retention.minutes=10080
offsets.topic.compression.codec=0
offsets.topic.num.partitions=50
offsets.topic.replication.factor=3
offsets.topic.segment.bytes=104857600
queued.max.requests=1000
quota.consumer.default=9223372036854775807
quota.producer.default=9223372036854775807
replica.fetch.min.bytes=1
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.lag.time.max.ms=10000
replica.socket.receive.buffer.bytes=65536
replica.socket.timeout.ms=30000
request.timeout.ms=30000
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
transaction.max.timeout.ms=900000
transaction.state.log.load.buffer.size=5242880
transaction.state.log.min.isr=2
transaction.state.log.num.partitions=50
transaction.state.log.replication.factor=3
transaction.state.log.segment.bytes=104857600
transactional.id.expiration.ms=604800000
unclean.leader.election.enable=false
zookeeper.connection.timeout.ms=600000
zookeeper.max.in.flight.requests=10
zookeeper.session.timeout.ms=600000
zookeeper.set.acl=false
broker.id.generation.enable=true
connections.max.idle.ms=600000
connections.max.reauth.ms=0
controlled.shutdown.enable=true
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000
controller.socket.timeout.ms=30000
default.replication.factor=3
delegation.token.expiry.time.ms=86400000
delegation.token.max.lifetime.ms=604800000
delete.records.purgatory.purge.interval.requests=1
fetch.purgatory.purge.interval.requests=1000
group.initial.rebalance.delay.ms=3000
group.max.session.timeout.ms=1800000
group.max.size=2147483647
group.min.session.timeout.ms=6000
log.cleaner.backoff.ms=15000
log.cleaner.dedupe.buffer.size=134217728
log.cleaner.delete.retention.ms=86400000
log.cleaner.enable=true
log.cleaner.io.buffer.load.factor=0.9
log.cleaner.io.buffer.size=524288
log.cleaner.io.max.bytes.per.second=1.7976931348623157e308
log.cleaner.max.compaction.lag.ms=9223372036854775807
log.cleaner.min.cleanable.ratio=0.5
log.cleaner.min.compaction.lag.ms=0
log.cleaner.threads=1
log.cleanup.policy=delete
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.message.timestamp.difference.max.ms=9223372036854775807
log.message.timestamp.type=CreateTime
log.preallocate=false
log.retention.check.interval.ms=300000
max.connections=2147483647
max.connections.per.ip=2147483647
max.incremental.fetch.session.cache.slots=1000
num.partitions=1
producer.purgatory.purge.interval.requests=1000
queued.max.request.bytes=-1
replica.fetch.backoff.ms=1000
replica.fetch.max.bytes=1048576
replica.fetch.response.max.bytes=10485760
reserved.broker.max.id=1500
transaction.abort.timed.out.transaction.cleanup.interval.ms=60000
transaction.remove.expired.transaction.cleanup.interval.ms=3600000
zookeeper.sync.time.ms=2000
broker.rack=/default-rack
```

I want to add a personal note, just to show the absurdity of the above scenario: *let's say we have 100 disks (in JBOD) on each Kafka broker; does it make sense that the Kafka broker will shut down just because one disk failed?*
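A hedged aside, not a verified fix for this exact cluster: Kafka's multi-directory setting for JBOD is `log.dirs` (plural, comma-separated); `log.dir` is the single-directory fallback. Since KIP-112 (Kafka 1.0+), a broker is designed to take an individual log directory offline on I/O errors and keep serving the remaining directories, shutting down only when all log dirs have failed; whether a disk that is merely 100% full is survivable can still depend on the version and on when the error hits (here it occurred during startup log loading). A minimal sketch of the multi-directory form, using the mount points from the `df -h` output above:

```
# server.properties (sketch): list every JBOD mount point under log.dirs
log.dirs=/kafka/kafka_logs2,/kafka/kafka_logs3,/kafka/kafka_logs4,/kafka/kafka_logs5,/kafka/kafka-logs6,/kafka/kafka_logs7,/kafka/kafka_logs8
```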
Labels:
- Apache Kafka
11-16-2022 09:28 AM
We have 3 Kafka machines in our Hadoop cluster. The Kafka machines include the following services:

- on kafka01: Confluent Kafka service, Confluent Schema Registry service, ZooKeeper service
- on kafka02: Confluent Kafka service, Confluent Schema Registry service, ZooKeeper service
- on kafka03: Confluent Kafka service, Confluent Schema Registry service, ZooKeeper service

On our Kafka cluster we have 34 different topics. One of the topics is named `car.to.go`; this topic has 3 replicas in the Kafka cluster. What we found regarding the Schema Registry service is that the topic `car.to.go` has different versions on kafka03 than on the other machines, `kafka01/02`, so the versions are not in sync. Here is an example:

```
[root@kafka01 ~]# curl -X GET http://kafka01:8081/subjects/car.to.go-value/versions
[1,2,3,4,5,6,7]
[root@kafka01 ~]# curl -X GET http://kafka02:8081/subjects/car.to.go-value/versions
[1,2,3,4,5,6,7]
[root@kafka01 ~]# curl -X GET http://kafka03:8081/subjects/car.to.go-value/versions
[1,2,3,4]
```

From the above info, what could be the reason that the versions on kafka03 differ from kafka01/02 for the topic `car.to.go`, and what is the right way to fix this issue so that kafka03 also reports versions `1,2,3,4,5,6,7`?

Notes: all Kafka services, Schema Registry services, and ZooKeeper services are up and running.

Links:
https://github.com/confluentinc/schema-registry/blob/master/README.md
https://docs.confluent.io/platform/current/schema-registry/develop/using.html#check-if-a-schema-is-registered-under-subject-kafka-key
https://kafkawize.com/2019/03/17/commands-for-schemaregistry/
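For context, a hedged diagnostic idea: Schema Registry instances persist schemas in a shared Kafka topic (by default `_schemas`, configurable via `kafkastore.topic`), so all instances reading the same topic should converge to the same versions. One way to inspect that backing topic and compare it with what kafka03 reports (the broker address is a placeholder):

```
# Dump the Schema Registry backing topic from the beginning
kafka-console-consumer --bootstrap-server kafka01:9092 \
  --topic _schemas --from-beginning
```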
Labels:
- Apache Hadoop
10-28-2022 02:24 AM
We have an Ambari Hadoop cluster based on the Ambari platform, installed with HDP version 2.6.5; all machines in the cluster run RHEL 7.9. The Ambari cluster of course includes the YARN service with two ResourceManager services. We are facing a problem (when the master1 and master2 nodes are running the ResourceManager services):

```
Connection failed to http://master2.start.com:8088 (timed out)
```

We tested the above alert with the following `wget` approach. When the alert appears in Ambari, the `wget` test hangs, and sometimes it takes a while until `wget` finishes with results:

```
[root@master2 yarn]# wget http://master2.start.com:8088
--2022-10-28 08:12:49--  http://master2.start.com:8088/
Resolving master2.start.com (master2.start.com)... 172.3.45.68
Connecting to master2.start.com (master2.start.com)|172.3.45.68|:8088... connected.
HTTP request sent, awaiting response... 307 TEMPORARY_REDIRECT
Location: http://master1.start.com:8088/ [following]
--2022-10-28 08:12:50--  http://master1.start.com:8088/
Resolving master1.start.com (master1.start.com)... 172.3.45.61
Connecting to master1.start.com (master01.start.com)|172.3.45.61|:8088... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://master1.start.com:8088/cluster [following]
--2022-10-28 08:12:50--  http://master1.start.com:8088/cluster
Reusing existing connection to master1.start.com:8088.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html.35’
[ <=> ] 5,419,141   9.46MB/s   in 0.5s
2022-10-28 08:12:52 (9.46 MB/s) - ‘index.html.35’ saved [5419141]
```

Port 8088 is listening on both nodes:

```
ps -ef | grep `lsof -i :8088 | grep -i listen | awk '{print $2}'`
yarn  1977  1  16 Oct27 ?  02:37:32 /usr/jdk64/jdk1.8.0_112/bin/java -Dproc_resourcemanager
```

We also checked with jps:

```
jps -l | grep -i resourcemanager
1977 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
```

We also verified the ResourceManager logs, and we see the following:

```
2022-10-27 08:04:30,071 WARN webapp.GenericExceptionHandler (GenericExceptionHandler.java:toResponse(98)) - INTERNAL_SERVER_ERROR
java.lang.NullPointerException
        at org.apache.hadoop.yarn.api.records.ContainerId.toString(ContainerId.java:196)
        at org.apache.hadoop.yarn.util.ConverterUtils.toString(ConverterUtils.java:165)
        at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppInfo.<init>(AppInfo.java:169)
        at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:603)
        at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
2022-10-27 08:04:32,056 WARN webapp.GenericExceptionHandler (GenericExceptionHandler.java:toResponse(98)) - INTERNAL_SERVER_ERROR
java.lang.NullPointerException
        at org.apache.hadoop.yarn.api.records.ContainerId.toString(ContainerId.java:196)
        at org.apache.hadoop.yarn.util.ConverterUtils.toString(ConverterUtils.java:165)
        at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppInfo.<init>(AppInfo.java:169)
        at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:603)
        at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
        at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResource
2022-10-27 08:05:43,170 ERROR recovery.RMStateStore (RMStateStore.java:notifyStoreOperationFailedInternal(992)) - State store operation failed
org.apache.hadoop.yarn.server.resourcemanager.recovery.StoreFencedException: RMStateStore has been fenced
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1213)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:1042)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:639)
2022-10-27 08:05:49,584 ERROR delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:run(659)) - ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2022-10-28 08:43:31,259 ERROR metrics.SystemMetricsPublisher (SystemMetricsPublisher.java:putEntity(549)) - Error when publishing entity [YARN_APPLICATION,application_1664925617878_1896]
com.sun.jersey.api.client.ClientHandlerException: java.io.IOException: Stream closed.
        at com.sun.jersey.api.client.ClientResponse.bufferEntity(ClientResponse.java:583)
        at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.doPostingObject(TimelineWriter.java:157)
        at org.apache.hadoop.yarn.client.api.impl.TimelineWriter$1.run(TimelineWriter.java:115)
        at org.apache.hadoop.yarn.client.api.impl.TimelineWriter$1.run(TimelineWriter.java:112)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.doPosting(TimelineWriter.java:112)
        at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.putEntities(TimelineWriter.java:92)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:348)
        at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.putEntity(SystemMetricsPublisher.java:536)
        at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.publishApplicationACLsUpdatedEvent(SystemMetricsPublisher.java:392)
        at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.handleSystemMetricsEvent(SystemMetricsPublisher.java:257)
        at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:564)
        at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:559)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Stream closed.
        at java.net.AbstractPlainSocketImpl.available(AbstractPlainSocketImpl.java:470)
        at java.net.SocketInputStream.available(SocketInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
        at sun.net.www.http.ChunkedInputStream.readAheadBlocking(ChunkedInputStream.java:552)
        at sun.net.www.http.ChunkedInputStream.readAhead(ChunkedInputStream.java:609)
```

It is still not clear from the above logs why the ResourceManagers raise the alert `Connection failed to http://master2.start.com:8088 (timed out)`.
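As a hedged check (the ResourceManager IDs depend on your yarn-site.xml; rm1/rm2 below are common defaults, not confirmed for this cluster), you can query the HA state of each ResourceManager from the CLI; the 307 redirect in the `wget` output above suggests master2 was in standby and forwarding to the active RM on master1:

```
# Query the HA state of each ResourceManager (IDs are placeholders)
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
```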
Labels:
- Apache Hadoop