Member since: 08-08-2017

1652 Posts
30 Kudos Received
11 Solutions

        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1910 | 06-15-2020 05:23 AM |
|  | 15411 | 01-30-2020 08:04 PM |
|  | 2045 | 07-07-2019 09:06 PM |
|  | 8090 | 01-27-2018 10:17 PM |
|  | 4554 | 12-31-2017 10:12 PM |

02-22-2023 08:30 AM

For now we have 15 Kafka machines in the cluster. All machines run RHEL 7.9 on physical DELL hardware, the Kafka version is 2.7, and we have 3 ZooKeeper servers that serve the Kafka cluster. We decided to extend the Kafka cluster to ~100 machines because the total throughput in megabytes has increased dramatically (note: according to the Confluent Kafka calculator we need around 100 Kafka machines). In that case, I am wondering whether our 3 ZooKeeper servers are enough to serve such a huge cluster. In addition, I want to say that our 3 ZooKeeper servers already serve other applications such as HDFS, YARN, Hive, Spark, etc.
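
For context, a minimal sketch (zk01-zk03 are placeholder hostnames, and the `mntr` four-letter command must be allowed on the ensemble) for baselining the load on the existing ZooKeeper servers before growing the Kafka cluster:

# Print latency, outstanding-request and connection counters from each
# ZooKeeper server; zk01..zk03 stand in for the real ensemble members.
for zk in zk01 zk02 zk03; do
  echo "=== $zk ==="
  echo mntr | nc "$zk" 2181 | \
    grep -E 'zk_avg_latency|zk_max_latency|zk_outstanding_requests|zk_num_alive_connections|zk_znode_count'
done

These counters give a baseline to compare against after the expansion.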
						
					
Labels: Apache Zookeeper

01-24-2023 09:23 AM

We have an HDP cluster, version 2.6.5, with the Ambari platform. Here is an example from our Ambari lab cluster with 5 node manager machines. Regarding the YARN service: is it possible to add a widget in Ambari that shows CPU core consumption? If not, what are the other ways to find the cores consumed by YARN from the CLI? Another way we found is the ResourceManager UI at `resource_manager:8088/cluster`. So is it possible to find some API / CLI that can capture the "VCores Used" value?
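
If it helps, a minimal sketch of pulling the same counters from the ResourceManager REST API (the hostname is a placeholder, and `python -m json.tool` is used only for pretty-printing):

# Cluster-wide VCore counters from the ResourceManager REST API.
curl -s "http://resource_manager:8088/ws/v1/cluster/metrics" | python -m json.tool

# The response includes fields such as allocatedVirtualCores,
# availableVirtualCores and totalVirtualCores, which correspond to the
# "VCores Used" / "VCores Total" columns in the ResourceManager UI.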
						
					
Labels: Ambari Blueprints

01-24-2023 08:41 AM

We have a Spark production cluster with the YARN service (based on HDP 2.6.5). The total number of node manager services is 745 (actually 745 Linux machines), and the active and standby YARN ResourceManagers are installed on different master machines. We found that the following parameters are not defined in our YARN configuration (yarn-site.xml):

yarn.scheduler.increment-allocation-vcores
yarn.scheduler.increment-allocation-mb

These parameters are defined neither in Ambari nor in the YARN XML configuration files. I want to know the meaning of the parameter yarn.scheduler.increment-allocation-vcores, and what the effect is if these parameters are not defined in our configuration. From the YARN best-practice documentation we understand that both parameters are part of the YARN configuration, but we are not sure whether we must add them to the YARN custom configuration.

From the documentation we found:

Minimum and maximum allocation unit in YARN: two resources, memory and CPU, have minimum and maximum allocation units in YARN as of Hadoop 2.5.1, set by the configurations in yarn-site.xml. Basically, it means the RM can only allocate memory to containers in increments of "yarn.scheduler.minimum-allocation-mb" and not exceed "yarn.scheduler.maximum-allocation-mb", and it can only allocate CPU vcores to containers in increments of "yarn.scheduler.minimum-allocation-vcores" and not exceed "yarn.scheduler.maximum-allocation-vcores". If changes are required, set the above configurations in yarn-site.xml on the RM nodes and restart the RM.

References:
https://docs.trifacta.com/display/r076/Tune+Cluster+Performance
https://stackoverflow.com/questions/58522138/how-to-control-yarn-container-allocation-increment-properly
https://pratikbarjatya.github.io/learning/best-practices-for-yarn-resource-management/
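
As a quick check on our side, a minimal sketch (the path is the usual HDP location and may differ; the numbers in the comments are illustrative assumptions, not values from our cluster):

# On the active ResourceManager node, check whether any of the allocation
# keys are present in the effective yarn-site.xml.
grep -E 'increment-allocation|minimum-allocation|maximum-allocation' \
  /etc/hadoop/conf/yarn-site.xml

# Illustration of the rounding the documentation describes: with
# yarn.scheduler.minimum-allocation-mb=1024, a container request of 2500 MB
# is normalized up to 3072 MB, i.e. the next multiple of the increment
# (when no explicit increment is configured, the minimum allocation
# effectively acts as the rounding step).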
						
					
Labels: Ambari Blueprints

01-12-2023 12:09 AM

I also want to say that restarting the node manager, or fully restarting the YARN service, fixed the problem. But as you know, that isn't the right solution to have to apply every time one of the node managers dies.
						
					
01-11-2023 11:36 PM

Dear @Shelton, it has been a long time since we last met; glad to see you again. Back to my question: since we are talking about the node manager, my goal is to avoid cases where the node-manager service dies or falls out of sync with the ResourceManager. Please forgive me, but I do not understand why you are talking about the data node and excluding the data node from the cluster, because the question is on a different subject. As I mentioned, we want to understand the root cause of a lost node manager and what proactive steps to take in order to avoid such problems. Additionally, as I understand it, most of these problems are the result of a bad network that breaks the connectivity between the node manager and the ResourceManager. So even though this behavior sometimes happens, we are trying to set a configuration that keeps the cluster stable in spite of any networking or infrastructure problems. Let me know if my question is clear so we can continue our discussion, and sorry again if my first post was not clear.
						
					
01-11-2023 08:33 AM

We have a huge production Hadoop cluster with HDP version 2.6.5 and Ambari version 2.6.2.2, and all machines run RHEL 7.6. The cluster size is as follows: 425 worker machines in total (each worker includes a DataNode and a NodeManager service). From time to time we get an indication that one or two **node-manager** services are lost, and this shows up in Ambari as 424/425 when the total number of node managers is 425. To fix it we just restart the **node-manager**, which solves the problem, and as a result we get 425/425 again. After some googling, we found the following parameters that maybe should be tuned better:

yarn.client.nodemanager-connect.max-wait-ms (it is configured to 60000 ms and we are thinking of increasing it)
yarn.client.nodemanager-connect.retry-interval-ms (it is configured to 10 seconds and we are thinking of increasing it)
yarn.nm.liveness-monitor.expiry-interval-ms (this parameter is not configured yet and we are thinking of adding it with a value of 1500000 ms)

Based on the above details, I would appreciate any comments or other ideas.

Background: a NodeManager is LOST when the ResourceManager has not received heartbeats from it for yarn.nm.liveness-monitor.expiry-interval-ms milliseconds (the default is 10 minutes).
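
For reference, a minimal sketch (the ResourceManager hostname is a placeholder) of asking the RM directly which node managers it currently considers LOST:

# List NodeManagers in the LOST state via the ResourceManager REST API.
# If the "states" filter is not accepted on this Hadoop line, drop it and
# filter the full node list client-side instead.
curl -s "http://resource_manager:8088/ws/v1/cluster/nodes?states=LOST" | python -m json.tool

# Each node entry carries nodeHostName and lastHealthUpdate, which helps
# correlate a lost heartbeat with network events on that specific worker.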
						
					
Labels: Ambari Blueprints

12-04-2022 05:58 AM

We want to find an **approach / test / CLI / API** that tells us about a heartbeat lost between the Ambari agent and the Ambari server. A heartbeat lost can be the result of a poor connection between the Ambari agent and the Ambari server, or of the Ambari server being down for a long time, etc.

Note: in the Ambari GUI, a machine in the **Heartbeat Lost** state is usually shown in yellow.

Clarification: the case described here appears while `ambari-agent status` reports the agent as running, as follows:

ambari-agent status
Found ambari-agent PID: 119315
ambari-agent running.
Agent PID at: /run/ambari-agent/ambari-agent.pid
Agent out at: /var/log/ambari-agent/ambari-agent.out
Agent log at: /var/log/ambari-agent/ambari-agent.log
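
One direction, sketched below under assumptions (the Ambari host, the admin:admin credentials and CLUSTER_NAME are placeholders), is the Ambari REST API, which reports the server-side host state independently of the local `ambari-agent status` output:

# List hosts whose Ambari-side state is HEARTBEAT_LOST.
curl -s -u admin:admin \
  "http://ambari-server:8080/api/v1/clusters/CLUSTER_NAME/hosts?fields=Hosts/host_state" \
  | grep -B 3 'HEARTBEAT_LOST'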
						
					
Labels: Ambari Blueprints

11-29-2022 08:07 AM

We have 3 Kafka broker machines on RHEL 7.9 (each machine is strong physical `DELL HW`: 512 GB memory and 96 CPU cores). The Kafka cluster is in production mode. The Kafka version is 2.7.x, and the Kafka disks are in a JBOD configuration; each Kafka broker has 8 JBOD disks, as we can see from the following `df -h` details:

df -h
/dev/sdc 1.7T 929G 748G 56% /kafka/kafka_logs2
/dev/sdd 1.7T 950G 727G 57% /kafka/kafka_logs3
/dev/sde 1.7T 999G 678G 60% /kafka/kafka_logs4
/dev/sdf 1.7T 971G 706G 58% /kafka/kafka_logs5
/dev/sdg 1.7T 1.7T 20K 100% /kafka/kafka-logs6 <-----------------
/dev/sdh 1.7T 962G 714G 58% /kafka/kafka_logs7
/dev/sdi 1.7T 1.1T 621G 63% /kafka/kafka_logs8

As we can see above, the disk `/kafka/kafka-logs6` has reached `100%` used. After a short investigation we found that the Kafka broker is not tolerant when one disk fails or reaches 100%, and as a result the Kafka broker is now down. Here is the Kafka `server.log`:

[2022-11-29 15:43:59,723] ERROR Error while writing to checkpoint file /kafka/kafka-logs6 .............
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at kafka.server.checkpoints.CheckpointFile.liftedTree1$1(CheckpointFile.scala:108)
at kafka.server.checkpoints.CheckpointFile.write(CheckpointFile.scala:92)
at kafka.server.checkpoints.LeaderEpochCheckpointFile.write(LeaderEpochCheckpointFile.scala:70)
at kafka.server.epoch.LeaderEpochFileCache.flush(LeaderEpochFileCache.scala:292)
at kafka.server.epoch.LeaderEpochFileCache.$anonfun$truncateFromEnd$1(LeaderEpochFileCache.scala:238)
at kafka.server.epoch.LeaderEpochFileCache.truncateFromEnd(LeaderEpochFileCache.scala:235)
at kafka.log.Log.$anonfun$new$1(Log.scala:305)
at kafka.log.Log.<init>(Log.scala:305)
at kafka.log.Log$.apply(Log.scala:2549)
at kafka.log.LogManager.loadLog(LogManager.scala:273)
at kafka.log.LogManager.$anonfun$loadLogs$12(LogManager.scala:352)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

**From my perspective**, when we have **8 disks** on each broker and one disk fails (for example, reaches 100%), we expect the Kafka broker to still be alive even though one disk has failed. So, following the above scenario, we searched the Kafka `server.properties` for a parameter that could help us configure the Kafka broker to be tolerant when one disk fails, but we did not find it, or maybe we do not know what to set in order to make the Kafka broker tolerant of a single disk failure.

The full parameters are:

more server.properties
auto.create.topics.enable=false
auto.leader.rebalance.enable=true
background.threads=10
log.retention.bytes=-1
log.retention.hours=48
delete.topic.enable=true
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10
log.dir=/kafka/kafka-logs2,/kafka/kafka-logs3 ...............
log.flush.interval.messages=9223372036854775807
log.flush.interval.ms=1000
log.flush.offset.checkpoint.interval.ms=60000
log.flush.scheduler.interval.ms=9223372036854775807
log.flush.start.offset.checkpoint.interval.ms=60000
compression.type=producer
log.roll.jitter.hours=0
log.segment.bytes=1073741824
log.segment.delete.delay.ms=60000
message.max.bytes=1000012
min.insync.replicas=1
num.io.threads=10
num.network.threads=48
num.recovery.threads.per.data.dir=1
num.replica.fetchers=1
offset.metadata.max.bytes=4096
offsets.commit.required.acks=-1
offsets.commit.timeout.ms=5000
offsets.load.buffer.size=5242880
offsets.retention.check.interval.ms=600000
offsets.retention.minutes=10080
offsets.topic.compression.codec=0
offsets.topic.num.partitions=50
offsets.topic.replication.factor=3
offsets.topic.segment.bytes=104857600
queued.max.requests=1000
quota.consumer.default=9223372036854775807
quota.producer.default=9223372036854775807
replica.fetch.min.bytes=1
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.lag.time.max.ms=10000
replica.socket.receive.buffer.bytes=65536
replica.socket.timeout.ms=30000
request.timeout.ms=30000
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
transaction.max.timeout.ms=900000
transaction.state.log.load.buffer.size=5242880
transaction.state.log.min.isr=2
transaction.state.log.num.partitions=50
transaction.state.log.replication.factor=3
transaction.state.log.segment.bytes=104857600
transactional.id.expiration.ms=604800000
unclean.leader.election.enable=false
zookeeper.connection.timeout.ms=600000
zookeeper.max.in.flight.requests=10
zookeeper.session.timeout.ms=600000
zookeeper.set.acl=false
broker.id.generation.enable=true
connections.max.idle.ms=600000
connections.max.reauth.ms=0
controlled.shutdown.enable=true
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000
controller.socket.timeout.ms=30000
default.replication.factor=3
delegation.token.expiry.time.ms=86400000
delegation.token.max.lifetime.ms=604800000
delete.records.purgatory.purge.interval.requests=1
fetch.purgatory.purge.interval.requests=1000
group.initial.rebalance.delay.ms=3000
group.max.session.timeout.ms=1800000
group.max.size=2147483647
group.min.session.timeout.ms=6000
log.cleaner.backoff.ms=15000
log.cleaner.dedupe.buffer.size=134217728
log.cleaner.delete.retention.ms=86400000
log.cleaner.enable=true
log.cleaner.io.buffer.load.factor=0.9
log.cleaner.io.buffer.size=524288
log.cleaner.io.max.bytes.per.second=1.7976931348623157e308
log.cleaner.max.compaction.lag.ms=9223372036854775807
log.cleaner.min.cleanable.ratio=0.5
log.cleaner.min.compaction.lag.ms=0
log.cleaner.threads=1
log.cleanup.policy=delete
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.message.timestamp.difference.max.ms=9223372036854775807
log.message.timestamp.type=CreateTime
log.preallocate=false
log.retention.check.interval.ms=300000
max.connections=2147483647
max.connections.per.ip=2147483647
max.incremental.fetch.session.cache.slots=1000
num.partitions=1
producer.purgatory.purge.interval.requests=1000
queued.max.request.bytes=-1
replica.fetch.backoff.ms=1000
replica.fetch.max.bytes=1048576
replica.fetch.response.max.bytes=10485760
reserved.broker.max.id=1500
transaction.abort.timed.out.transaction.cleanup.interval.ms=60000
transaction.remove.expired.transaction.cleanup.interval.ms=3600000
zookeeper.sync.time.ms=2000
broker.rack=/default-rack

I want to add my personal feeling, just to show the absurdity of the above scenario: *let's say we have 100 disks (in JBOD) on each Kafka broker. Does it make sense that the Kafka broker will shut down just because one disk has failed?*
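
As a side note on checking the broker's own view of its log directories, a minimal sketch using the `kafka-log-dirs` tool that ships with Kafka (the bootstrap address is a placeholder; on some installations the script is named `kafka-log-dirs.sh`):

# Describe all log directories as seen by the brokers; the JSON output
# reports an "error" field and the partition sizes per log dir, which makes
# a full or offline directory visible before the broker itself goes down.
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe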
						
					
Labels: Apache Kafka

11-16-2022 09:28 AM

We have 3 Kafka machines in our Hadoop cluster. The Kafka machines include the following services:

on kafka01: Confluent Kafka service, Confluent Schema Registry service, ZooKeeper service
on kafka02: Confluent Kafka service, Confluent Schema Registry service, ZooKeeper service
on kafka03: Confluent Kafka service, Confluent Schema Registry service, ZooKeeper service

On our Kafka cluster we have 34 different topics. One of the topics is named `car.to.go`; this topic has 3 replicas in the Kafka cluster. What we found regarding the Schema Registry service is that the topic `car.to.go` has different versions on kafka03 than on the other machines, `kafka01/02`, so the versions are actually not in sync. Here is an example:

[root@kafka01 ~]# curl -X GET http://kafka01:8081/subjects/car.to.go-value/versions
[1,2,3,4,5,6,7]
[root@kafka01 ~]# curl -X GET http://kafka02:8081/subjects/car.to.go-value/versions
[1,2,3,4,5,6,7]
[root@kafka01 ~]# curl -X GET http://kafka03:8081/subjects/car.to.go-value/versions
[1,2,3,4]

Given the above, what could be the reason that the versions on kafka03 differ from kafka01/02? And regarding the topic `car.to.go`, what is the right way to fix this issue so that we also get versions `1,2,3,4,5,6,7` on kafka03?

Notes: all Kafka services, Schema Registry services and ZooKeeper services are up and running.

Links:
https://github.com/confluentinc/schema-registry/blob/master/README.md
https://docs.confluent.io/platform/current/schema-registry/develop/using.html#check-if-a-schema-is-registered-under-subject-kafka-key
https://kafkawize.com/2019/03/17/commands-for-schemaregistry/
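
For completeness, a minimal sketch that compares the three registries in one pass and, assuming the default `kafkastore.topic` (`_schemas`) is in use, inspects the backing topic directly:

# Compare the registered versions for the subject on all three nodes.
for h in kafka01 kafka02 kafka03; do
  echo -n "$h: "
  curl -s "http://$h:8081/subjects/car.to.go-value/versions"
  echo
done

# Every Schema Registry instance builds its state from the same Kafka topic,
# so the topic itself can be inspected when the instances disagree.
kafka-console-consumer.sh --bootstrap-server kafka01:9092 \
  --topic _schemas --from-beginning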
						
					
Labels: Apache Hadoop

10-30-2022 05:49 AM

So based on the documentation, it seems we need to increase CMSInitiatingOccupancyFraction from the default 70% to a higher value, for example 85%. Do you agree with that?
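
For reference, a minimal sketch (`<pid>` is a placeholder for the service's Java process id) to confirm what the running JVM is actually using before and after such a change:

# Inspect the flags currently in effect on the running JVM.
jinfo -flag CMSInitiatingOccupancyFraction <pid>
jinfo -flag UseCMSInitiatingOccupancyOnly <pid>

# A new value would typically be passed through the service's JVM options,
# for example: -XX:CMSInitiatingOccupancyFraction=85 -XX:+UseCMSInitiatingOccupancyOnly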
						
					