Member since
07-19-2017
53
Posts
3
Kudos Received
3
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 941 | 08-23-2019 06:51 AM
 | 1713 | 08-23-2019 06:45 AM
 | 1562 | 08-20-2019 02:06 PM
11-20-2019
12:57 PM
1 Kudo
This issue would really require further debugging. For whatever reason, at that particular time something happened with the user ID resolution. We've seen customers before who had similar issues when tools like SSSD are being used: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sssd-system-uids One idea here is to create a shell script that runs the commands 'id ptz0srv0z50' and 'id -Gn ptz0srv0z50' in a loop at some interval, say 10, 20, or 30 seconds. When the problem occurs, go over the output of that shell script and see if you notice anything different in the output at the time of the issue.
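A rough sketch of what such a script could look like, assuming a 30 second interval (the log file path is just a placeholder):

```
#!/bin/bash
# Minimal sketch: poll user/group resolution every 30 seconds and log it
# so the output can be compared against the time the failure occurs.
# The user name comes from the post; the log path is a placeholder.
USER_TO_CHECK="ptz0srv0z50"
LOG_FILE="/var/tmp/id_check.log"

while true; do
    {
        date '+%Y-%m-%d %H:%M:%S'
        id "$USER_TO_CHECK"
        id -Gn "$USER_TO_CHECK"
    } >> "$LOG_FILE" 2>&1
    sleep 30
done
```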
... View more
11-20-2019
12:42 PM
@paleerbccm It's still the same issue, but the log you're sharing doesn't show the details we would need. Your problem is that when the ApplicationMaster container attempts to launch on a particular host machine as the user ptzs0srv0z50, the container launch fails because that user ID doesn't exist on the machine. What you need to do is identify where the ApplicationMaster attempted to run. You can do this from the Resource Manager's WebUI. Please refer to the following screenshot for an example and note the highlighted red box: You'll see in this example that the first ApplicationMaster attempt was on the host machine worker.example.com. You would then need to SSH into that host machine and run the following command to see whether this user actually exists: id ptzs0srv0z50
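If you'd rather check more than just the one host, a hedged sketch along these lines could run the same check across all of your NodeManager machines (the hostnames below are placeholders, not from your cluster):

```
# Sketch only: verify the user resolves on every NodeManager host.
# Replace the host list with your actual NodeManager hostnames.
for host in worker1.example.com worker2.example.com worker3.example.com; do
    echo "== $host =="
    ssh "$host" 'id ptzs0srv0z50 || echo "user missing on this host"'
done
```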
... View more
11-12-2019
02:42 PM
Hi @PARTOMIA09 One suggestion off the bat is to consider moving to the G1GC policy instead, given that you have relatively large heap sizes (30 GB for executors and 16 GB for the driver). G1GC was designed to perform better with larger heaps (> 8 GB). Try the following and see if that helps:
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
--conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
For background, see "The Garbage-First Garbage Collector": https://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html
... View more
10-02-2019
06:05 AM
Hi @ravikiran_sharm I'm sorry you had to experience this mishap. I'll reach out to my managers to locate someone internally who can help you out with this. Sorry again for the inconvenience!
... View more
10-02-2019
05:49 AM
Hi @bene It looks okay to me at first glance. In fact, if there was something wrong with the privilege then I wouldn't expect it to be created in the first place. What version of CDH and Cloudera Kafka do you have running in your environment?
... View more
09-30-2019
08:50 PM
Hi @anbazhagan_muth You don't need to worry about those two configurations unless you're using Kafka MirrorMaker:
Destination Broker List (bootstrap.servers)
Source Broker List (source.bootstrap.servers)
Kafka MirrorMaker is used to replicate data from one Kafka service to another. With that said, the configurations should be self-explanatory: the source broker list (source.bootstrap.servers) is the list of brokers in the source Kafka service that MirrorMaker reads data from, and the destination broker list (bootstrap.servers) is the list of brokers in the destination Kafka service that MirrorMaker writes data to. Each is a comma-separated list in the format:
BROKER1_HOSTNAME:PORT_NUMBER,BROKER2_HOSTNAME:PORT_NUMBER
PORT_NUMBER is going to be either 9092 for PLAINTEXT or SASL_PLAINTEXT, or 9093 for SSL or SASL_SSL.
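Purely as an illustration of the format (all hostnames below are placeholders, not real brokers):

```
# Illustration only: placeholder hostnames showing the comma-separated format.
# Source Broker List (source.bootstrap.servers):
SOURCE_BROKERS="src-broker1.example.com:9092,src-broker2.example.com:9092"
# Destination Broker List (bootstrap.servers), 9093 here assuming SSL/SASL_SSL:
DEST_BROKERS="dst-broker1.example.com:9093,dst-broker2.example.com:9093"
```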
... View more
09-30-2019
08:40 PM
1 Kudo
Hi @paleerbccm The issue here is that this user ID doesn't exist on one of your YARN NodeManager machines, which probably also explains the randomness of the issue: as long as no container ends up allocated on that machine, you won't run into any problems. You need to find where the containers fail to launch with that exception, then SSH into the host machine and confirm the problem by running: id pyqy0srv0z50 You will then need to create this user on that machine and make sure that the user's group membership matches what you have on all your other hosts.
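A rough sketch of what fixing the host could look like, assuming the accounts are managed locally rather than through something like SSSD/LDAP (the UID and group names below are placeholders you would copy from a healthy host):

```
# Sketch only: UID and groups are placeholders; copy the real values
# from the output of 'id pyqy0srv0z50' on a host where the user exists.
id pyqy0srv0z50 \
  || useradd --uid 12345 --gid hadoop --groups yarn,hdfs pyqy0srv0z50
# Then confirm the group membership matches the other hosts:
id -Gn pyqy0srv0z50
```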
... View more
08-23-2019
06:51 AM
@paleerbccm Briefly looking at the message, I would assume 'error_code=0' actually means that no errors occurred. It would take quite a bit of digging in the code to be certain, but generally speaking, I wouldn't worry too much about TRACE level logs. Ideally, and especially since this is a production environment, you would normally set the logging level to INFO and that's about all you would need. Unless you have an intimate knowledge of the code and you're chasing a specific issue, it's rare that you would ever need TRACE level logs.
... View more
08-23-2019
06:45 AM
Hi @iamabug It's a known limitation in Kafka where the kafka-topics tool communicates directly with ZooKeeper. When you create a topic, all the tool does is connect to ZooKeeper, create a znode representing the topic, and then set some data as a JSON string (the metadata for the topic). There has been work to develop a Java AdminClient API, which has made some progress: https://cwiki.apache.org/confluence/display/KAFKA/KIP-117%3A+Add+a+public+AdminClient+API+for+Kafka+admin+operations#KIP-117:AddapublicAdminClientAPIforKafkaadminoperations-FutureWork However, what's left is to have command line tools that leverage those Java APIs, and that's a work in progress: https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations
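You can actually see that behaviour from the command line. As a hedged sketch (topic name and ZooKeeper quorum are placeholders, and exact command names vary slightly between distributions), creating a topic and then reading its metadata znode directly:

```
# Sketch only: placeholder topic name and ZooKeeper quorum.
kafka-topics --zookeeper zk1.example.com:2181 --create \
  --topic test-topic --partitions 3 --replication-factor 2

# The topic ends up as a znode holding JSON metadata (partition -> replica assignment):
zookeeper-client -server zk1.example.com:2181 get /brokers/topics/test-topic
```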
... View more
08-23-2019
06:29 AM
Hi @HarpreetSingh31 It's not clear to me what the issue is when you say you have problems running the producer and consumer. The outputs you're seeing are normal and everything looks to be functioning as expected. I'm going to assume the problem you're referring to is that you can't seem to read the messages you're sending to the topic. Based on the information you posted, where you highlighted that you're only running one Kafka broker, I believe the problem is that you need to change the Kafka configuration offsets.topic.replication.factor and set it to 1. In Cloudera Manager we have always set that to 3 by default, and I have filed an improvement internally to ensure that we do not set it to 3 when a user installs a new Kafka service with fewer brokers than that. If you look at the Kafka broker log you'll see an error like the one below or something similar:

Number of alive brokers '1' does not meet the required replication factor '3' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet.

You can make the change from: Cloudera Manager > Kafka > Configuration > Search for 'offsets.topic.replication.factor'

After you change this value you will need to restart your Kafka service. Be aware that once you set this to 1, your internal __consumer_offsets topic (used by consumers to commit their offsets) will be created with a replication factor of 1, and this won't change even as you add more brokers to your cluster. If in the future you add more brokers, you will have to expand the replication factor for this topic using the kafka-reassign-partitions tool: https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools Hope this helps!
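As a rough sketch of what that later expansion could look like (broker IDs and the ZooKeeper quorum are placeholders, and only two partitions are shown for brevity; __consumer_offsets typically has 50 partitions by default, so the real reassignment file would list all of them):

```
# Sketch only: expand __consumer_offsets from 1 replica to 3.
# Broker IDs and ZooKeeper quorum are placeholders; list every partition in the real file.
cat > /tmp/increase-rf.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "__consumer_offsets", "partition": 0, "replicas": [1, 2, 3]},
    {"topic": "__consumer_offsets", "partition": 1, "replicas": [2, 3, 1]}
  ]
}
EOF

kafka-reassign-partitions --zookeeper zk1.example.com:2181 \
  --reassignment-json-file /tmp/increase-rf.json --execute
```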
... View more
08-21-2019
06:27 AM
@nicolas_gernigo I'm kind of curious about this issue. Did you manage to figure this one out? I tried to look through the Kafka producer code a little bit to get a better understanding of what might have happened, but it needs a bit more digging. So far what I have found is that a transaction by the Kafka producer maintains different states throughout its lifetime, starting with an UNINITIALIZED state. However, a starting state can never be FATAL_ERROR, so the attempt by the Kafka producer to initialize this transaction and change its state from FATAL_ERROR to INITIALIZING is an invalid one. From my understanding, this all seems to imply that another issue happened beforehand, causing this particular transaction to fail in the first place. I don't see why the Kafka producer would stop sending any messages altogether after this failure; that seems to be an issue, possibly in the Kafka client. Could you try enabling the 'Use Transactions' property for your PublishKafkaRecord processor, but at the same time also be sure to change the topic name to a new one? Be sure that this topic is newly created and hasn't been written to previously. Please do let me know whether you experience the same issue once you make that change.
... View more
08-21-2019
06:06 AM
Thanks for confirming. Yes, I see that this is in fact possible in CDH 5 as well when I did a quick check. Just remember that starting with CDH 6.0, both Spark 2.x and Kafka are bundled in CDH, so the only way you would have two different versions is by running two different CDH versions from Cloudera Manager.
... View more
08-21-2019
05:59 AM
Hi @iamabug Yes, you can definitely have two different clusters managed by the same Cloudera Manager instance. In turn, you can have different CDH versions for the two clusters; however, my initial thought was that you wouldn't be able to activate two different Kafka parcel versions, for example CDK 3.0 on Cluster A and CDK 2.2 on Cluster B. But I need to double check this because I have some doubts about whether that is in fact the case in CDH 5.x. In CDH 6.x you're bound to a specific version of Kafka depending on your CDH version, since Kafka is no longer released as a separate parcel, so my response was geared more towards C6 rather than C5. Do you have two different versions of Kafka running on cluster 1 and cluster 2?
... View more
08-20-2019
08:04 PM
Two different clusters managed by two different Cloudera Managers.
... View more
08-20-2019
03:34 PM
Hi @raghu_nt Thanks for posting this. It might help if you can give the community a bit more information on what the problem is. You say that you're unable to create more than 100 topics using the ProduceKafka processor in NiFi; is there a specific issue that you experience once you go beyond that limit? Are you seeing any errors?
... View more
08-20-2019
03:28 PM
1 Kudo
Hi @Prav You're right, it's pretty generic, but this usually occurs when your containers are killed due to memory issues. This can either be a java.lang.OutOfMemoryError thrown by the executor running in that container, or possibly the container's JVM process growing beyond its physical memory limits. Meaning, if your application was configured with 1 GB of executor memory (spark.executor.memory) and 1 GB of executor memory overhead (spark.executor.memoryOverhead), then the container size requested here would be 2 GB. If the process' memory goes beyond 2 GB then YARN is going to kill that process. Really, the best way of identifying the issue is by collecting the YARN logs for your application and going through them: yarn logs -applicationId application_1564435356568_349499 You would just run that from your edge node or NodeManager machines (assuming you're running Spark on YARN).
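As a hedged sketch of how you might narrow the logs down once you've collected them (the output file path is a placeholder and the grep patterns are just common memory-related messages, not guaranteed to match your case):

```
# Sketch only: pull the aggregated logs and look for common memory-related messages.
yarn logs -applicationId application_1564435356568_349499 > /tmp/app_logs.txt

grep -iE "OutOfMemoryError|Killing container|beyond physical memory|beyond virtual memory" \
  /tmp/app_logs.txt
```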
... View more
08-20-2019
02:18 PM
Hi @sauravsuman689 A common issue that people have when using the kafka-consumer-groups command line tool is that they do not set it up to communicate over Kerberos like any other Kafka client (i.e. consumers and producers). The security.protocol output you shared based on the cat command doesn't look right:

cat /tmp/grouprop.properties
security.protocol=PLAINTEXTSASL

This should instead be:

security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka

You can use the same instructions outlined in the following link starting with step number 5: https://www.cloudera.com/documentation/kafka/latest/topics/kafka_security.html#concept_lcn_4mm_s5 I understand you're using HDP but it should be pretty much the same steps. You will of course just use the same command line tool you're already using as opposed to the consumer command mentioned in the link:

[kafka@XXX ~]$ /usr/hdp/current/kafka-broker/bin/kafka-consumer-groups.sh --bootstrap-server xxxx:6667,xxxx:6667,xxxx:6667 --list --command-config /tmp/grouprop.properties

EDIT: It seems HDP works a bit differently, so your security.protocol=PLAINTEXTSASL setting aligns with what the HDP platform expects.
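For reference, a hedged sketch of the general Kerberos client setup for these CLI tools (the keytab path, principal, and broker hostname are placeholders; as noted in the EDIT above, the exact security.protocol string depends on the platform):

```
# Sketch only: keytab path, principal, and broker hostname are placeholders.
cat > /tmp/jaas.conf <<'EOF'
KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="/etc/security/keytabs/kafka_client.keytab"
    principal="kafkaclient@EXAMPLE.COM";
};
EOF

export KAFKA_OPTS="-Djava.security.auth.login.config=/tmp/jaas.conf"

/usr/hdp/current/kafka-broker/bin/kafka-consumer-groups.sh \
  --bootstrap-server broker1.example.com:6667 \
  --list --command-config /tmp/grouprop.properties
```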
... View more
08-20-2019
02:06 PM
Hi @iamabug This is not possible unfortunately. You could only have one parcel for a specific service activated at any given time. You would need to have a separate cluster managed by a different Cloudera Manager installation in order to activate a different Kafka parcel version.
... View more
08-15-2019
08:19 PM
Hi Chittu, Your issue here is that your JVM process is running out of memory, specifically heap space: java.lang.OutOfMemoryError: Java heap space Judging from the output you shared, I believe this is your driver that's running out of memory and so you would need to increase the maximum heap size for the driver. That's done by configuring the spark.driver.memory parameter or by passing the --driver-memory flag to the Spark command being used.
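For example, a minimal sketch (the class name, jar, and the 4g value are placeholders; size the driver heap to your workload):

```
# Sketch only: placeholder class/jar; pick a driver heap size that fits your data.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --driver-memory 4g \
  my-app.jar

# Equivalent via configuration:
#   --conf spark.driver.memory=4g
```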
... View more
08-15-2019
08:09 PM
Hi Allen, You would typically use the NameNode nameservice when you have high availability enabled in HDFS. It's a logical name that represents both your currently active NameNode server and the standby NameNode server. At any point these two servers can switch roles (from active to standby and vice versa), so by using the nameservice the connection between your client and HDFS is handled seamlessly. You shouldn't need to know which one of your servers is the active NameNode, and that's not something you can guarantee anyway. Hope that helps!
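As a hedged sketch (the nameservice name and the nn1/nn2 NameNode IDs below are placeholders for whatever your HA configuration defines):

```
# Sketch only: 'nameservice1' and the nn1/nn2 IDs are placeholders for your HA config.
# Access HDFS through the nameservice; the client resolves the active NameNode for you:
hdfs dfs -ls hdfs://nameservice1/user

# If you're curious which NameNode is currently active:
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
```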
... View more
12-05-2018
07:58 AM
Just to be clear, you're only deleting data for the specific partitions that are impacted and not everything under the broker's data directory. I just wasn't sure what you meant by rm -rf here so wanted to clarify. Good luck, and please do let us know of the outcome.
... View more
12-05-2018
07:26 AM
It's a fairly new issue that I personally haven't seen before with any of the current customers running the Cloudera Distribution of Kafka, but the latest versions released (Cloudera Distribution of Kafka 3.1.1 and Kafka in CDH 6.0) are based on Apache Kafka 1.0.1. The plan for CDH 6.1 is to rebase Cloudera Kafka onto Apache Kafka 2.0, so it's probably just a matter of time until this becomes a more common issue. You mentioned that restarting the Kafka service causes the problematic partitions to change. Is that also the case when you only shut down a single broker and start it up again? I'm asking because one potential way to work around this is to identify which broker is lagging behind and not joining the ISR, shut down that broker, delete the topic partition data (for the affected partitions) from disk, and then start the broker up again. The broker will start and self heal by replicating all the data from the current leader of those partitions. Obviously this can take a long time depending on how many partitions are affected and how much data needs to be replicated.
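Roughly, that workaround could look like the sketch below (the data directory and topic-partition names are placeholders; only remove the specific affected partition directories, and only after the broker is fully stopped):

```
# Sketch only: placeholder paths and topic-partition names.
# 1. Stop the affected broker (e.g. via Cloudera Manager).
# 2. Remove only the affected partition directories from that broker's data dir:
rm -rf /var/local/kafka/data/my-topic-3 /var/local/kafka/data/my-topic-7
# 3. Start the broker again; it will re-replicate those partitions from the current leaders.
```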
... View more
11-29-2018
05:56 PM
Hi desind, I'm not sure there's a way to force it to sync here. From what you're describing and the error you shared, I think what's happening is that the replica fetcher thread fails and the broker stops replicating data from the leader. That would explain why you see the broker out of sync for a long time. Are you using Cloudera's distribution of Kafka or is this Apache Kafka? What version are you using? I see that someone reported a similar issue very recently: https://issues.apache.org/jira/browse/KAFKA-7635
... View more
07-19-2018
01:55 PM
Hello ChelouteMS, When I look at the chart and compare it with what you have described, where you noticed the scheduling delay increase once the number of devices publishing data to Kafka topics increased, I don't see anything alarming. It's a classic issue where you begin to have batches pile up and scheduling delays increase as you increase the amount of data that needs to be processed. The odd thing I see is that you seem to have 51 containers allocated even though you mentioned that you specifically asked for 12 executors with 1 core each. I would then expect your application to have 13 containers in total (12 executors + 1 for the ApplicationMaster). How are you specifying the number of executors? Do you have dynamic allocation enabled? If so, I would try disabling that: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/spark_streaming.html#section_nhw_jpp_45 The next place to look is the output of your spark2-submit command (the driver output, since you're running in client mode) and the ApplicationMaster's log to confirm whether the application did in fact ask for that many containers.
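For example, a hedged sketch of pinning the executor count explicitly with dynamic allocation disabled (the class name, jar, and memory size are placeholders):

```
# Sketch only: placeholder class/jar and memory size.
spark2-submit \
  --class com.example.StreamingApp \
  --master yarn \
  --deploy-mode client \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 12 \
  --executor-cores 1 \
  --executor-memory 4g \
  my-streaming-app.jar
```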
... View more