Member since
07-19-2017
53
Posts
3
Kudos Received
3
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 941 | 08-23-2019 06:51 AM
 | 1713 | 08-23-2019 06:45 AM
 | 1562 | 08-20-2019 02:06 PM
11-20-2019
12:57 PM
1 Kudo
This issue would really require further debugging. For whatever reason, at that particular time something happened with the user ID resolution. We've seen customers before who had similar issues when tools like SSSD are being used: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sssd-system-uids One idea here is to create a shell script that runs the commands 'id ptz0srv0z50' and 'id -Gn ptz0srv0z50' in a loop at some interval, say 10, 20, or 30 seconds. When the problem occurs, go over the output of that shell script and see if you notice anything different in the output at the time of the issue.
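A rough sketch of what such a script could look like, assuming a 30 second interval (the log file path is just a placeholder):

```
#!/bin/bash
# Minimal sketch: poll user/group resolution every 30 seconds and log it
# so the output can be compared against the time the failure occurs.
# The user name comes from the post; the log path is a placeholder.
USER_TO_CHECK="ptz0srv0z50"
LOG_FILE="/var/tmp/id_check.log"

while true; do
    {
        date '+%Y-%m-%d %H:%M:%S'
        id "$USER_TO_CHECK"
        id -Gn "$USER_TO_CHECK"
    } >> "$LOG_FILE" 2>&1
    sleep 30
done
```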
... View more
11-20-2019
12:42 PM
@paleerbccm It's still the same issue, but the log you're sharing doesn't show the details we would need. Your problem is that when the ApplicationMaster container attempts to launch on a particular host machine as the user ptzs0srv0z50, the container launch fails because that user ID doesn't exist on the machine. What you need to do is identify where the ApplicationMaster attempted to run. You can do this from the Resource Manager's WebUI. Please refer to the following screenshot for an example and note the highlighted red box: You'll see in this example that the first ApplicationMaster attempt was on the host machine worker.example.com. You would then need to SSH into that host machine and run the following command to see whether this user actually exists: id ptzs0srv0z50
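If you'd rather check more than just the one host, a hedged sketch along these lines could run the same check across all of your NodeManager machines (the hostnames below are placeholders, not from your cluster):

```
# Sketch only: verify the user resolves on every NodeManager host.
# Replace the host list with your actual NodeManager hostnames.
for host in worker1.example.com worker2.example.com worker3.example.com; do
    echo "== $host =="
    ssh "$host" 'id ptzs0srv0z50 || echo "user missing on this host"'
done
```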
... View more
11-12-2019
02:42 PM
Hi @PARTOMIA09 One suggestion off the bat is to consider moving to the G1GC policy instead, given that you have relatively large heap sizes (30 GB for executors and 16 GB for the driver). G1GC was designed to perform better with larger heaps (> 8 GB). Try the following and see if that helps:
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
--conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
For background, see "The Garbage-First Garbage Collector": https://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html
... View more
10-02-2019
06:05 AM
Hi @ravikiran_sharm I'm sorry you had to experience this mishap. I'll reach out to my managers to locate someone internally who can help you out with this. Sorry again for the inconvenience!
... View more
10-02-2019
05:49 AM
Hi @bene It looks okay to me at first glance. In fact, if there was something wrong with the privilege then I wouldn't expect it to be created in the first place. What version of CDH and Cloudera Kafka do you have running in your environment?
... View more
09-30-2019
08:50 PM
Hi @anbazhagan_muth You don't need to worry about those two configurations unless you're using Kafka MirrorMaker:
Destination Broker List (bootstrap.servers)
Source Broker List (source.bootstrap.servers)
Kafka MirrorMaker is used to replicate data from one Kafka service to another. With that said, the configurations should be self-explanatory: the source broker list (source.bootstrap.servers) is the list of brokers in the source Kafka service that MirrorMaker reads data from, and the destination broker list (bootstrap.servers) is the list of brokers in the destination Kafka service that MirrorMaker writes data to. Each is a comma-separated list in the format:
BROKER1_HOSTNAME:PORT_NUMBER,BROKER2_HOSTNAME:PORT_NUMBER
PORT_NUMBER is going to be either 9092 for PLAINTEXT or SASL_PLAINTEXT, or 9093 for SSL or SASL_SSL.
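Purely as an illustration of the format (all hostnames below are placeholders, not real brokers):

```
# Illustration only: placeholder hostnames showing the comma-separated format.
# Source Broker List (source.bootstrap.servers):
SOURCE_BROKERS="src-broker1.example.com:9092,src-broker2.example.com:9092"
# Destination Broker List (bootstrap.servers), 9093 here assuming SSL/SASL_SSL:
DEST_BROKERS="dst-broker1.example.com:9093,dst-broker2.example.com:9093"
```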
... View more
09-30-2019
08:40 PM
1 Kudo
Hi @paleerbccm The issue here is that this user ID doesn't exist on one of your YARN NodeManager machines, which probably also explains the randomness of the issue: as long as no container ends up allocated on that machine, you won't run into any problems. You need to find where the containers fail to launch with that exception, then SSH into the host machine and confirm the problem by running: id pyqy0srv0z50 You will then need to create this user on that machine and make sure that the user's group membership matches what you have on all your other hosts.
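A rough sketch of what fixing the host could look like, assuming the accounts are managed locally rather than through something like SSSD/LDAP (the UID and group names below are placeholders you would copy from a healthy host):

```
# Sketch only: UID and groups are placeholders; copy the real values
# from the output of 'id pyqy0srv0z50' on a host where the user exists.
id pyqy0srv0z50 \
  || useradd --uid 12345 --gid hadoop --groups yarn,hdfs pyqy0srv0z50
# Then confirm the group membership matches the other hosts:
id -Gn pyqy0srv0z50
```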
... View more
08-23-2019
06:51 AM
@paleerbccm Briefly looking at the message, I would assume 'error_code=0' actually means that no errors occurred. It would take quite a bit of digging in the code to be certain, but generally speaking, I wouldn't worry too much about TRACE level logs. Ideally, and especially since this is a production environment, you would normally set the logging level to INFO and that's about all you would need. Unless you have an intimate knowledge of the code and you're chasing a specific issue, it's rare that you would ever need TRACE level logs.
... View more
08-23-2019
06:45 AM
Hi @iamabug It's a known limitation in Kafka where the kafka-topics tool communicates directly with ZooKeeper. When you create a topic, all the tool does is connect to ZooKeeper, create a znode representing the topic, and then set some data as a JSON string (the metadata for the topic). There has been work to develop a Java AdminClient API, which has made some progress: https://cwiki.apache.org/confluence/display/KAFKA/KIP-117%3A+Add+a+public+AdminClient+API+for+Kafka+admin+operations#KIP-117:AddapublicAdminClientAPIforKafkaadminoperations-FutureWork However, what's left is to have command line tools that leverage those Java APIs, and that's a work in progress: https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations
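You can actually see that behaviour from the command line. As a hedged sketch (topic name and ZooKeeper quorum are placeholders, and exact command names vary slightly between distributions), creating a topic and then reading its metadata znode directly:

```
# Sketch only: placeholder topic name and ZooKeeper quorum.
kafka-topics --zookeeper zk1.example.com:2181 --create \
  --topic test-topic --partitions 3 --replication-factor 2

# The topic ends up as a znode holding JSON metadata (partition -> replica assignment):
zookeeper-client -server zk1.example.com:2181 get /brokers/topics/test-topic
```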
... View more
08-23-2019
06:29 AM
Hi @HarpreetSingh31 It's not clear to me what the issue is when you say you have problems running the producer and consumer. The outputs you're seeing are normal and everything looks to be functioning as expected. I'm going to assume the problem you're referring to is that you can't seem to read the messages you're sending to the topic. Based on the information you posted, where you highlighted that you're only running one Kafka broker, I believe the problem is that you need to change the Kafka configuration offsets.topic.replication.factor and set it to 1. In Cloudera Manager we have always set that to 3 by default, and I have filed an improvement internally to ensure that we do not set it to 3 when a user installs a new Kafka service with fewer brokers than that. If you look at the Kafka broker log you'll see an error like the one below or something similar:

Number of alive brokers '1' does not meet the required replication factor '3' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet.

You can make the change from: Cloudera Manager > Kafka > Configuration > Search for 'offsets.topic.replication.factor'

After you change this value you will need to restart your Kafka service. Be aware that once you set this to 1, your internal __consumer_offsets topic (used by consumers to commit their offsets) will be created with a replication factor of 1, and this won't change even as you add more brokers to your cluster. If in the future you add more brokers, you will have to expand the replication factor for this topic using the kafka-reassign-partitions tool: https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools Hope this helps!
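As a rough sketch of what that later expansion could look like (broker IDs and the ZooKeeper quorum are placeholders, and only two partitions are shown for brevity; __consumer_offsets typically has 50 partitions by default, so the real reassignment file would list all of them):

```
# Sketch only: expand __consumer_offsets from 1 replica to 3.
# Broker IDs and ZooKeeper quorum are placeholders; list every partition in the real file.
cat > /tmp/increase-rf.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "__consumer_offsets", "partition": 0, "replicas": [1, 2, 3]},
    {"topic": "__consumer_offsets", "partition": 1, "replicas": [2, 3, 1]}
  ]
}
EOF

kafka-reassign-partitions --zookeeper zk1.example.com:2181 \
  --reassignment-json-file /tmp/increase-rf.json --execute
```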
... View more
08-21-2019
06:27 AM
@nicolas_gernigo I'm kind of curious about this issue. Did you manage to figure this one out? I tried to look through the Kafka producer code a little bit to get a better understanding of what might have happened, but it needs a bit more digging. So far what I have found is that a transaction by the Kafka producer maintains different states throughout its lifetime, starting with an UNINITIALIZED state. However, a starting state can never be FATAL_ERROR, so the attempt by the Kafka producer to initialize this transaction and change its state from FATAL_ERROR to INITIALIZING is an invalid one. From my understanding, this all seems to imply that another issue happened beforehand, causing this particular transaction to fail in the first place. I don't see why the Kafka producer would stop sending any messages altogether after this failure; that seems to be an issue, possibly in the Kafka client. Could you try enabling the 'Use Transactions' property for your PublishKafkaRecord processor, but at the same time also be sure to change the topic name to a new one? Be sure that this topic is newly created and hasn't been written to previously. Please do let me know whether you experience the same issue once you make that change.
... View more
08-21-2019
06:06 AM
Thanks for confirming. Yes, I see that this is in fact possible in CDH 5 as well when I did a quick check. Just remember that starting with CDH 6.0, both Spark 2.x and Kafka are bundled in CDH, so the only way you would have two different versions is by running two different CDH versions from Cloudera Manager.
... View more
08-21-2019
05:59 AM
Hi @iamabug Yes, you can definitely have two different clusters managed by the same Cloudera Manager instance. In turn, you can have different CDH versions for the two clusters; however, my initial thought was that you wouldn't be able to activate two different Kafka parcel versions, for example CDK 3.0 on Cluster A and CDK 2.2 on Cluster B. But I need to double check this because I have some doubts about whether that is in fact the case in CDH 5.x. In CDH 6.x you're bound to a specific version of Kafka depending on your CDH version, since Kafka is no longer released as a separate parcel, so my response was geared more towards C6 rather than C5. Do you have two different versions of Kafka running on cluster 1 and cluster 2?
... View more
08-20-2019
08:04 PM
Two different clusters managed by two different Cloudera Managers.
... View more
08-20-2019
03:34 PM
Hi @raghu_nt Thanks for posting this. It might help if you can give the community a bit more information on what the problem is. You say that you're unable to create more than 100 topics using the ProduceKafka processor in NiFi; is there a specific issue that you experience once you go beyond that limit? Are you seeing any errors?
... View more
08-20-2019
03:28 PM
1 Kudo
Hi @Prav You're right, it's pretty generic, but this usually occurs when your containers are killed due to memory issues. This can either be a java.lang.OutOfMemoryError thrown by the executor running in that container, or possibly the container's JVM process growing beyond its physical memory limits. Meaning, if your application was configured with 1 GB of executor memory (spark.executor.memory) and 1 GB of executor memory overhead (spark.executor.memoryOverhead), then the container size requested here would be 2 GB. If the process' memory goes beyond 2 GB then YARN is going to kill that process. Really, the best way of identifying the issue is by collecting the YARN logs for your application and going through them: yarn logs -applicationId application_1564435356568_349499 You would just run that from your edge node or NodeManager machines (assuming you're running Spark on YARN).
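As a hedged sketch of how you might narrow the logs down once you've collected them (the output file path is a placeholder and the grep patterns are just common memory-related messages, not guaranteed to match your case):

```
# Sketch only: pull the aggregated logs and look for common memory-related messages.
yarn logs -applicationId application_1564435356568_349499 > /tmp/app_logs.txt

grep -iE "OutOfMemoryError|Killing container|beyond physical memory|beyond virtual memory" \
  /tmp/app_logs.txt
```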
... View more
08-20-2019
02:18 PM
Hi @sauravsuman689 A common issue that people have when using the kafka-consumer-groups command line tool is that they do not set it up to communicate over Kerberos like any other Kafka client (i.e. consumers and producers). The security.protocol output you shared based on the cat command doesn't look right:

cat /tmp/grouprop.properties
security.protocol=PLAINTEXTSASL

This should instead be:

security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka

You can use the same instructions outlined in the following link starting with step number 5: https://www.cloudera.com/documentation/kafka/latest/topics/kafka_security.html#concept_lcn_4mm_s5 I understand you're using HDP but it should be pretty much the same steps. You will of course just use the same command line tool you're already using as opposed to the consumer command mentioned in the link:

[kafka@XXX ~]$ /usr/hdp/current/kafka-broker/bin/kafka-consumer-groups.sh --bootstrap-server xxxx:6667,xxxx:6667,xxxx:6667 --list --command-config /tmp/grouprop.properties

EDIT: It seems HDP works a bit differently, so your security.protocol=PLAINTEXTSASL setting aligns with what the HDP platform expects.
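For reference, a hedged sketch of the general Kerberos client setup for these CLI tools (the keytab path, principal, and broker hostname are placeholders; as noted in the EDIT above, the exact security.protocol string depends on the platform):

```
# Sketch only: keytab path, principal, and broker hostname are placeholders.
cat > /tmp/jaas.conf <<'EOF'
KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="/etc/security/keytabs/kafka_client.keytab"
    principal="kafkaclient@EXAMPLE.COM";
};
EOF

export KAFKA_OPTS="-Djava.security.auth.login.config=/tmp/jaas.conf"

/usr/hdp/current/kafka-broker/bin/kafka-consumer-groups.sh \
  --bootstrap-server broker1.example.com:6667 \
  --list --command-config /tmp/grouprop.properties
```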
... View more
08-20-2019
02:06 PM
Hi @iamabug This is not possible unfortunately. You could only have one parcel for a specific service activated at any given time. You would need to have a separate cluster managed by a different Cloudera Manager installation in order to activate a different Kafka parcel version.
... View more
08-15-2019
08:19 PM
Hi Chittu, Your issue here is that your JVM process is running out of memory, specifically heap space: java.lang.OutOfMemoryError: Java heap space Judging from the output you shared, I believe this is your driver that's running out of memory and so you would need to increase the maximum heap size for the driver. That's done by configuring the spark.driver.memory parameter or by passing the --driver-memory flag to the Spark command being used.
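For example, a minimal sketch (the class name, jar, and the 4g value are placeholders; size the driver heap to your workload):

```
# Sketch only: placeholder class/jar; pick a driver heap size that fits your data.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --driver-memory 4g \
  my-app.jar

# Equivalent via configuration:
#   --conf spark.driver.memory=4g
```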
... View more
08-15-2019
08:09 PM
Hi Allen, You would typically use the NameNode nameservice when you have high availability enabled in HDFS. It's a logical name that represents both your currently active NameNode server and the standby NameNode server. At any point these two servers can switch roles (from active to standby and vice versa), so by using the nameservice the connection between your client and HDFS is handled seamlessly. You shouldn't need to know which one of your servers is the active NameNode, and that's not something you can guarantee anyway. Hope that helps!
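As a hedged sketch (the nameservice name and the nn1/nn2 NameNode IDs below are placeholders for whatever your HA configuration defines):

```
# Sketch only: 'nameservice1' and the nn1/nn2 IDs are placeholders for your HA config.
# Access HDFS through the nameservice; the client resolves the active NameNode for you:
hdfs dfs -ls hdfs://nameservice1/user

# If you're curious which NameNode is currently active:
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
```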
... View more
12-05-2018
07:58 AM
Just to be clear, you're only deleting data for the specific partitions that are impacted and not everything under the broker's data directory. I just wasn't sure what you meant by rm -rf here so wanted to clarify. Good luck, and please do let us know of the outcome.
... View more
12-05-2018
07:26 AM
It's a fairly new issue that I personally haven't seen before with any of the current customers running the Cloudera Distribution of Kafka, but the latest versions released (Cloudera Distribution of Kafka 3.1.1 and Kafka in CDH 6.0) are based on Apache Kafka 1.0.1. The plan for CDH 6.1 is to rebase Cloudera Kafka onto Apache Kafka 2.0, so it's probably just a matter of time until this becomes a more common issue. You mentioned that restarting the Kafka service causes the problematic partitions to change. Is that also the case when you only shut down a single broker and start it up again? I'm asking because one potential way to work around this is to identify which broker is lagging behind and not joining the ISR, shut down that broker, delete the topic partition data (for the affected partitions) from disk, and then start the broker up again. The broker will start and self heal by replicating all the data from the current leader of those partitions. Obviously this can take a long time depending on how many partitions are affected and how much data needs to be replicated.
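Roughly, that workaround could look like the sketch below (the data directory and topic-partition names are placeholders; only remove the specific affected partition directories, and only after the broker is fully stopped):

```
# Sketch only: placeholder paths and topic-partition names.
# 1. Stop the affected broker (e.g. via Cloudera Manager).
# 2. Remove only the affected partition directories from that broker's data dir:
rm -rf /var/local/kafka/data/my-topic-3 /var/local/kafka/data/my-topic-7
# 3. Start the broker again; it will re-replicate those partitions from the current leaders.
```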
... View more
11-29-2018
05:56 PM
Hi desind, I'm not sure there's a way to force it to sync here. From what you're describing and the error you shared, I think what's happening is that the replica fetcher thread fails and the broker stops replicating data from the leader. That would explain why you see the broker out of sync for a long time. Are you using Cloudera's distribution of Kafka or is this Apache Kafka? What version are you using? I see that someone reported a similar issue very recently: https://issues.apache.org/jira/browse/KAFKA-7635
... View more
07-19-2018
01:55 PM
Hello ChelouteMS, When I look at the chart and compare it with what you have described, where you noticed the scheduling delay increase once the number of devices publishing data to Kafka topics increased, I don't see anything alarming. It's a classic issue where you begin to have batches pile up and scheduling delays increase as you increase the amount of data that needs to be processed. The odd thing I see is that you seem to have 51 containers allocated even though you mentioned that you specifically asked for 12 executors with 1 core each. I would then expect your application to have 13 containers in total (12 executors + 1 for the ApplicationMaster). How are you specifying the number of executors? Do you have dynamic allocation enabled? If so, I would try disabling that: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/spark_streaming.html#section_nhw_jpp_45 The next place to look is the output of your spark2-submit command (the driver output, since you're running in client mode) and the ApplicationMaster's log to confirm whether the application did in fact ask for that many containers.
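For example, a hedged sketch of pinning the executor count explicitly with dynamic allocation disabled (the class name, jar, and memory size are placeholders):

```
# Sketch only: placeholder class/jar and memory size.
spark2-submit \
  --class com.example.StreamingApp \
  --master yarn \
  --deploy-mode client \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 12 \
  --executor-cores 1 \
  --executor-memory 4g \
  my-streaming-app.jar
```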
... View more