Member since: 07-19-2017
Posts: 53
Kudos Received: 3
Solutions: 3
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 953 | 08-23-2019 06:51 AM |
| | 1716 | 08-23-2019 06:45 AM |
| | 1570 | 08-20-2019 02:06 PM |
06-24-2021
04:06 AM
I'm seeing the same issue. I can see "Transition from state INITIALIZING to error state FATAL_ERROR" once I set "Use Transactions" = "true" and "Delivery Guarantee" = "Guarantee Replicated Delivery".
02-19-2020
04:44 PM
1 Kudo
@WilsonLozano,
As this thread is older and was marked 'Solved' back in August of 2019, you would have a better chance of receiving a resolution by starting a new thread. A new thread also gives you the opportunity to provide details specific to your environment, version of CDH, etc., which could help others give a more accurate answer to your question.
01-06-2020
09:26 AM
Hi, As mentioned in the previous posts, did you try increasing the memory, and did it solve the issue? Please let us know if you are still facing any problems. Thanks, AKR
11-20-2019
12:57 PM
1 Kudo
This issue would really require further debugging. For whatever reason, at that particular time something happened with the user ID resolution. We've seen customers before that had similar issues when tools like SSSD are being used: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sssd-system-uids One idea here is to create a shell script that runs the commands 'id ptz0srv0z50' and 'id -Gn ptz0srv0z50' in a loop at some interval, say 10, 20 or 30 seconds. When the problem occurs, go over the output of that script and see whether anything looks different around the time of the issue.
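To spell that out, a minimal sketch of such a monitoring script; the user name is taken from the post, while the interval and log path are arbitrary placeholders you would adjust:

```bash
#!/bin/bash
# Log the output of 'id' for the affected user at a fixed interval so the
# results can be compared against the time the problem occurred.
USER_TO_CHECK="ptz0srv0z50"   # user from the original post
INTERVAL=30                   # seconds between checks (placeholder)
LOGFILE=/var/tmp/id-check.log # placeholder path

while true; do
    {
        date '+%Y-%m-%d %H:%M:%S'
        id "$USER_TO_CHECK"
        id -Gn "$USER_TO_CHECK"
        echo "---"
    } >> "$LOGFILE" 2>&1
    sleep "$INTERVAL"
done
```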
11-12-2019
04:45 PM
Hi w@leed,

Thanks for replying. I tested the job with all three collectors - ParallelGC, CMS and G1GC.

Options I tested with G1GC:
-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

and with CMS:
-XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseParNewGC -XX:+CMSConcurrentMTEnabled -XX:ParallelCMSThreads=10 -XX:ConcGCThreads=8 -XX:ParallelGCThreads=16

With G1GC defaults, I could see the following:

Desired survivor size 1041235968 bytes, new threshold 5 (max 15)
[PSYoungGen: 1515304K->782022K(3053056K)] 2750361K->2017087K(6371840K), 1.5875321 secs] [Times: user=4.72 sys=0.74, real=1.59 secs]
Heap after GC invocations=9 (full 3):
PSYoungGen total 3053056K, used 782022K [0x0000000580000000, 0x000000068ef80000, 0x0000000800000000)
  eden space 2270720K, 0% used [0x0000000580000000,0x0000000580000000,0x000000060a980000)
  from space 782336K, 99% used [0x000000065f380000,0x000000068ef31ab0,0x000000068ef80000)
  to space 1016832K, 0% used [0x0000000612d80000,0x0000000612d80000,0x0000000650e80000)
ParOldGen total 3318784K, used 1235064K [0x0000000080000000, 0x000000014a900000, 0x0000000580000000)
  object space 3318784K, 37% used [0x0000000080000000,0x00000000cb61e318,0x000000014a900000)
Metaspace used 55055K, capacity 55638K, committed 55896K, reserved 1097728K
  class space used 7049K, capacity 7207K, committed 7256K, reserved 1048576K
}
{Heap before GC invocations=10 (full 3):
PSYoungGen total 3053056K, used 3052742K [0x0000000580000000, 0x000000068ef80000, 0x0000000800000000)
  eden space 2270720K, 100% used [0x0000000580000000,0x000000060a980000,0x000000060a980000)
  from space 782336K, 99% used [0x000000065f380000,0x000000068ef31ab0,0x000000068ef80000)
  to space 1016832K, 0% used [0x0000000612d80000,0x0000000612d80000,0x0000000650e80000)
ParOldGen total 3318784K, used 1235064K [0x0000000080000000, 0x000000014a900000, 0x0000000580000000)
  object space 3318784K, 37% used [0x0000000080000000,0x00000000cb61e318,0x000000014a900000)
Metaspace used 55108K, capacity 55702K, committed 55896K, reserved 1097728K
  class space used 7049K, capacity 7207K, committed 7256K, reserved 1048576K
42.412: [GC (Allocation Failure)
Desired survivor size 1653080064 bytes, new threshold 4 (max 15)
[PSYoungGen: 3052742K->1016800K(3422720K)] 4287807K->2985385K(6741504K), 4.0304873 secs] [Times: user=11.87 sys=1.77, real=4.03 secs]
Heap after GC invocations=10 (full 3):
PSYoungGen total 3422720K, used 1016800K [0x0000000580000000, 0x0000000727a80000, 0x0000000800000000)
  eden space 2405888K, 0% used [0x0000000580000000,0x0000000580000000,0x0000000612d80000)
  from space 1016832K, 99% used [0x0000000612d80000,0x0000000650e78240,0x0000000650e80000)
  to space 1614336K, 0% used [0x00000006c5200000,0x00000006c5200000,0x0000000727a80000)
ParOldGen total 3318784K, used 1968584K [0x0000000080000000, 0x000000014a900000, 0x0000000580000000)
  object space 3318784K, 59% used [0x0000000080000000,0x00000000f8272318,0x000000014a900000)
Metaspace used 55108K, capacity 55702K, committed 55896K, reserved 1097728K
  class space used 7049K, capacity 7207K, committed 7256K, reserved 1048576K

With all the collectors, the only difference I could see was a delayed full GC. I am considering changing the YoungGen size now and will update if I see a difference.

On a parallel note:
1. I also see that some objects in memory remain persistent across GC cycles - for example scala.Tuple2 and java.lang.Long.
2. These are Java RDDs.

Regards
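For context, a hedged sketch of one way GC options like the ones above are typically passed to a Spark job (the post mentions RDDs, so a Spark job is assumed); the class and jar names are placeholders, and the flag set simply mirrors the G1 options listed above:

```bash
# Illustrative only: pass GC and GC-logging flags to the Spark driver and executors.
# com.example.MyJob and my-job.jar are placeholder names.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --class com.example.MyJob \
  my-job.jar
```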
10-04-2019
03:59 AM
Hmm, I can now reproduce the issue. After creating the privilege:

kafka-sentry --config /etc/sentry/conf -gpr -r eric-test -p 'HOST=*->CLUSTER=kafka-cluster->action=clusteraction'

it is stored as "cluster_action":

kafka-sentry --config /etc/sentry/conf -lp -r eric-test
...
HOST=*->CLUSTER=kafka-cluster->action=cluster_action

And when I try to drop it, it fails with the error you are seeing. I need a bit more time to look into why.
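For anyone following along, a hedged sketch of the drop attempt being described; the -rpr (revoke privilege from role) flag and the idea of trying both action spellings are my assumptions, not something verified in this thread:

```bash
# Attempt to revoke the privilege first with the action string it was granted with,
# then with the underscored form it is actually stored as (both are assumptions).
kafka-sentry --config /etc/sentry/conf -rpr -r eric-test \
  -p 'HOST=*->CLUSTER=kafka-cluster->action=clusteraction'
kafka-sentry --config /etc/sentry/conf -rpr -r eric-test \
  -p 'HOST=*->CLUSTER=kafka-cluster->action=cluster_action'
```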
10-02-2019
07:23 PM
@ravikiran_sharm we've passed along your concerns and note of frustration to the relevant parties internally and they are actively working on your case. They say they are working with you directly to get this resolved.
09-30-2019
08:50 PM
Hi @anbazhagan_muth You don't need to worry about those two configurations unless you're using Kafka MirrorMaker:

Destination Broker List: bootstrap.servers
Source Broker List: source.bootstrap.servers

Kafka MirrorMaker is used to replicate data from one Kafka service to another. With that said, the configurations should be self-explanatory: the source broker list (source.bootstrap.servers) is the list of brokers in the source Kafka service that MirrorMaker reads data from, and the destination broker list (bootstrap.servers) is the list of brokers in the destination Kafka service that MirrorMaker writes data to. Each is a comma-separated list in a format like:

BROKER1_HOSTNAME:PORT_NUMBER,BROKER2_HOSTNAME:PORT_NUMBER

PORT_NUMBER is going to be either 9092 for PLAINTEXT or SASL_PLAINTEXT, or 9093 for SSL or SASL_SSL.
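As a concrete illustration, hedged example values for the two lists; the hostnames and ports below are placeholders, not values from this thread:

```
# Placeholder hostnames and ports only
source.bootstrap.servers=src-broker1.example.com:9092,src-broker2.example.com:9092
bootstrap.servers=dest-broker1.example.com:9093,dest-broker2.example.com:9093
```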
08-23-2019
06:51 AM
@paleerbccm Briefly looking at the message, I would assume 'error_code=0' actually means that no errors occurred. It would take quite a bit of digging into the code to know for sure, but generally speaking, I wouldn't worry too much about TRACE level logs. Ideally, and especially since this is a production environment, you would normally set the logging level to INFO, and that's about all you would need. Unless you have intimate knowledge of the code and you're chasing a specific issue, it's rare that you would ever need TRACE level logs.
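Purely to illustrate the INFO-level recommendation, a sketch of the relevant line in a typical Kafka log4j.properties; the appender name is an assumption, and on a managed cluster you would normally change the logging threshold through the management UI rather than editing this file by hand:

```
# Keep the broker's root logger at INFO unless actively chasing a specific issue
log4j.rootLogger=INFO, kafkaAppender
```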
08-23-2019
06:29 AM
Hi @HarpreetSingh31 It's not clear to me what the issue is when you say you have problems running the producer and consumer. The output you're seeing is normal and it looks to be functioning as expected. I'm going to assume the problem you're referring to is that you can't seem to read the messages you're sending to the topic.

Based on the information you posted, where you highlighted that you're only running one Kafka broker, I believe the problem here is that you need to change the Kafka configuration offsets.topic.replication.factor and set it to 1. In Cloudera Manager we have always set that to 3 by default, and I have filed an improvement internally to ensure we do not set it to 3 when a user installs a new Kafka service with fewer brokers than that. If you look at the Kafka broker log you'll see an error like the one below or something similar:

Number of alive brokers '1' does not meet the required replication factor '3' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet.

You can make the change from:
Cloudera Manager > Kafka > Configuration > Search for 'offsets.topic.replication.factor'

After you change this value you will need to restart your Kafka service. Be aware that once you set this to 1, your internal __consumer_offsets topic (used by consumers to commit their offsets) will be created with a replication factor of 1, and this won't change even as you add more brokers to your cluster. If in the future you add more brokers, you will have to expand the replication factor for this topic using the kafka-reassign-partitions tool: https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools Hope this helps!
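To make that last step more concrete, a hedged sketch of expanding the replication factor for __consumer_offsets with the reassignment tool; the broker IDs (1, 2, 3), the single partition shown, and the ZooKeeper address are placeholders, and a real __consumer_offsets topic has 50 partitions by default, each of which would need an entry:

```bash
# Build a reassignment file for the internal offsets topic.
# Broker IDs and the single partition shown are placeholders only.
cat > increase-offsets-rf.json <<'EOF'
{"version":1,
 "partitions":[
   {"topic":"__consumer_offsets","partition":0,"replicas":[1,2,3]}
 ]}
EOF

# Run the reassignment (ZooKeeper connect string is a placeholder)
kafka-reassign-partitions --zookeeper zk1.example.com:2181 \
  --reassignment-json-file increase-offsets-rf.json --execute
```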
08-21-2019
11:45 PM
Hi Feloix, Yes, I also tried kinit before executing spark-submit, but it failed with the same error. The Spark job is accepted in YARN; I can see the job reach accepted status in the logs. If Kerberos authentication fails, it usually fails before YARN accepts the job. It seems the error is caused by the authentication token not being passed correctly to the ResourceManager or NodeManagers. Only when I define a name service and use it in the fs.defaultFS parameter can the Spark job complete successfully in YARN.
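For anyone comparing the two setups, a hedged illustration of the difference being described; nameservice1, the NameNode host, and the jar name are placeholders, and overriding fs.defaultFS per job via spark.hadoop.* is just one way to test the idea, not the fix described in this thread:

```bash
# Failing case described above: fs.defaultFS pointing at a single NameNode host,
#   e.g. hdfs://namenode1.example.com:8020
# Working case: fs.defaultFS pointing at a defined HA nameservice,
#   e.g. hdfs://nameservice1

# One way to test the override for a single job (placeholder values):
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.hadoop.fs.defaultFS=hdfs://nameservice1 \
  my-app.jar
```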
08-21-2019
11:28 AM
Thanks, that does show more information. Though what I find weird is that the same query has run with a large load earlier (with the same config params) and has now failed (from the logs: java.lang.OutOfMemoryError: Java heap space). Regards
Tags: Spark
08-21-2019
09:08 AM
I have a C# microservice running in the cloud that continuously receives data from about 20k devices in the field. Every time this microservice receives data, it passes it to the NiFi PublishKafka processor, which in turn is configured to create a new topic with the pattern TL-<DeviceSerialNo>. PublishKafka places the device data on the topic and publishes it to Kafka. So let's say TL-0001 ... TL-1000, TL-1299 ... TL-20000 are the Kafka topics that are supposed to be created by NiFi. But when I go to the Kafka broker host, what I find is that after TL-99 there are no new topics created.
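For anyone debugging something similar, a hedged sketch of checking which TL-* topics the brokers actually know about; the ZooKeeper address is a placeholder, and newer Kafka clients would use --bootstrap-server instead:

```bash
# List and count the TL-* topics Kafka has actually created (placeholder ZK address)
kafka-topics --zookeeper zk1.example.com:2181 --list | grep '^TL-' | sort | tail -20
kafka-topics --zookeeper zk1.example.com:2181 --list | grep -c '^TL-'
```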
08-21-2019
06:10 AM
Now I am really clear about the situation. Thanks a lot for your replies.
08-20-2019
02:18 PM
Hi @sauravsuman689 A common issue that people have when using the kafka-consumer-groups command line tool is that they do not set it up to communicate over Kerberos like any other Kafka client (i.e. consumers and producers). The security.protocol output you shared based on the cat command doesn't look right:

cat /tmp/grouprop.properties
security.protocol=PLAINTEXTSASL

This should instead be:

security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka

You can use the same instructions outlined in the following link, starting with step number 5: https://www.cloudera.com/documentation/kafka/latest/topics/kafka_security.html#concept_lcn_4mm_s5 I understand you're using HDP, but it should be pretty much the same steps. You will of course use the same command line tool you're already using, as opposed to the consumer command mentioned in the link:

[kafka@XXX ~]$ /usr/hdp/current/kafka-broker/bin/kafka-consumer-groups.sh --bootstrap-server xxxx:6667,xxxx:6667,xxxx:6667 --list --command-config /tmp/grouprop.properties

EDIT: It seems HDP works a bit differently, so your security.protocol parameter actually aligns with what the HDP platform expects.
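To round out the client-side setup, a hedged sketch of the pieces that usually accompany those two properties when a CLI client talks to a Kerberized cluster; the principal, file paths, and broker address are placeholders, and the authoritative steps are in the linked documentation:

```bash
# Authenticate, then point the client JVM at a JAAS config that uses the ticket cache
kinit your_principal@EXAMPLE.COM

# Contents of a client JAAS file such as /tmp/kafka_client_jaas.conf (placeholder path):
#   KafkaClient {
#     com.sun.security.auth.module.Krb5LoginModule required
#     useTicketCache=true;
#   };

export KAFKA_OPTS="-Djava.security.auth.login.config=/tmp/kafka_client_jaas.conf"
/usr/hdp/current/kafka-broker/bin/kafka-consumer-groups.sh \
  --bootstrap-server broker1.example.com:6667 \
  --list --command-config /tmp/grouprop.properties
```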
12-06-2018
01:33 PM
1 Kudo
Yes, we only tried deleting the out-of-sync partition, and it did not work. After a lot of research we came to the conclusion to increase replica.lag.time.max.ms to 8 days, as it had been around 8 days that a few replicas were out of sync. This resolved our issue, though it took a few hours for the followers to fetch and replicate the 7 days of data. https://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/ helped in understanding the ISRs.
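For reference, the arithmetic behind that value and the broker property it maps to; the property-file form is illustrative, and on a managed cluster you would set it through your management tooling rather than editing server.properties directly:

```
# 8 days in milliseconds: 8 * 24 * 3600 * 1000 = 691200000
replica.lag.time.max.ms=691200000
```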
07-19-2018
01:55 PM
Hello ChelouteMS, When I look at the chart and compare it with what you described, where you noticed the scheduling delay increasing once the number of devices publishing data to Kafka topics increased, I don't see anything alarming. It's a classic issue where batches begin to pile up and scheduling delays grow as you increase the amount of data that needs to be processed. The odd thing I see is that you seem to have 51 containers allocated, even though you mentioned that you specifically asked for 12 executors with 1 core each. I would then expect your application to have 13 containers in total (12 executors + 1 for the Application Master). How are you specifying the number of executors? Do you have dynamic allocation enabled? If so, I would try disabling that: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/spark_streaming.html#section_nhw_jpp_45 The next place to look is the output of your spark2-submit command (the Driver output, since you're running in client mode) and the Application Master's log, to confirm whether the application did in fact ask for that many containers.
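To make that concrete, a hedged sketch of pinning the executor count for a streaming app; the class and jar names are placeholders, and the spark.streaming.dynamicAllocation property is my reading of the setting the linked page refers to:

```bash
# Disable dynamic allocation and ask for exactly 12 single-core executors
# (com.example.StreamingApp and streaming-app.jar are placeholder names).
spark2-submit \
  --master yarn --deploy-mode client \
  --num-executors 12 \
  --executor-cores 1 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.streaming.dynamicAllocation.enabled=false \
  --class com.example.StreamingApp \
  streaming-app.jar
```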