Member since 05-01-2023 | 8 Posts | 0 Kudos Received | 0 Solutions
05-22-2023
08:27 AM
@cotopaul Please go through my latest response before concluding that it's a hardware or setup issue. FYI, NiFi has been working fine for more than a week since I removed the ExecuteScript processor. That is the only change I made, and I replicated the issue several times before posting here. NiFi has restarted several times since then without any issues. Could you please explain what sort of hardware issue would affect only an ExecuteScript processor running Python code?
05-16-2023
12:21 AM
Sorry for the late response. I wanted to come back with more information. Yes, I am sure that I replace flow.xml.gz to recover from this issue. As I mentioned, I have configured the default flow file as flow.xml.gz in nifi.properties, so it uses XML. Since, as you said, newer versions support JSON, I tried JSON too. I can also confirm that the shutdown is graceful. After much debugging and root cause analysis, I am now able to reproduce the issue and have identified the exact cause. The logs don't point to it directly, so I seek your and the community's help in confirming the issue and finding a resolution.
My findings:
1) NiFi was failing for me on the first restart itself.
2) From the logs, it's clear that it tries to load the flow, then runs what is probably schema (DTD/XSD) validation on it, throws a warning, and then shuts down.
3) These warnings also appear with a newly created basic flow (with a couple of processors), but the new flow loads after the warning; startup continues past the point where the warning is thrown.
4) Putting points 2 and 3 together, it was safe to assume that something in the flow itself causes NiFi to fail on restart.
5) Everything in the flow was plain NiFi processors doing a bunch of tasks. However, I had an ExecuteScript block with Python code to make some modifications to flowfile content.
So I created a fresh flow with just 3 processors: GenerateFlowFile, ExecuteScript, LogAttribute. In GenerateFlowFile I configured some random content. In ExecuteScript I copied the code from https://community.cloudera.com/t5/Support-Questions/How-to-Iterate-the-Flow-file-in-Nifi-using-Python/m-p/242697 after indenting it properly, and then terminated the flow at LogAttribute. Basically, a very simple flow that makes some modification to flowfile content. That's it. I kept the processors running and it was working fine. Then I deleted the pod; the logs indicated a graceful shutdown. The pod restarted and I got the exact same error: warnings on the schema, followed by an error that the server is shutting down.
I took fresh NiFi instances and repeated this several times, and it fails every time with an ExecuteScript block running Python code. The logs are not very informative, but I can confirm that this is the root cause of the issue. Can you or anyone else confirm whether it's reproducible and, if so, what the resolution should be? NiFi version used: 1.19.1
05-08-2023
10:38 AM
Thanks for your response. It was very helpful and informative. I am using NiFi 1.19 and have tried using flow.json.gz, and I ran into the same error there. What I understand from the ShutdownHook code is that whenever a SIGTERM signal is issued by k8s, it starts the shutdown, and if NiFi doesn't shut down within the grace period configured in nifi.properties, the core process itself is forcefully destroyed. As you said, the chances of the flow file (flow.xml.gz) getting corrupted are pretty slim, as many things need to happen in succession. I was wondering what else could cause the flow file to go bad. Also, in the restart logs I never saw an error saying that the flow file had gone bad; it throws a warning with schema errors and just shuts down. Only when I replace the flow XML, or rename the default flow file to something else, does the warning disappear and NiFi start up.
Another piece of information I would like to bring to your notice: this happens in our prod environments too, where the flow is essentially read-only for everyone, so we never make any changes to the flow. We also have checkpointing enabled, so every few minutes it archives the current flow state. Does that work the same way as making a change to the flow, i.e. archive the current flow XML and create a new one? If yes, it could fit the rare sequence of events that causes the flow to be corrupted.
One more thing: is it possible to configure NiFi's startup behaviour so that, if the flow XML is corrupt, it ignores it and starts up with a new flow XML? In higher environments, losing the flow is less of a concern than having to deploy NiFi again with a new flow XML. We have NiFi Registry and our own flow deployment tools to sync the flow across environments and recover it quickly, but NiFi crashing every time something goes wrong with the flow XML doesn't help maintain high availability or resiliency. Looking forward to your response.
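To make the last question concrete: the manual workaround described above (renaming the flow file so NiFi starts clean) could in principle be scripted as a pre-start step. The following is only a minimal sketch under stated assumptions, not a NiFi feature. The function names `flow_is_readable` and `quarantine_if_unreadable` are hypothetical, and the check only covers gzip integrity and XML well-formedness. It does not run NiFi's own schema validation, so a flow that decompresses and parses but still triggers the schema warning from this thread would pass it.

```python
import gzip
import time
import xml.etree.ElementTree as ET
import zlib
from pathlib import Path


def flow_is_readable(flow_path: Path) -> bool:
    """Return True if the file decompresses fully and parses as well-formed XML.

    This is NOT NiFi's schema validation; it only catches truncated or
    damaged gzip data and XML that is not well-formed.
    """
    try:
        with gzip.open(flow_path, "rb") as handle:
            ET.parse(handle)
        return True
    except (OSError, EOFError, zlib.error, ET.ParseError):
        return False


def quarantine_if_unreadable(flow_path: Path) -> bool:
    """Rename an unreadable flow file out of the way.

    Returns True if the file was renamed, False if it was missing or readable.
    The ".corrupt-<timestamp>" suffix is an arbitrary choice for this sketch.
    """
    if not flow_path.exists() or flow_is_readable(flow_path):
        return False
    backup = flow_path.with_name(f"{flow_path.name}.corrupt-{int(time.time())}")
    flow_path.rename(backup)
    return True
```

It would be called with the path of the flow file before NiFi starts, for example `quarantine_if_unreadable(Path("conf/flow.xml.gz"))`, where the path is a placeholder for wherever the instance keeps its flow. The damaged copy is kept rather than deleted so it can still be inspected afterwards.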
05-04-2023
02:30 AM
@cotopaul wrote: Assuming that you are running on Linux, you need to find your operating system logs. In most linux distribution, those logs are to be found in /var/log. Now, every sysadmin configures each server by your company rules and requirements so I suggest you speak with the team responsible for the linux server and ask them to provide you with the logs. In these logs, you might find out why you are receiving that error in the first place. Unfortunately, this problem is not really related to NiFi, but to your infrastructure or to how somebody uses your NiFi instance. Somebody is doing something and you need to find out who and what 😞
I totally agree with you, but I just want to confirm one thing. I am running NiFi on EKS, and Kubernetes seems to restart the pod every once in a while based on memory usage or other metrics it tracks. The pattern I see in the logs is that NiFi shuts down just before my flow file gets corrupted. So I am assuming that it is probably an abrupt shutdown triggered by EKS that causes the flow file to be corrupted. The bigger question is this: whatever the reason for the shutdown, if NiFi receives the shutdown signal and shuts down, isn't it supposed to shut down gracefully?
I tried replicating this in a lower environment. Whenever I terminate the running pod directly, it shuts down NiFi and corrupts the flow XML 8 times out of 10, which makes me believe that in those 8 cases it was shut down while updating the flow XML. Is there a way I can confirm whether NiFi keeps the flow intact on a regular shutdown, or is this expected behaviour? We expect pods to be restarted by Kubernetes; one of the reasons we use a multi-node architecture is that if one node goes down, the others keep the application running while the failed node is brought back up behind the scenes. But here, if we lose a node altogether every time it runs into an issue, we eventually end up with 0 functional nodes, which is what is currently happening. How are we supposed to address that?
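One way to check the "is the flow intact after a regular shutdown" question empirically is to fingerprint the flow before and after a restart. This is a minimal sketch, assuming the flow is stored as a gzip file as described in this thread; `flow_fingerprint` is a hypothetical helper name. It hashes the decompressed content, because a gzip header can carry a timestamp and file name that change even when the payload does not.

```python
import gzip
import hashlib
from pathlib import Path


def flow_fingerprint(flow_path: Path) -> str:
    """Return the SHA-256 hex digest of the decompressed flow content.

    Raises an exception (for example EOFError or OSError) if the gzip
    data is truncated or damaged, which is itself a sign of corruption.
    """
    digest = hashlib.sha256()
    with gzip.open(flow_path, "rb") as handle:
        # Read in 1 MiB chunks so large flows are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The idea would be to record the fingerprint of a copy taken before terminating the pod and compare it with the file found after the restart. A differing fingerprint is only a prompt to diff the two copies, not proof of corruption, since the flow may have been rewritten for legitimate reasons in between.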
05-03-2023
09:49 PM
These shutdowns seem to be totally random as of now. The logs I have access to only say what I mentioned: "Received trapped signal.. shutting down". I am trying to locate the source of the issue, but the logs are not helping. Could you tell me which log files to look into? I will try to get my hands on them in a lower environment and see if I can locate the source.
05-03-2023
08:11 AM
I am doing root cause analysis, and the source of the issue seems to be an abrupt NiFi shutdown resulting in flow file corruption. The logs before the first error:
Received Trapped Signal.. Shutting down
Apache Nifi has accepted the shutdown signal and is shutting down now.
Failed to determine if process 165 is running or not, assuming it is not.
Further information: I am using a 3-node NiFi cluster with 32 GB memory. The allocated heap memory is 24 GB.
05-01-2023
11:49 PM
Hi All, I have NiFi running on an EKS cluster with 3 nodes. Every now and then, flow.xml.gz gets corrupted on one of the nodes and that node fails to start back up. The same thing happens to the other nodes sooner or later, and ultimately NiFi crashes. The error from the node's startup logs:
schema validation Error parsing Flow configuration at line <line_no>, col <col no > : cvc-complex-type.2.4.d : Invalid content was found starting with element 'property'. No child element is expected at this point.
It has become a pain to maintain NiFi because of this issue. Please suggest a solution.
Labels: Apache NiFi