Created 05-10-2023 12:43 PM
Hi all,
I have several jobs processing daily files.
In short, I load zipped files from an S3 directory, extract them, and upload the extracted files back to the same directory, into a specific folder created by NiFi.
The workflow works great, but it looks like some files are not being processed. I checked NiFi for any errors or warnings but it seems like everything is working fine.
Has anyone experienced a similar problem?
If so, how did you fix it?
Thank you
Created 05-11-2023 12:08 AM
@acasta,
What do you mean when saying that some files are not being processed? Are you not extracting all the ZIP Files from S3 or are the files extracted out of the zip files not present in your newly created folder?
Have you checked whether the extracted files have the same name? For example, zip 1 contains a file called ingested_data.csv and zip 2 contains a file with exactly the same name but different content. If this is the case, when your files get saved in your folder (no matter whether we are talking about S3, GCP, PutFile, Azure or anything else), they will get overwritten by the latest file.
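As a rough illustration of the name-collision check described above, here is a minimal Python sketch. It assumes you already have, for each archive, the list of file names it contains; the function name `find_colliding_names` and the sample data are hypothetical and not part of NiFi.

```python
from collections import defaultdict


def find_colliding_names(archives):
    """Return {file_name: [archive, ...]} for names present in more than one archive.

    `archives` maps an archive name to the list of file names inside it
    (hypothetical input, collected however is convenient for you).
    """
    seen = defaultdict(list)
    for archive, members in archives.items():
        # set() so a name repeated inside one archive is counted once
        for name in set(members):
            seen[name].append(archive)
    return {name: sorted(srcs) for name, srcs in seen.items() if len(srcs) > 1}


# Example data mirroring the case above: both archives contain ingested_data.csv
archives = {
    "zip1": ["ingested_data.csv", "a.csv"],
    "zip2": ["ingested_data.csv", "b.csv"],
}
collisions = find_colliding_names(archives)
print(collisions)
```

Any name reported here would be written twice to the same destination, so only the last copy would survive.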
Created 05-11-2023 08:13 AM
@cotopaul Thanks for replying.
Apologies for not explaining myself properly. Here is the situation.
The initial data in the directory s3.bucket/data/to/be/processed/ looks like the following:
file0000.tar.gz
file0015.tar.gz
...
file2345.tar.gz
where 0000, 0015, ..., 2345 are timestamps in 15-minute increments.
The end result should look like the following (same directory s3.bucket/data/to/be/processed/):
file0000/extracted_files
file0015/extracted_files
...
file2345/extracted_files
where the folder is named after the original file, and it contains the extracted files.
What I am experiencing is that most of the folders are created as expected and the files are correctly extracted. However, I often get two or three files that seem not to get processed. I checked, as you suggested, whether it might be a naming issue, but that's not the case. When I run another job pointing at those files that were not extracted in the first place, the end result is what I expect.
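To pin down exactly which archives were skipped on a given day, a small Python sketch like the one below could help. It assumes the suffix is an HHMM time of day in 15-minute steps (so 96 archives per day) and that you can obtain the list of folder names currently in the bucket by some other means; `expected_folders` and `missing_folders` are hypothetical helper names.

```python
def expected_folders(prefix="file", step_minutes=15):
    """Build the folder names expected for one day: file0000, file0015, ..., file2345.

    Assumes the four digits are HHMM, which is an interpretation of the listing above.
    """
    return [
        f"{prefix}{minute // 60:02d}{minute % 60:02d}"
        for minute in range(0, 24 * 60, step_minutes)
    ]


def missing_folders(present, expected=None):
    """Return the expected folder names that are absent from `present`, in time order."""
    if expected is None:
        expected = expected_folders()
    present = set(present)
    return [name for name in expected if name not in present]


# Example: pretend every folder exists except two
all_names = expected_folders()
present = [n for n in all_names if n not in ("file0130", "file1745")]
missing = missing_folders(present)
print(missing)
```

Running this after each daily job would show whether the skipped files follow a pattern, for example always the same time slots.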
Hope this was clear.
Thanks for the help.
Created 05-12-2023 12:37 AM
@acasta,
Don't get me wrong, but I highly doubt that NiFi is somehow ignoring or deleting the files without your intervention or configuration. I would suggest the following two actions:
- First of all, add a LogMessage/LogAttribute after you have unzipped all those files. Basically, duplicate the success queue of the processor where you unpack your tar file and log each file that was extracted. This way, you get a list of all the files extracted from your archive. Make sure to set the queue to single node, so that you only need to check nifi-app.log on one node.
- Next, add another LogMessage/LogAttribute after the processor that saves the data into your bucket. Send the names of the unzipped files to the logs to get a list of all the files that have been saved into your bucket. Make sure to set the queue to single node, so that you only need to check nifi-app.log on one node.
Afterwards, you can compare the lists and see whether you have extracted and saved all your files. If the lists match 1:1, the problem is not related to NiFi itself but to something else, such as another system doing something in your bucket, or files with the same name getting overwritten.
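The comparison of the two lists can be sketched in Python as follows. This assumes you have already pulled the file names out of the logs into two plain lists; how the names appear in the log depends on how you configure LogMessage/LogAttribute, so no log parsing is attempted here. The function name `compare_file_lists` and the sample names are hypothetical.

```python
def compare_file_lists(extracted, saved):
    """Compare the names logged after extraction with the names logged after saving.

    Returns the names that appear in only one of the two lists.
    """
    extracted_set = set(extracted)
    saved_set = set(saved)
    return {
        "extracted_not_saved": sorted(extracted_set - saved_set),
        "saved_not_extracted": sorted(saved_set - extracted_set),
    }


# Example: one extracted file never reached the bucket
extracted = ["file0000/a.csv", "file0015/a.csv", "file0030/a.csv"]
saved = ["file0000/a.csv", "file0030/a.csv"]
result = compare_file_lists(extracted, saved)
print(result)
```

If both result lists are empty, the two stages agree and the cause is more likely outside the flow.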
Another option would be to use DEBUG on all your processors and use RUN ONCE until you process everything you have to process and analyze in real time what is happening.