Get files recursively from S3 bucket

Contributor

Hi

I am new to NiFi and trying to get this resolved.

We have an S3 bucket with the structure below.

Screenshot 2024-10-21 at 7.47.21 PM.png

We are using the pattern below to get the files recursively.

This works when the files are in the home directory of the bucket, but it does not work for files in a folder or subfolder.

Please let me know how to get the files recursively.

Note: The same structure also has to be created locally with PutFile, which I think will be possible once I get the files from FetchS3Object.

Screenshot 2024-10-21 at 7.48.15 PM.png

1 ACCEPTED SOLUTION

Contributor

Hi Matt,

Closing this query as it's not related to S3 issue.

Thanks for your response.

 


8 REPLIES

Expert Contributor

@nifier -

I emulated the same setup as you via a testing S3 bucket.

drewski7_0-1729522787407.png

Using an access key id, and secret access key that had full access to all of S3, I was able to receive all the objects (including recursive ones) from that bucket.

drewski7_1-1729522961080.png

drewski7_2-1729523003012.png

 


A couple of questions I want to follow up with:

1. In your ListS3 processor configuration, do you have anything set for Prefix or Delimiter? I want to make sure, because either could be filtering out some of the files or directories coming from S3.

2. What are the IAM role and the bucket policy for the bucket you are trying to consume from? Are you certain that the role you are using can access all the objects in the bucket?
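
To show why the prefix and delimiter settings matter, here is a minimal Python sketch of how an S3-style listing applies them. It is a simulation on an in-memory list, not a call to S3 or NiFi. The function name `list_keys` and every key except `20242323/year/year.txt` are made up for illustration.

```python
def list_keys(keys, prefix="", delimiter=""):
    """Simulate S3 listing semantics for prefix and delimiter.

    Returns (contents, common_prefixes), both sorted.
    """
    contents = []
    common_prefixes = set()
    for key in keys:
        if not key.startswith(prefix):
            continue  # the prefix filters out every key that does not match
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Keys below a "folder" are rolled up into one common prefix
            cut = rest.index(delimiter) + len(delimiter)
            common_prefixes.add(prefix + rest[:cut])
        else:
            contents.append(key)
    return sorted(contents), sorted(common_prefixes)


# Hypothetical bucket contents; only the year.txt key appears in this thread.
keys = [
    "20242323/top.txt",
    "20242323/month/month.txt",
    "20242323/year/year.txt",
]

# No delimiter: every key under the prefix is returned, nested ones included.
all_contents, _ = list_keys(keys, prefix="20242323/")

# Delimiter "/": nested keys collapse into common prefixes instead of objects.
flat_contents, folders = list_keys(keys, prefix="20242323/", delimiter="/")
```

In this sketch an empty delimiter returns all three objects, while a delimiter of `/` returns only `20242323/top.txt` as an object and reports the two folders as prefixes. If your ListS3 settings behave like the underlying S3 list call, a delimiter would hide nested files in the same way.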

Contributor

Thank you for your reply.

1. In ListS3, I'm able to list the files; no issues there.

The issue is with the next step, FetchS3Object, which gives me the error below.

Screenshot 2024-10-21 at 9.27.12 PM.png

ERROR
FetchS3Object[id=99843c65-eeb7-1140-824f-1258d088506d] Failed to retrieve S3 Object for FlowFile[filename=20242323/year/year.txt];
routing to failure: com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 404;
Error Code: NoSuchKey; Request ID: tx00000631cd5ef67d0d1fd-006716305c-c9f4b0-ttce-stage-singlesite-zone;
S3 Extended Request ID: c9f4b0-ttce-stage-singlesite-zone-ttce-stage-singlesite-zonegroup; Proxy: null),
S3 Extended Request ID: c9f4b0-ttce-stage-singlesite-zone-ttce-stage-singlesite-zonegroup

2. Yes, we do have access to the bucket. We are able to get the files recursively from a shell script using AWS S3 commands.

Expert Contributor

@nifier 

 

What does your configuration for the FetchS3Object processor look like?

I would make sure you're pointing to the correct region in the FetchS3Object processor and that your AWSCredentialsService or the credentials in the processor are set correctly.

Contributor

The credentials and region are correct, as I'm able to fetch files that are directly under the home directory (20242323) of the bucket. The issue is fetching files that are under folders or subfolders, such as "month" or "year" in this case.

 

nifier_0-1729530590518.png

 

ListS3 Properties

Screenshot 2024-10-21 at 10.33.06 PM.png

 

 

FetchS3Object Properties

Screenshot 2024-10-21 at 10.32.39 PM.png

 

Contributor

For the record:

I was able to resolve the fetching issue. The port number was missing from the overridden endpoint URL.
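
For anyone checking the same thing, here is a small Python sketch that tests whether an endpoint URL names an explicit port. The host name and port are hypothetical, since the real values are not given in this thread, and `has_explicit_port` is just an illustrative helper.

```python
from urllib.parse import urlsplit


def has_explicit_port(endpoint_url):
    """Return True if the URL names a port, e.g. https://host:8443."""
    return urlsplit(endpoint_url).port is not None


# Hypothetical endpoints; the real host and port are not given in this thread.
without_port = "https://s3.example.internal"
with_port = "https://s3.example.internal:8443"

missing = not has_explicit_port(without_port)
present = has_explicit_port(with_port)
```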

However, now I get a different error when writing the file to local disk.

PutFile[id=07413ab5-d3d2-1e9a-99b0-c0f57682f17c] Penalizing FlowFile[filename=20242323/year/year.txt] and transferring to failure: org.apache.nifi.processor.exception.FlowFileAccessException: Failed to export StandardFlowFileRecord[uuid=35019b7d-1e33-44db-8432-6a54b2f6586e,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1729535188764-436, container=default, section=436], offset=70, length=16],offset=0,name=20242323/year/year.txt,size=16] to /apps/fex/shared/mina/archive/.20242323/year/year.txt due to java.io.FileNotFoundException: /apps/fex/shared/mina/archive/.20242323/year/year.txt (No such file or directory)
- Caused by: java.io.FileNotFoundException: /apps/fex/shared/mina/archive/.20242323/year/year.txt (No such file or directory)

 

Master Mentor

@nifier 

Your PutFile issue is unrelated to the original query in this community question. It is better to start a new community question for unrelated queries, as solutions can become confusing to others who may use the thread in the future.

That being said, this exception is caused by your NiFi FlowFile having a filename that contains a directory structure:

20242323/year/year.txt

This is not a valid filename to use with the PutFile processor. I am not sure where in your dataflow before PutFile the filename FlowFile attribute is being modified in such a way. You might be able to address this issue there (preferred).

You could also use an UpdateAttribute processor before the PutFile processor to extract the directory structure from the filename.

MattWho_0-1729539444091.png

 
If you want to maintain that directory structure on disk, append the path extracted from the filename to the "Directory" property configured in the PutFile processor.
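
As a rough illustration of that split, here is a minimal Python sketch. It is not NiFi configuration: in NiFi this would be done with Expression Language in UpdateAttribute (for example with the `substringBeforeLast` and `substringAfterLast` functions), and the exact expressions from the screenshot are not reproduced here. The helper name `split_flowfile_name` is made up; the base directory is taken from the error message above.

```python
import posixpath


def split_flowfile_name(filename, base_directory):
    """Split 'a/b/c.txt' into a target directory and a bare filename."""
    relative_dir, bare_name = posixpath.split(filename)
    if relative_dir:
        # Append the extracted path to the configured base directory
        target_dir = posixpath.join(base_directory, relative_dir)
    else:
        # No directory part in the filename: write straight to the base
        target_dir = base_directory
    return target_dir, bare_name


target_dir, bare_name = split_flowfile_name(
    "20242323/year/year.txt", "/apps/fex/shared/mina/archive"
)
```

Here `target_dir` plays the role of the PutFile "Directory" value and `bare_name` the role of the cleaned-up filename attribute.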

Please help our community thrive. If any of the suggestions or solutions provided helped you solve your issue or answer your question, please take a moment to log in and click "Accept as Solution" on one or more of them.

Thank you,
Matt

 

Contributor

Thanks Matt,

I was able to resolve the issue with your PutFile solution.