How to list from S3 recursively on a prefix in the middle of the flow

I only learn which prefix I need to list recursively after the first four processors in my flow have run. The ListS3 processor cannot be used in the middle of a flow because it does not accept an incoming relationship, so how can I list the prefix in S3 recursively?

I fetch a JSON file from an S3 bucket that contains the prefix information. This prefix changes daily. Then I need to list the prefix recursively.

aws s3 ls s3://{Bucket Name}/{prefix}/ --recursive

2 Replies

Hi @Identity One ,

You're right, it's unfortunate that the ListS3 processor doesn't accept an incoming flowfile to provide a prefix. You might want to file an Apache JIRA ticket (https://issues.apache.org/jira/projects/NIFI : "Create" button) requesting such a feature.

In the meantime, you can use the ExecuteScript processor to construct and run an appropriate S3 list call, with the prefix, in response to incoming flowfiles. You can use any of the scripting languages ExecuteScript allows that have a corresponding AWS SDK client library; see https://aws.amazon.com/tools/#sdk (it looks like the choices are Python, Ruby, or JavaScript). If you click through any of those links on the AWS SDK page, the material under "Documentation" : "Code Examples" : "Amazon S3 Examples" should help. For example, the Python S3 examples are here: https://boto3.readthedocs.io/en/latest/guide/s3-examples.html .
The "list" API is here: https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2
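
As a rough illustration of the SDK approach, here is a minimal sketch of a recursive listing with boto3's list_objects_v2 paginator. Listing by prefix with no delimiter already returns every key beneath that prefix, so no explicit recursion is needed. The function names, and the bucket and prefix in the usage note, are hypothetical. The sketch assumes a standard Python 3 interpreter with boto3 installed and AWS credentials configured; wiring it into ExecuteScript (reading the prefix from the incoming flowfile) is not shown.

```python
def list_keys(client, bucket, prefix):
    """Return every object key under `prefix`, following pagination."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    # No Delimiter is passed, so keys at any depth under the prefix come back.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page (or the whole prefix) is empty.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def make_client():
    """Create an S3 client (assumes boto3 is installed and credentials are set up)."""
    # Imported lazily so list_keys can be exercised without boto3 present.
    import boto3
    return boto3.client("s3")
```

A call would look like list_keys(make_client(), "my-bucket", "daily/2024-01-01/"), where the bucket and prefix are placeholders for the values read from your JSON file.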

If you'd rather not script against the SDK, you could use your favorite scripting language's shell escape (such as Python's 'subprocess' module) to invoke the AWS S3 command-line interface commands (which you're evidently already familiar with). It's crude (and slower), but it would work 🙂
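
For the shell-out route, a sketch along these lines might do. It runs the same aws s3 ls ... --recursive command shown in the question and pulls the keys out of its output. The function names are hypothetical, and the parser assumes each output line has the form "date time size key". It is written for a standalone Python 3.7+ interpreter with the AWS CLI on the PATH; adapting it to the scripting engine inside ExecuteScript is not shown.

```python
import subprocess


def build_ls_command(bucket, prefix):
    """Build the argument list for `aws s3 ls s3://bucket/prefix --recursive`."""
    return ["aws", "s3", "ls", "s3://{}/{}".format(bucket, prefix), "--recursive"]


def parse_ls_output(text):
    """Extract object keys from `aws s3 ls --recursive` output.

    Assumes each non-empty line looks like: <date> <time> <size> <key>.
    """
    keys = []
    for line in text.splitlines():
        # Split on the first three runs of whitespace only, so that
        # spaces inside the key itself are preserved.
        parts = line.split(None, 3)
        if len(parts) == 4:
            keys.append(parts[3])
    return keys


def list_keys_via_cli(bucket, prefix):
    """Run the AWS CLI and return the keys it lists (raises if the CLI fails)."""
    result = subprocess.run(
        build_ls_command(bucket, prefix),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ls_output(result.stdout)
```

Passing the command as a list rather than a single shell string avoids quoting problems if the prefix contains spaces or shell metacharacters.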

Hope this helps.

There's a third option: you could use the InvokeHTTP processor. In it, use the Expression Language to construct the Remote URL so that it forms an AWS S3 web API call that returns the listing you need. The AWS web API is documented here: http://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html

However, as that document says, "Making REST API calls directly from your code can be cumbersome. It requires you to write the necessary code to calculate a valid signature to authenticate your requests. We recommend the following alternatives instead: [AWS SDK or AWS CLI]", which are the approaches I gave above.
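
To make the shape of such a REST call concrete, here is a small Python sketch that builds a ListObjectsV2 URL (the kind of value the Expression Language would have to produce) and parses the XML that comes back. It deliberately leaves out the request signing mentioned above, so as written it would only work against a bucket that permits anonymous listing. The function names are hypothetical, and the virtual-hosted-style endpoint with an explicit region is an assumption about how the bucket is addressed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(bucket, region, prefix, continuation_token=None):
    """Build an unsigned ListObjectsV2 URL (assumes virtual-hosted-style addressing)."""
    params = {"list-type": "2", "prefix": prefix}
    if continuation_token:
        # Needed to fetch the next page when the previous one was truncated.
        params["continuation-token"] = continuation_token
    return "https://{}.s3.{}.amazonaws.com/?{}".format(
        bucket, region, urlencode(params)
    )


def parse_list_response(xml_text):
    """Return (keys, next_token) from a ListObjectsV2 XML response body.

    next_token is None when the response carries no NextContinuationToken.
    """
    root = ET.fromstring(xml_text)
    keys = [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS)]
    token = root.findtext("s3:NextContinuationToken", default=None, namespaces=S3_NS)
    return keys, token
```

A real request would also need the authentication headers described in the AWS documentation, which is the main reason the SDK and CLI routes are less work.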