ListS3 challenges with large pre-existing datasets

Hello All,

I'm reasonably new to NiFi and have some questions about best practices for ListS3.

My use case: I am indexing an S3-compatible object storage platform containing millions of images into Elasticsearch, using the metadata from the images.
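
For context, the indexing side is conceptually simple; roughly something like this (a minimal sketch with placeholder endpoint, bucket, and index names, not my actual flow):

```python
import boto3
import requests

# Sketch only: endpoint, bucket, and index names are placeholders.
s3 = boto3.client("s3", endpoint_url="https://my-onprem-s3.example.com")
ES_URL = "http://elasticsearch.example.com:9200/images/_doc"

def index_image(bucket, key):
    # HEAD the object to get its user metadata without downloading the image
    head = s3.head_object(Bucket=bucket, Key=key)
    doc = {
        "bucket": bucket,
        "key": key,
        "size": head["ContentLength"],
        "last_modified": head["LastModified"].isoformat(),
        "metadata": head.get("Metadata", {}),  # the x-amz-meta-* user metadata
    }
    # Use the object key as the document id so re-runs just overwrite
    doc_id = requests.utils.quote(key, safe="")
    requests.put(f"{ES_URL}/{doc_id}", json=doc).raise_for_status()
```

The hard part is the listing, not the indexing, which is where ListS3 comes in.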

I wanted to do this by pointing ListS3 at a few top-level buckets, but I recognize that at this scale of objects, getting the processor to a stable internal state will take a very long time.

My images have keys based on a project ID, e.g. Images/project-12345/*

This led to the idea of using the Prefix property for each project; however, I ran into some odd behavior:

- I have many sub-projects under each top-level project, each containing tens to hundreds of thousands of images. What I see is that ListS3 keeps sending me the same sub-project (e.g. Images/project-12345/sub-abc/) over and over, never moving on to a different sub-project. currentTimestamp never seems to advance beyond 0, even though the keys ListS3 stores in its state are updated regularly (see the sketch below).
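
Just to check my understanding, this is roughly how I imagine the timestamp-plus-keys state is supposed to work (a rough Python sketch of the general listing pattern, not NiFi's actual implementation; names are placeholders):

```python
from datetime import datetime, timezone
import boto3

# Sketch of a "timestamp watermark + keys at that timestamp" listing pattern.
s3 = boto3.client("s3", endpoint_url="https://my-onprem-s3.example.com")

def list_new_objects(bucket, prefix, state):
    # state = {"current_timestamp": datetime, "keys": set()} persisted between runs
    watermark = state.get("current_timestamp", datetime.fromtimestamp(0, tz=timezone.utc))
    keys_at_watermark = state.get("keys", set())
    found = []

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            ts, key = obj["LastModified"], obj["Key"]
            if ts > watermark or (ts == watermark and key not in keys_at_watermark):
                found.append((ts, key))  # not yet emitted in a previous run

    if found:
        newest = max(ts for ts, _ in found)
        state["current_timestamp"] = newest                    # advance the watermark
        state["keys"] = {k for ts, k in found if ts == newest}  # keys sitting on it
    return [key for _, key in found]
```

If the watermark really is stuck at 0, every run would treat the whole prefix as new, which would at least explain the endless re-listing, but I don't see why it never advances.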

I could take the approach of using the sub-project prefix to load this data, but that means manually configuring a large number of processors, which rather defeats the purpose.
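
What I'd prefer is to enumerate the sub-project prefixes once and drive a single parameterized flow from that list, roughly like this (boto3 sketch, placeholder names):

```python
import boto3

# Ask S3 for the "directories" one level below a project by listing with a
# delimiter and reading the CommonPrefixes. Endpoint/bucket are placeholders.
s3 = boto3.client("s3", endpoint_url="https://my-onprem-s3.example.com")

def list_sub_prefixes(bucket, project_prefix, delimiter="/"):
    prefixes = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=project_prefix, Delimiter=delimiter):
        for cp in page.get("CommonPrefixes", []):
            prefixes.append(cp["Prefix"])  # e.g. Images/project-12345/sub-abc/
    return prefixes

for p in list_sub_prefixes("images-bucket", "Images/project-12345/"):
    print(p)
```

The idea would be to drive the rest of the flow from those prefixes rather than hand-configuring one ListS3 per prefix.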

Are there best practices for running ListS3 against established buckets with lots of keys/objects? I know the AWS crowd likes SNS/SQS, but that isn't an option here because this is an on-premises S3-compatible platform.
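
For the one-time backfill of the existing objects, the fallback I'm considering is scripting the full listing outside NiFi and handing the keys to the flow to fetch, along these lines (sketch only; paths and names are placeholders):

```python
import boto3

# One-off backfill: dump every key under the top-level prefix to a
# newline-delimited file the flow can split and fetch, instead of waiting
# for ListS3 to build state over millions of objects.
s3 = boto3.client("s3", endpoint_url="https://my-onprem-s3.example.com")

with open("/data/nifi/input/backfill-keys.txt", "w") as out:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="images-bucket", Prefix="Images/"):
        for obj in page.get("Contents", []):
            out.write(obj["Key"] + "\n")
```

Is splitting a file of keys like that and fetching each one with something like FetchS3Object a reasonable way to seed the index, or is there a better-established pattern?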

Any help appreciated,

Charlie
