Created 02-28-2023 01:56 AM
We have around 6k tables in HDFS, all stored as sub-directories under one parent directory. At the moment we pass the root directory to the ListHDFS processor with Recurse Subdirectories set to true, and then use a RouteOnAttribute processor to filter out the relevant tables by path name. The problem is that ListHDFS takes a long time just to produce the listing, because it has to walk all 6k directories. Is there any way to pass only the required directories to ListHDFS, or is there another workaround for this?
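For context, the path-based filtering described above is typically done with a dynamic property on RouteOnAttribute using the NiFi Expression Language. A minimal sketch, assuming ListHDFS emits the standard `path` attribute and using hypothetical table directory names `B` and `C`:

```
# RouteOnAttribute dynamic property (e.g. route named "wanted-tables"):
${path:matches('/A/(B|C)(/.*)?')}
```

Note this filters flowfiles *after* the listing has already been produced, which is why it does not reduce the ListHDFS listing time.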
Created 02-28-2023 03:58 AM
From the details provided, it's not clear what the subdirectory structure looks like immediately under the root directory.
One approach is to use multiple ListHDFS processors, each configured with an immediate subdirectory of the root and Recurse Subdirectories set to true.
The flow would look like: multiple ListHDFS processors (Primary node only) ---> all connected to one FetchHDFS (All Nodes), with load balancing enabled on the connection.
If this response helped resolve your issue, please take a moment to log in and click "Accept as Solution" below this post.
Created 02-28-2023 05:16 AM
The folder structure is like /A/B, /A/C, /A/D, ... Here A is the root directory and there are 6000 subdirectories. We only need to read from 200 specific subdirectories. Creating 200 ListHDFS processors seems cluttered. Is there any workaround for this?
Created 02-28-2023 05:47 AM
Thank you for the updated information.
Since you are not listing from every sub-directory under the root, you can use a regex pattern that matches only the directories of interest. The Directory property supports the NiFi Expression Language.
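As a sketch of the pattern-matching idea, the following shows how a single anchored regex with an alternation can select only the wanted sub-directories out of a larger listing. The directory names `B` and `C` are hypothetical stand-ins for the ~200 tables of interest:

```python
import re

# Hypothetical full paths, as a recursive HDFS listing might return them.
paths = [
    "/A/B/part-00000.parquet",
    "/A/C/part-00000.parquet",
    "/A/Z/part-00000.parquet",
]

# One alternation covering only the wanted sub-directories, anchored at the root.
# For ~200 tables the alternation would be generated, e.g. "|".join(table_names).
wanted = re.compile(r"^/A/(B|C)(/.*)?$")

matched = [p for p in paths if wanted.match(p)]
print(matched)  # only paths under /A/B and /A/C
```

The same kind of pattern can be used in a NiFi file-filter regex property so the filtering happens at listing time rather than downstream in RouteOnAttribute.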
Created 02-28-2023 08:03 AM
That should help. Could you give us an example of how to apply it?
Created 03-05-2023 11:24 PM
@FROZEN2, if the reply has resolved your issue, can you please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future?