Passing a list of directories to the ListHDFS processor

New Contributor

We have around 6,000 tables in HDFS. All of these tables are stored as subdirectories under one parent directory. Currently we pass the root directory to the ListHDFS processor with Recurse Subdirectories set to true, and then use a RouteOnAttribute processor to filter out the relevant tables by path name. The problem is that ListHDFS takes a long time just to produce the listing, because it has to go through all 6,000 directories. Is there any way to pass only the required directories to the ListHDFS processor, or is there a workaround for this?

5 REPLIES

Master Collaborator

From the details provided, it's not clear what the subdirectory structure looks like immediately below the root directory.

One approach is to use multiple ListHDFS processors, each configured with an immediate subdirectory of the root and with Recurse Subdirectories set to true.

The flow would look like this: multiple ListHDFS processors (running on the primary node) ---> all connected to one FetchHDFS processor (running on all nodes), with load balancing enabled on the connection.

 

If this response helped with your issue, please take a moment to log in and click "Accept as Solution" below this post.

Thank you,

Chandan

New Contributor

The folder structure is like /A/B, /A/C, /A/D, ... Here A is the root directory and there are 6,000 subdirectories under it. We only need to read from 200 specific subdirectories. Creating 200 ListHDFS processors would clutter the flow. Is there any workaround for this?

Master Collaborator

Thank you for the updated information.

So if you are not listing every subdirectory under the root directory, you can use a regex pattern that matches only the directories you are interested in; the Directory property supports Expression Language.
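As a rough illustration of the regex idea, here is a minimal Python sketch that builds one alternation pattern from a list of wanted subdirectory names and checks which paths it accepts. The root `/A` and the names `B` and `D` come from the example layout in this thread; the helper names `build_dir_regex` and `is_wanted` and the sample file name `part-0000` are hypothetical. This only demonstrates the pattern itself. Whether ListHDFS applies a filter regex to bare names or to full paths depends on your NiFi version and on properties such as File Filter and File Filter Mode, so treat that part as an assumption and test the pattern against a small directory first.

```python
import re


def build_dir_regex(root, wanted):
    """Build a regex that matches root/<wanted name> and anything below it.

    root   -- parent directory, e.g. "/A"
    wanted -- iterable of subdirectory names to keep, e.g. ["B", "D"]
    """
    # Escape each name so characters like "." or "+" are matched literally.
    alternation = "|".join(re.escape(name) for name in sorted(wanted))
    prefix = re.escape(root.rstrip("/"))
    # (?:/.*)? allows the directory itself or any path underneath it.
    return re.compile("^" + prefix + "/(?:" + alternation + ")(?:/.*)?$")


# Hypothetical selection: keep only tables B and D under /A.
pattern = build_dir_regex("/A", ["B", "D"])


def is_wanted(path):
    """Return True if the path lies in one of the selected subdirectories."""
    return pattern.match(path) is not None


if __name__ == "__main__":
    for sample in ["/A/B/part-0000", "/A/C/part-0000", "/A/D", "/A/BB/part-0000"]:
        print(sample, is_wanted(sample))
```

With 200 directories the alternation gets long, so generating it from a maintained list of table names (as above) is usually easier than writing it by hand; `pattern.pattern` gives the resulting string to paste into a property.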

 

Thank you 

New Contributor

That should help. Could you give us an example of how to apply it?

Community Manager

@FROZEN2, if the reply has resolved your issue, can you please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future? 



Regards,

Vidya Sargur,
Community Manager

