
(Apache NiFi) - ListSMB + FetchSMB don't work as expected

Explorer

Quoting directly from the official documentation for FetchSMB: "Fetches files from a SMB Share. Designed to be used in tandem with ListSmb."

My workflow is pretty simple: ListSMB reads from a shared network directory and is connected to FetchSMB, which should get one file (the processor forces you to put in the path of a specific file or it won't work), followed by PutHDFS to write the files to a distributed file system (Hadoop).
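In short, the flow looks like this:

    ListSMB   (lists the shared network directory)
      -> FetchSMB  (Remote File pointing at one specific file)
      -> PutHDFS   (writes to HDFS)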

  • I can't understand why, whenever I use List and Fetch in tandem, the pipeline takes all the files it can find in the directory specified in the List processor and writes them all to HDFS, instead of writing only the single file I specified in the Fetch. The Fetch, directly connected to PutHDFS, should pass only a single file. What am I missing here? What is the purpose of the Fetch if it is "bypassed" by the List processor?
  • On top of that, every file written to HDFS picks up the same weight (KB or MB) as the single file I specified in the Fetch, resulting in many files being broken/corrupted.

[Screenshot: JohnSilver_0-1691481861647.png]

As you can see from the picture above, they SHOULD NOT all have the same weight.

I thought that List + Fetch were used primarily to move big data, but then I ask myself: why does the Fetch processor want me to indicate the path of a specific file and then ignore it?

 

I didn't find the documentation helpful at all in this regard.

Thank you and have a nice day.


EDIT:

Added some screenshots for clarity.

 

Dataflow (pretty simple):

[Screenshot: JohnSilver_0-1691497263825.png]

 

ListSMB:

(I am using "no tracking" just to make testing easier and faster. In prod we will have to set up a tracking strategy.)

[Screenshot: JohnSilver_1-1691497369343.png]

 

FetchSMB:

[Screenshot: JohnSilver_2-1691497486635.png]

 

PutHDFS:

[Screenshot: JohnSilver_3-1691497588154.png]


This is a sample screenshot of the queue between the List and Fetch processors:

[Screenshot: JohnSilver_4-1691497780761.png]


This is another screenshot of the queue between the Fetch and PutHDFS processors.

As you can see, the "file size" is over 60 MB, but it's just a .png, which should be no more than 300 KB. 60 MB is the size of the file specified in the properties of the Fetch processor (20211021_PL_GIO.zip).

[Screenshot: JohnSilver_5-1691498302374.png]

 

I hope that it's better now! If you need other info or screenshots just let me know.

 

Have a nice day.

1 ACCEPTED SOLUTION

Super Mentor

@JohnSilver 

You are not using the processors correctly, which is what is causing your issue.
The "List<type>" processors are designed to optimize distributed processing of files from sources that may not be NiFi-cluster friendly. They simply list the contents of the target input directory and produce one 0-byte FlowFile per listed file, with metadata about that file stored as attributes.
Example of what FlowFile Attributes are created by the ListSMB processor:

[Screenshot: MattWho_0-1691512504441.png]
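For illustration only (this is a sketch, not an exhaustive or authoritative list; "path" and "filename" are the two attributes FetchSMB's default configuration relies on, as described below), a FlowFile produced for one listed file might carry attributes like:

    filename = 20211021_PL_GIO.zip
    path     = <sub-directory of the file within the share>
    size     = <size of the listed file, in bytes>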

The List processors are intended to be configured to execute on the "primary node" only; this prevents all nodes in a NiFi cluster from listing the same files.

These 0-byte FlowFiles can then be distributed/load-balanced across all the other nodes in the cluster using the load-balancing configuration available on a NiFi connection.
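A minimal sketch of that connection configuration (setting names as they appear in the connection's Settings dialog):

    Connection: ListSMB -> FetchSMB
      Load Balance Strategy:    Round robin
      Load Balance Compression: Do not compress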

Finally, these now-distributed 0-byte FlowFiles are sent to the Fetch<type> processor, which should be configured to use the metadata/attributes on each FlowFile to retrieve the corresponding content and insert it into the FlowFile (after the Fetch<type> processor, the FlowFile size will no longer be 0 bytes).

 

Where you have misconfigurations:
1. You configured ListSMB with "no tracking", which means it retains no state about what was previously listed. Without state, every execution of ListSMB will list the same source files over and over again.
2. Each listed file becomes its own 0-byte FlowFile with a number of added attributes, which you then pass to your FetchSMB processor. The default value of the "Remote File" processor property is "${path}/${filename}", which takes the values of the FlowFile attributes "path" and "filename" to fetch the content for that FlowFile. You have instead misconfigured this property to always fetch the same content no matter which FlowFile is being processed, so you are inserting the same content into every one of your unique FlowFiles (each listed FlowFile keeps its own filename attribute). That is why you see the same weight/size for all your fetched FlowFiles (see the comparison just below).
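To make the contrast concrete (the hard-coded value below is a stand-in for whatever path you actually entered; only the filename 20211021_PL_GIO.zip is taken from your screenshots):

    Remote File = ${path}/${filename}                      <- default: each FlowFile fetches its own listed file
    Remote File = <some-fixed-path>/20211021_PL_GIO.zip    <- hard-coded: every FlowFile gets this one file's content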

You might try configuring your ListSMB processor's Input Directory to point at the full path of the single file you want to list. I have not tried this, as the intended usage is listing everything within a target directory. If that does not work, you could use a RouteOnAttribute processor to route only the FlowFile with the specific filename you are looking for to FetchSMB, and terminate the unmatched relationship.
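A sketch of that RouteOnAttribute configuration (the dynamic property name "target-file" is made up for this example; the filename comes from your screenshots):

    RouteOnAttribute
      Routing Strategy: Route to Property name
      target-file = ${filename:equals('20211021_PL_GIO.zip')}

FlowFiles matching the expression go to the "target-file" relationship (connect that to FetchSMB); everything else goes to "unmatched".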

Also keep in mind that your ListSMB is, by default, going to re-list the same file(s) over and over, because you have it configured with "no tracking" and the default Run Schedule of "0 sec" (which schedules the processor to execute as fast/often as possible).
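Roughly, the combination at play (exact property names per the ListSMB documentation) is:

    ListSMB
      Scheduling -> Run Schedule: 0 sec             (run as often as possible)
      Properties -> Listing Strategy: No Tracking   (no state kept; every run re-lists everything)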

Your PutHDFS appears to be working as expected; your issue seems to lie purely in your upstream configuration.

 

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped.

Thank you,

Matt


REPLIES


@JohnSilver A few pieces of advice: in the future, post a screenshot of the flow and screenshots of the important processor configurations. You can also share a flow definition file on GitHub, along with sample data, so we can see what you see. Without those, some advice is below.

 

It sounds like ListSMB is returning a list of all results (files and/or paths). Can you share that output? (Run ListSMB once, inspect the FlowFile, and share it here in a code block.) Without seeing it, I think another processor should iterate through the contents of that list FlowFile and fetch each file/path independently.

 

Explorer

Hey Steven! Thanks for your reply.

 

I've added the dataflow and processor configuration screenshots to my original post above (see the EDIT). If you need other info or screenshots just let me know.

Have a nice day.


@JohnSilver Open the queue between List and Fetch and look at the content of a FlowFile (0.00 size). Use the (i) or the (👁) in the list to go deeper. Also inspect the Attributes tab attached to the FlowFile, looking for file names, path, etc. Those bits need to be passed to FetchSMB as attributes, and then the right file will be fetched for PutHDFS.

Explorer

Hey Steven! Thank you again for your reply. 

As @MattWho said in his reply, it looks like I was misusing the List/Fetch tandem. If the source directory has more than one file to be listed, linking the List processor directly to the Fetch instructs the Fetch processor to produce N files (with N being the number of files in the source directory), all having the same content (the content of the file specified in the Fetch processor's properties) but the attributes of the original files. I thought that the Fetch processor would do a sort of "filtering" by means of the "path" and "filename" variables in its properties, but in reality it does not filter anything.

The documentation does not cover this potential pitfall. On the contrary, to be honest, it seems to suggest using them linked together.

Anyway, thanks for your advice; I will follow the workflow you suggested whenever I face other issues or bugs.

Have a nice day!


Explorer

Hey Matt!

I am truly grateful for your answer.

You made me realize what I was doing wrong, and I learned a lot while reading through your explanation.

I am going to review the documentation again, as I realize I did not have some concepts crystal clear before.

Thanks again and I wish you a good day.