FetchSMB not fetching all files

Contributor

Hi

I am using ListSMB, which correctly lists all the files on the source system.

However, FetchSMB does not fetch all of them; some files are skipped with the error below.

  1. Could this be due to the // in the file path?
  2. Is it due to some issue with the SMB server settings?

FetchSmb[id=03ab6b24-019d-1000-0000-000032e5cf1d] Could not fetch file SDM-Prod-Reports/EOS_TFC_Metrics//EOS_TFC_Metrics.csv.: java.io.IOException: Could not create session for share smb://ifiler-smb03:445/SFTP
- Caused by: com.hierynomus.smbj.common.SMBRuntimeException: com.hierynomus.protocol.transport.TransportException: java.util.concurrent.ExecutionException: com.hierynomus.smbj.common.SMBRuntimeException: java.util.concurrent.TimeoutException: Timeout expired
- Caused by: com.hierynomus.protocol.transport.TransportException: java.util.concurrent.ExecutionException: com.hierynomus.smbj.common.SMBRuntimeException: java.util.concurrent.TimeoutException: Timeout expired
- Caused by: java.util.concurrent.ExecutionException: com.hierynomus.smbj.common.SMBRuntimeException: java.util.concurrent.TimeoutException: Timeout expired
- Caused by: com.hierynomus.smbj.common.SMBRuntimeException: java.util.concurrent.TimeoutException: Timeout expired
- Caused by: java.util.concurrent.TimeoutException: Timeout expired

 

Current version: 2.4.0 (tagged rel/nifi-2.4.0)

Thanks
4 REPLIES

Master Mentor

@nisaar 

The ListSMB processor only retrieves metadata about the files in the target SMB location. For each file found, it creates a 0-byte NiFi FlowFile carrying metadata that the FetchSMB processor can later use to fetch the content. The List<type> and Fetch<type> processors are split this way so that one node in a multi-node NiFi cluster is not doing all the heavy work. The List<type> processor should be configured to run on the "Primary Node" only, and its success relationship connected to FetchSMB via a connection. That connection then needs to be configured to load balance the 0-byte FlowFiles across all your NiFi nodes, so that each node fetches a fair share of the content and processes a fair share of this dataflow's workload.
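To make the idea concrete, here is a small conceptual sketch in Python of what round-robin distribution of listed FlowFiles across nodes looks like. It is only an illustration and not how NiFi implements load balancing; the node names and file names are hypothetical. In NiFi itself no code is needed, since this is set through the Load Balance Strategy on the connection.

```python
from itertools import cycle


def distribute_round_robin(flowfiles, nodes):
    """Assign each listed FlowFile to a node in turn (round-robin)."""
    assignment = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment


# Hypothetical listing produced on the primary node.
listed = [f"report_{i}.csv" for i in range(7)]
plan = distribute_round_robin(listed, ["node1", "node2", "node3"])
for node, files in plan.items():
    print(node, files)
```

With the listing done once on the primary node and the fetch work spread out like this, no single node has to pull every 100 MB+ file by itself.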

What is the difference between the files that fail on content fetch and those that succeed?

  • Are the failing files larger, resulting in a timeout exception?
  • Are the ones that time out always being fetched by one specific node in your NiFi cluster? Have you verified that all nodes can successfully connect to the SMB server?

Have you tried increasing the timeout set in the SmbjClientProviderService used by the SMB processors? Try setting it to 60 seconds or higher to see whether the failed files can then be fetched from SMB successfully.
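For the connectivity question above, a quick check you could run on each NiFi node is a plain TCP connect to the SMB port. The following is a minimal sketch in Python; the function name `can_connect` is my own, and it only tests that the port is reachable within the timeout. It does not perform an SMB session setup or authentication, so a successful result does not rule out server-side session problems.

```python
import socket


def can_connect(host, port=445, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and applies the timeout to the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and connect timeouts.
        return False
```

For example, calling `can_connect("ifiler-smb03", 445)` (the host and port from the error message) on every node, several times over a period, may show whether one node or one time window is consistently failing.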

 

Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

Contributor

Thanks for your reply.

Files are around 100 MB to 200 MB.

Usually the files listed and fetched are done by Primary node itself.

One more observation: it is not always the same file that fails. A file that failed to fetch works fine at other times.

So my understanding is that it has nothing to do with the file; the session is timing out for some reason.

After setting a timeout of 60 seconds in SmbjClientProviderService, I got the error below:
FetchSmb[id=03ab6b24-019d-1000-0000-000032e5cf1d] Could not fetch file SDM-Prod-Reports/EOS_TFC_Metrics//EOS_TFC_Metrics.csv.: java.io.IOException: Could not create session for share smb://ifiler-smb03:445/SFTP
- Caused by: com.hierynomus.smbj.common.SMBRuntimeException: com.hierynomus.protocol.transport.TransportException: Cannot write SMB2_SESSION_SETUP with message id << 11 >> as transport is disconnected
- Caused by: com.hierynomus.protocol.transport.TransportException: Cannot write SMB2_SESSION_SETUP with message id << 11 >> as transport is disconnected

Thanks

 

Master Mentor (accepted solution)

@nisaar 

The exception indicates an initial connection issue: the client failed to complete the connection to the server. This would be a network or server-side issue, not a client (ListSMB/FetchSMB) issue.

"Usually the files listed and fetched are done by Primary node itself"

This statement is not clear. What does "usually" mean? The ListSMB processor should be configured to execute on the "Primary node" only, to prevent multiple nodes in your NiFi cluster from listing the same files multiple times. If the ListSMB processor is configured for "primary node" execution and you are seeing FlowFiles specific to this flow being listed on different nodes, then the node elected as primary node is changing. I'd suggest taking a closer look at the logs or node events via the NiFi UI to see why the cluster coordinator role is changing nodes.

Maybe you are experiencing some long stop-the-world garbage collection pauses (these could lead to timed-out connections). Maybe your primary node's core load average is exceptionally high as well, since you are not distributing the workload across all your nodes, or you have concurrent tasks set too high.

  1. How many concurrent tasks do you have configured on the FetchSMB processor?
  2. Have you inspected the SMB server logs at the times of these failed connections for any errors or events during these connection attempts?
  3. How many nodes are in your NiFi cluster? Is there a reason that you are not using load balancing on the connection between ListSMB and FetchSMB, so that all your NiFi cluster nodes share the work of fetching the content and processing it?
  4. Since this is an intermittent failure, have you built retry into your design? You can set "retry" on the failure relationship, which will trigger NiFi to re-queue the failed FlowFile so it is retried a configurable number of times before finally being routed to the connection containing the "failure" relationship.
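The retry behavior described in point 4 can be pictured with a small Python sketch. This is a conceptual model only, not NiFi code: the function name, the `max_retries` and `penalty_seconds` parameters, and the flaky fetch are all hypothetical, and the sleep merely stands in for NiFi penalizing the FlowFile between attempts.

```python
import time


def process_with_retry(fetch, max_retries=2, penalty_seconds=0.0):
    """Call fetch(); on failure retry up to max_retries times, then give up.

    Returns ("success", attempts) or ("failure", attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            fetch()
            return "success", attempts
        except IOError:
            if attempts > max_retries:
                # Retries exhausted: only now is the item routed to failure.
                return "failure", attempts
            # Wait before the next attempt (stands in for penalization).
            time.sleep(penalty_seconds)


# Hypothetical fetch that times out once, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("Timeout expired")


print(process_with_retry(flaky_fetch))  # ('success', 2)
```

The point of the design is that a transient timeout is absorbed by a later attempt, and only a FlowFile that fails on every attempt reaches the "failure" connection.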

 


Thank you,
Matt

Contributor

Sorry for the delayed response.

We were able to more or less resolve the issue by adding a retry on the ListSMB and FetchSMB processors:

 

Number of Attempts: 2
Retry Back Off Policy: Penalize
Retry maximum backoff period: 1 minute

 

To test that it works, we have scheduled it to run every 30 minutes.

However, we are observing that whenever a retry happens, the scheduler does not run at the scheduled time. We are not sure how the retry is affecting the scheduler.

 

Thanks!