Why does NiFi GetFile work with KeepSourceFile=true but not when KeepSourceFile=false

avatar
Guru

I have a simple flow of GetFile -> PutHDFS. The flow works when KeepSourceFile=true. I then turn the processors off, empty the target directory in HDFS, reconfigure GetFile identically except for KeepSourceFile=false, and turn them back on. The files are in their local source directory with full 777 permissions but never get read by GetFile. The scheduler for each processor is Timer at 10 s. This is NiFi installed on the sandbox. Any ideas on why it is not working?

8 REPLIES

avatar

Hello

Please take a look in logs/nifi-app.log. There should be errors. It sounds like NiFi might not be able to delete the files (perhaps because of permissions on the directory itself). If there is nothing interesting in the logs, try updating your conf/logback.xml by adding this line alongside the other similar-looking lines:

<logger name="org.apache.nifi.processors.standard.GetFile" level="DEBUG"/>

Thanks, Joe

avatar

@gkeys it may be helpful to enable DEBUG-level logging, configured in $NIFI_HOME/conf/logback.xml.

In that file, the fully-qualified class name for the processor for which logging should be enabled can be specified. For example,

<logger name="org.apache.nifi.processors.standard.GetFile" level="DEBUG"/>

would enable DEBUG level logging for every GetFile in your flow.

avatar
Guru

@jwitt @slachterman

when config is KeepSourceFile=false I get

2016-07-15 19:49:47,338 INFO [StandardProcessScheduler Thread-6] o.a.n.c.s.TimerDrivenSchedulingAgent Scheduled GetFile[id=fdb1f403-f4df-446d-bf81-6732f02fc909] to run with 1 threads
2016-07-15 19:49:47,339 DEBUG [Timer-Driven Process Thread-9] o.a.nifi.processors.standard.GetFile GetFile[id=fdb1f403-f4df-446d-bf81-6732f02fc909] has chosen to yield its resources; will not be scheduled to run again for 10 seconds
2016-07-15 19:50:07,341 DEBUG [Timer-Driven Process Thread-6] o.a.nifi.processors.standard.GetFile GetFile[id=fdb1f403-f4df-446d-bf81-6732f02fc909] has chosen to yield its resources; will not be scheduled to run again for 10 seconds
2016-07-15 19:50:17,342 DEBUG [Timer-Driven Process Thread-9] o.a.nifi.processors.standard.GetFile GetFile[id=fdb1f403-f4df-446d-bf81-6732f02fc909] has chosen to yield its resources; will not be scheduled to run again for 10 seconds

when it is true I get

2016-07-15 19:52:44,497 INFO [StandardProcessScheduler Thread-4] o.a.n.c.s.TimerDrivenSchedulingAgent Scheduled GetFile[id=fdb1f403-f4df-446d-bf81-6732f02fc909] to run with 1 threads
2016-07-15 19:52:44,504 INFO [Timer-Driven Process Thread-6] o.a.nifi.processors.standard.GetFile GetFile[id=fdb1f403-f4df-446d-bf81-6732f02fc909] added StandardFlowFileRecord[uuid=7da0d589-4a97-4d93-9ecb-a5d22c7d520c,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1468610929062-156, container=default, section=156], offset=302824, length=152756],offset=0,name=20160708-233120.tsv,size=152756] to flow
2016-07-15 19:52:44,506 INFO [Timer-Driven Process Thread-6] o.a.nifi.processors.standard.GetFile GetFile[id=fdb1f403-f4df-446d-bf81-6732f02fc909] added StandardFlowFileRecord[uuid=b59f416c-7cc8-4529-adb7-92612f6e5f6e,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1468610929062-156, container=default, section=156], offset=455580, length=150068],offset=0,name=20160709-092708.tsv,size=150068] to flow
2016-07-15 19:52:54,511 INFO [Timer-Driven Process Thread-1] o.a.nifi.processors.standard.GetFile GetFile[id=fdb1f403-f4df-446d-bf81-6732f02fc909] added StandardFlowFileRecord[uuid=186c9e29-b80a-48d2-8747-94a74cfd2aee,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1468610929062-156, container=default, section=156], offset=605648, length=152756],offset=0,name=20160708-233120.tsv,size=152756] to flow
2016-07-15 19:52:54,512 INFO [Timer-Driven Process Thread-1] o.a.nifi.processors.standard.GetFile GetFile[id=fdb1f403-f4df-446d-bf81-6732f02fc909] added StandardFlowFileRecord[uuid=5b6683ec-f7eb-4724-a632-6abf09ae57a9,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1468610929062-156, container=default, section=156], offset=758404, length=150068],offset=0,name=20160709-092708.tsv,size=150068] to flow
2016-07-15 19:53:04,514 INFO [Timer-Driven Process Thread-4] o.a.nifi.processors.standard.GetFile GetFile[id=fdb1f403-f4df-446d-bf81-6732f02fc909] added StandardFlowFileRecord[uuid=8332e96a-c66a-42eb-a2f1-f6edaf88362c,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1468610929062-156, container=default, section=156], offset=908472, length=152756],offset=0,name=20160708-233120.tsv,size=152756] to flow
2016-07-15 19:53:04,515 INFO [Timer-Driven Process Thread-4] o.a.nifi.processors.standard.GetFile GetFile[id=fdb1f403-f4df-446d-bf81-6732f02fc909] added StandardFlowFileRecord[uuid=47a9cdbf-3820-40af-93d0-f12108b37010,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1468612384514-157, container=default, section=157], offset=0, length=150068],offset=0,name=20160709-092708.tsv,size=150068] to flow

The "what" makes sense. Any thoughts on the "why"?

avatar

Not sure just yet. Will take a look. The only time GetFile would yield, as it does in the log output you show for KeepSourceFile=false, is when it finds nothing in the listing.

avatar
Guru

Just to be clear -- the only change is the KeepSourceFile flag. The files are there and GetFile points to them identically in both cases.

avatar
Guru

Also, this is on the sandbox.

avatar

When told to keep source files where it finds them, GetFile will capture them even if it doesn't have write permission on the directory that contains them. However, when told to remove source files once pulled, it requires write permission on the directory it is pulling from, and when listing it will skip any files it lacks the necessary permissions for. We know the files are there, GetFile isn't pulling them in this case, and it is specifically yielding, which only happens when the listing attempt provides no valid results. So I strongly believe the parent directory permissions are not sufficient. Please verify.
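To make the rule above concrete, here is a minimal Python sketch of the kind of check a listing step could apply. It is not NiFi's actual code: the function name `eligible_for_pickup` and its parameters are made up for illustration, and it only mirrors the behavior described in this reply (read access on the file is always needed, write access on the parent directory only when the source file is to be removed).

```python
import os

def eligible_for_pickup(path, keep_source_file):
    """Hypothetical stand-in for the listing check described above."""
    # The file itself must be readable in either mode.
    if not os.access(path, os.R_OK):
        return False
    # Deleting a file modifies its parent directory, so when the source
    # is not kept the directory must be writable and searchable.
    if not keep_source_file:
        parent = os.path.dirname(os.path.abspath(path))
        if not os.access(parent, os.W_OK | os.X_OK):
            return False
    return True
```

Running something like this as the same OS user that runs NiFi, against one of the source files, should show whether the directory is the problem: a file that passes with `keep_source_file=True` but fails with `False` points at the parent directory's permissions. Note that a check like this always passes when run as root.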

avatar
Master Mentor

@gkeys

What are the permissions on the file(s) you are trying to pick up with the GetFile processor, and on the directory the file(s) live in?

-rwxrwxrwx 1 nifi dataflow 24B Jul 18 18:20 testfile

and

drwxr-xr-- 3 root dataflow 102B Jul 18 18:20 testdata

With the above example permissions, I can reproduce exactly what you are seeing. If "Keep Source File" is set to true, NiFi creates a new FlowFile with the content of the file. If "Keep Source File" is set to false, GetFile yields because it does not have the permissions needed to delete the file from the directory: the user who is trying to delete the file(s) needs the write bit on the source directory. In my example NiFi is running as user nifi, so it can read the files in the root-owned testdata directory because the directory's group is dataflow, the same group as my nifi user, and the group has r-x permissions. If I change the directory's permissions to rwx for the group, then my nifi user will also be able to delete testfile.
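The permission arithmetic for the testdata example can be worked through in a few lines of Python. This is only a sketch of the classic owner/group/other check: the helper `can_delete_entries` and the numeric uid/gid values are made up for illustration, and it ignores ACLs, the sticky bit, and anything else the real filesystem may enforce.

```python
import stat

def can_delete_entries(dir_mode, dir_uid, dir_gid, uid, gids):
    """True if the plain mode bits let this user remove entries from the directory."""
    if uid == 0:
        return True  # root normally bypasses these checks
    # Pick the one permission triplet that applies to this user.
    if uid == dir_uid:
        w, x = stat.S_IWUSR, stat.S_IXUSR
    elif dir_gid in gids:
        w, x = stat.S_IWGRP, stat.S_IXGRP
    else:
        w, x = stat.S_IWOTH, stat.S_IXOTH
    # Removing an entry needs both write and search (execute) on the directory.
    return bool(dir_mode & w) and bool(dir_mode & x)

# Made-up numeric ids standing in for root, nifi and the dataflow group.
ROOT, NIFI, DATAFLOW = 0, 1001, 2001

# drwxr-xr-- root dataflow: the group has r-x, so nifi cannot delete.
print(can_delete_entries(0o754, ROOT, DATAFLOW, NIFI, {DATAFLOW}))  # False
# drwxrwxr-- root dataflow: the group has rwx, so nifi can delete.
print(can_delete_entries(0o774, ROOT, DATAFLOW, NIFI, {DATAFLOW}))  # True
```

The 777 mode on testfile itself never enters the calculation, which is why full permissions on the files alone do not help when "Keep Source File" is false.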

Thanks, Matt