Created 01-12-2017 04:06 PM
Hi,
I need to know whether the file that Storm is currently writing to in HDFS is recognizable as an 'in flight' file. For instance, Flume marks in-flight files with something like <filename>.tmp. How does Storm handle this?
Maybe somebody already knows; I hope so, since then I don't have to build a test setup myself.
Edit: the final goal is to have a batch-oriented process pick up only completed/closed files.
Created 01-25-2017 12:02 PM
With the help of the remarks by @Aaron Dossett I found a solution to this.
Knowing that Storm does not mark the HDFS file it is currently writing to, and that .addRotationAction is not robust enough in extreme cases, I turned to a low-level solution.
HDFS can report the files on a path that are open for write:
hdfs fsck <storm_hdfs_state_output_path> -files -openforwrite
or, alternatively, you can list only the files that are NOT open on a path:
hdfs fsck <storm_hdfs_state_output_path> -files
The output is quite verbose, but you can use sed or awk to extract the closed/completed files from it.
(The Java HDFS API has similar hooks; this is just the CLI-level solution.)
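To illustrate the filtering step, here is a minimal sketch of how the verbose fsck output could be reduced to a list of closed files with awk. The sample output and paths below are made up for the example, and the line format is an assumption (real fsck output varies by Hadoop version); the idea is simply to keep per-file lines and drop any tagged OPENFORWRITE.

```shell
# Simulated output of: hdfs fsck <path> -files -openforwrite
# (format is an assumption; adjust the awk patterns to your Hadoop version)
fsck_output='/data/part-0001 1024 bytes, 1 block(s):  OK
/data/part-0002 0 bytes, 1 block(s), OPENFORWRITE:  OK
/data/part-0003 2048 bytes, 1 block(s):  OK'

# Keep only the per-file lines, drop those marked OPENFORWRITE,
# and print the path (first field) of each closed/completed file.
closed_files=$(printf '%s\n' "$fsck_output" \
  | awk '/bytes, .* block/ && !/OPENFORWRITE/ { print $1 }')

printf '%s\n' "$closed_files"
```

In a real pipeline you would replace the here-string with the live fsck command and feed the resulting paths to your batch process.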