
Storm HDFS Bolt question (Trident api)

Super Collaborator

Hi,

I need to know whether the HDFS file that Storm is currently writing to is recognizable as an 'in-flight' file. For instance, Flume marks in-flight files with a suffix such as <filename>.tmp (or something like that). How does Storm handle this?

Maybe somebody knows this offhand; I hope so, because then I don't have to build a test setup myself.

Edit: the final goal is to have a batch-oriented process pick up only completed/closed files.

1 ACCEPTED SOLUTION

Super Collaborator

With the help of the remarks by @Aaron Dossett I found a solution to this.

Knowing that Storm does not mark the HDFS file it is currently writing to, and that .addRotationAction is not robust enough in extreme cases, I turned to a low-level solution.

HDFS can report the files on a path that are open for write:

hdfs fsck <storm_hdfs_state_output_path> -files -openforwrite

or, alternatively, you can list only the NON-open files on a path (by default, fsck skips files that are open for write):

hdfs fsck <storm_hdfs_state_output_path> -files

The output is quite verbose, but you can use sed or awk to extract the closed/completed files from it.
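As a sketch, assuming fsck reports one line per file in its usual "path size bytes, N block(s): status" shape (the paths and sizes below are made up for illustration), an awk one-liner can keep only the closed files:

```shell
# Hypothetical sample of `hdfs fsck <path> -files -openforwrite` output;
# real paths, sizes, and spacing will differ.
fsck_output='/data/storm/part-0000.txt 1048576 bytes, 1 block(s):  OK
/data/storm/part-0001.txt 524288 bytes, 1 block(s), OPENFORWRITE:
/data/storm/part-0002.txt 2097152 bytes, 2 block(s):  OK'

# Keep only lines that describe a file (first field starts with /)
# and are NOT flagged OPENFORWRITE, then print just the path.
echo "$fsck_output" | awk '$1 ~ /^\// && $0 !~ /OPENFORWRITE/ {print $1}'
# prints:
# /data/storm/part-0000.txt
# /data/storm/part-0002.txt
```

The resulting list can then be fed to the batch process so it only ever sees completed files.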

(The Java HDFS API has similar hooks; the above is just the CLI-level solution.)
