Storm HDFS Bolt question (Trident API)
Labels:
- Apache Hadoop
- Apache Storm
Created 01-12-2017 04:06 PM
Hi,
I need to know whether the HDFS file that Storm is currently writing to is recognizable as an 'in-flight' file. For instance, Flume marks in-flight files with a suffix such as <filename>.tmp. How does Storm handle this?
Hopefully somebody already knows, so I don't have to build a test setup myself.
Edit: the end goal is to have a batch-oriented process pick up only completed/closed files.
Created 01-25-2017 12:02 PM
With the help of the remarks by @Aaron Dossett I found a solution to this.
Knowing that Storm does not mark the HDFS file currently being written to, and that .addRotationAction is not robust enough in extreme cases, I turned to a low-level solution.
HDFS can report the files on a path that are open for write:
hdfs fsck <storm_hdfs_state_output_path> -files -openforwrite
Alternatively, you can list only the NON-open files on a path, since by default fsck ignores files that are open for write:
hdfs fsck <storm_hdfs_state_output_path> -files
The output is quite verbose, but you can use sed or awk to extract the closed/completed files from it.
(The Java HDFS API has similar hooks; the above is just a CLI-level solution.)
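If you would rather query this from code than parse fsck output, here is a minimal Java sketch of that idea, assuming Hadoop 2.1+ and an output path that lives on HDFS. It filters a directory listing down to files the NameNode reports as closed, using DistributedFileSystem.isFileClosed(); the class name and path are illustrative placeholders, not anything from the original post.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch: list only the files under an output directory that HDFS
// reports as closed, i.e. not currently open for write.
public class ClosedFileLister {

    public static List<Path> listClosedFiles(Configuration conf, Path dir) throws IOException {
        FileSystem fs = dir.getFileSystem(conf);
        if (!(fs instanceof DistributedFileSystem)) {
            throw new IOException("isFileClosed() is only available on HDFS, got " + fs.getClass());
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        List<Path> closed = new ArrayList<>();
        for (FileStatus status : dfs.listStatus(dir)) {
            // isFileClosed() asks the NameNode whether the file is still
            // under construction; only closed files are safe for batch pickup.
            if (status.isFile() && dfs.isFileClosed(status.getPath())) {
                closed.add(status.getPath());
            }
        }
        return closed;
    }

    public static void main(String[] args) throws IOException {
        Path outputDir = new Path("/storm/hdfs-state"); // placeholder output path
        for (Path p : listClosedFiles(new Configuration(), outputDir)) {
            System.out.println(p);
        }
    }
}

This checks effectively the same under-construction signal that fsck's -openforwrite flag reports on, just queried per file.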