Can HdfsBolt append data to existing file?

Explorer

I want to write data to a new file if the file does not exist, and append data to the existing file if it does, using the Storm HDFS connector's HdfsBolt. How can I do this? I would appreciate any suggestions.

1 ACCEPTED SOLUTION

Master Guru

Files on HDFS are effectively immutable: existing contents cannot be modified in place. The HDFS bolt instead lets you control when data is synced and when files are rotated, for example: "After every 1,000 tuples it will sync filesystem, making that data visible to other HDFS clients. It will rotate files when they reach 5 megabytes in size."

So you can buffer up events until a specified interval has passed. Take a look at my Storm code on GitHub; you will see how that is done:

https://github.com/sunileman/storm-twitter-sentiment

6 REPLIES 6

Explorer

Thanks for the advice. By the way, the files will be named something like ddmmyyyy-hh. I want to group the logs by hour, and the number of events per second can vary, so the number of tuples and the file size cannot be determined in advance. How can I do it in this case?

Master Guru

So you're looking for windowing in Storm, i.e. doing something based on a specified time period. Until recently you had to build your own windowing logic in Storm by keeping track of time and using a disk cache to hold events until the window time had completed. Now the functionality comes out of the box. Take a look at this excellent article on how the new functionality works in Storm: https://community.hortonworks.com/articles/14171/windowing-and-state-checkpointing-in-apache-storm.h...
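To make the hourly grouping concrete, here is a minimal plain-Java sketch of how an event timestamp could be mapped to its one-hour window and to a file name in the ddmmyyyy-hh style from the question. It does not use Storm's windowing API or storm-hdfs. The class `HourlyBucket`, its method names, the `.log` suffix, and the choice of UTC are all assumptions made for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical helper (not part of Storm or storm-hdfs): maps an event
// timestamp to the hourly bucket it belongs to.
class HourlyBucket {
    // ddMMyyyy-HH: day, month, year, then hour on a 24-hour clock.
    private static final DateTimeFormatter NAME_FORMAT =
            DateTimeFormatter.ofPattern("ddMMyyyy-HH");

    // Start of the one-hour window containing the timestamp (epoch millis, UTC assumed).
    static long windowStartMillis(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .truncatedTo(ChronoUnit.HOURS)
                .toEpochMilli();
    }

    // File name for the hourly bucket; the ".log" suffix is an assumption.
    static String fileNameFor(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return NAME_FORMAT.format(t) + ".log";
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(fileNameFor(now) + " covers the window starting at "
                + windowStartMillis(now));
    }
}
```

Because the bucket depends only on the event time, the number of tuples per second and the resulting file size do not need to be known in advance.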

Explorer

Thank you for the advice. The new windowing feature seems able to solve my problem, but my only concern is whether it can hold one hour of data in memory. Is there an example of how to configure or implement the disk cache?

By the way, I found an example of appending to a file in HDFS:

http://stackoverflow.com/questions/32339602/append-to-file-in-hdfs-cdh-5-4-5

Can the CDH platform do the append?

Master Guru

If you are concerned about memory, you can persist the data to HDFS as it arrives and, once the window period is over, recombine all the persisted data and push it to your hourly location.
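The recombine step could look roughly like the sketch below. It uses `java.nio.file` on the local filesystem as a stand-in so that it runs without a cluster; on HDFS the same idea would go through the Hadoop FileSystem API instead. The class `HourlyMerger`, the part-file naming, and the directory layout are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: concatenates the part files written during one window
// into a single hourly file. Local filesystem only, as a stand-in for HDFS.
class HourlyMerger {
    // Merges every regular file in partsDir, sorted by name, into target.
    // Returns the number of part files merged. The target should live
    // outside partsDir so it is never picked up as a part.
    static int mergeParts(Path partsDir, Path target) throws IOException {
        List<Path> parts;
        try (Stream<Path> listing = Files.list(partsDir)) {
            parts = listing.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // append this part's bytes to the hourly file
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        Path parts = Files.createTempDirectory("parts");
        Files.write(parts.resolve("part-0001.txt"), "event a\n".getBytes(StandardCharsets.UTF_8));
        Files.write(parts.resolve("part-0002.txt"), "event b\n".getBytes(StandardCharsets.UTF_8));
        Path hourly = Files.createTempDirectory("hourly").resolve("15032016-13.log");
        System.out.println("merged " + mergeParts(parts, hourly) + " parts into " + hourly);
    }
}
```

With this approach only one part file's worth of data is held at a time, so memory use does not grow with the length of the window.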

Explorer

Thank you for the answer