Why are FSDataOutputStream's hsync and getPos functions out of sync?

I have an application that uses FSDataOutputStream to write data to HDFS.


In order to write that data I use FSDataOutputStream's hflush function. In order to obtain the number of bytes that have been written I use FSDataOutputStream's getPos function.


For some reason, after hflush has been called, getPos returns the wrong file size most of the time (sometimes it is correct).


My understanding is that if I call hflush and then getPos, the file size in HDFS should be equal (in bytes) to what getPos returns, but getPos usually returns something greater, as though part of the file were still stuck in some buffer and had not reached a physical disk.


I then read about the hsync function of FSDataOutputStream and started using it instead of hflush, because it guarantees that the data will not be left in a buffer and will be written to disk.

The problem still persists. It is much rarer now, but about 10% of the time, when I call hsync and then getPos, the file size in HDFS is less than what getPos returns.


Why is this happening and how can I synchronize getPos with hsync?


You have to call hsync with the SyncFlag.UPDATE_LENGTH argument, so that the file length recorded on the NameNode is updated as well.
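
As far as I understand it, a plain hflush or hsync pushes the data out to the DataNodes but does not update the length that the NameNode reports for a file that is still open, which is why getPos can be ahead of the file size you see. Here is a minimal sketch of the call, not a definitive implementation. It assumes the stream is actually backed by HDFS (so it can be cast to HdfsDataOutputStream), that the file size is being read through getFileStatus, and the path and record contents are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;

public class SyncLengthExample {
    public static void main(String[] args) throws Exception {
        // Assumption: fs.defaultFS in the loaded configuration points at an HDFS cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, used only for this example.
        Path path = new Path("/tmp/sync-length-example.txt");

        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("some record\n".getBytes(StandardCharsets.UTF_8));

            if (out instanceof HdfsDataOutputStream) {
                // Sync the data and also ask the client to update the
                // file length on the NameNode.
                ((HdfsDataOutputStream) out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
            } else {
                // Not an HDFS stream (for example the local file system):
                // only the plain hsync is available.
                out.hsync();
            }

            long written = out.getPos();                     // bytes written through this stream
            long visible = fs.getFileStatus(path).getLen();  // length reported for the file
            System.out.println("getPos=" + written + ", reported length=" + visible);
        }
    }
}
```

Updating the length means an extra call to the NameNode on each sync, so it is probably worth doing only at the points where you really need the reported size to match getPos.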
