I have an application that uses FSDataOutputStream to write data to HDFS.
To push the written data out to HDFS I call FSDataOutputStream's hflush function. To find out how many bytes have been written I call FSDataOutputStream's getPos function.
For some reason after hflush has been called, getPos returns the wrong file size most of the time (sometimes it is correct).
My understanding is that if I call hflush and then getPos, the file size in HDFS should be equal (in bytes) to what getPos returns. But whenever the two differ, getPos returns the larger value, as though part of the file is still stuck in some buffer and hasn't reached a physical disk.
I read about the hsync function of FSDataOutputStream. I started using hsync instead of hflush, because it guarantees that the data will not be buffered and will be written to disk.
But the problem still persists. It is much rarer now, but 10% of the time, when I call hsync and then getPos, the file size in HDFS is still less than what getPos returns.
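
For reference, here is a minimal sketch of the pattern described above, not the actual application code. The class name `HflushGetPosCheck`, the helper `writeAndCompare`, the default path, and the payload are all made up for illustration. It also assumes the "file size in HDFS" is read with `FileSystem.getFileStatus(path).getLen()`, which may not match how the size is really being checked. It needs the Hadoop client libraries on the classpath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushGetPosCheck {

    /**
     * Writes data to path, calls hflush(), and returns a two-element array:
     * [0] = position reported by the stream (getPos),
     * [1] = length reported by the file system for the same path.
     * Both values are read while the stream is still open.
     */
    static long[] writeAndCompare(FileSystem fs, Path path, byte[] data) throws IOException {
        try (FSDataOutputStream out = fs.create(path, true)) { // true = overwrite
            out.write(data);
            out.hflush(); // swap for out.hsync() to test the hsync variant

            long clientPos = out.getPos();
            // Assumption: this is how the size "in HDFS" is being measured.
            long reportedLen = fs.getFileStatus(path).getLen();
            return new long[] {clientPos, reportedLen};
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration(); // picks up fs.defaultFS from the classpath config
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical default path; pass a real one as the first argument.
        Path path = new Path(args.length > 0 ? args[0] : "/tmp/hflush-getpos-check.dat");
        byte[] data = "some record\n".getBytes(StandardCharsets.UTF_8);

        long[] result = writeAndCompare(fs, path, data);
        System.out.println("getPos=" + result[0] + " reportedLen=" + result[1]);
    }
}
```

Run against the cluster, the two printed numbers are the values being compared in this question; the mismatch is when `reportedLen` comes out smaller than `getPos`.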
Why is this happening and how can I synchronize getPos with hsync?