I have an application that uses FSDataOutputStream to write data to HDFS.
To write that data out, I call FSDataOutputStream's hflush function. To obtain the number of bytes that have been written, I call its getPos function.
For some reason, after hflush has been called, getPos returns the wrong file size most of the time (sometimes it is correct).
My understanding is that when I call hflush and then call getPos, the file size in HDFS should be equal (in bytes) to what getPos returns. But whenever the two differ, getPos returns the greater value, as though part of the file is still stuck in some buffer and hasn't reached a physical disk.
I read about the hsync function of FSDataOutputStream. I started using hsync instead of hflush, because it guarantees that the data will not be buffered and will be written to disk.
But the problem persists. It is much rarer now, but about 10% of the time, when I call hsync and then getPos, the file size in HDFS is less than what getPos returns.
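The sequence described above could be sketched roughly as follows. This is an illustration rather than the actual application code: the class name, path, payload, and loop count are hypothetical, and reading the HDFS file size through getFileStatus(path).getLen() is an assumption, since how the size is measured is not stated above. It needs the Hadoop client libraries on the classpath and a reachable cluster.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetPosCheck {
    public static void main(String[] args) throws IOException {
        // Assumption: fs.defaultFS in the loaded configuration points at the HDFS cluster.
        Configuration conf = new Configuration();
        // Hypothetical path and payload, for illustration only.
        Path path = new Path("/tmp/getpos-check.dat");
        byte[] record = "some record\n".getBytes(StandardCharsets.UTF_8);

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(path, true)) {
            for (int i = 0; i < 1000; i++) {
                out.write(record);
                out.hsync();                 // previously out.hflush()
                long pos = out.getPos();     // bytes written through this stream so far

                // Assumption: "file size in HDFS" is the length reported by the file status.
                long len = fs.getFileStatus(path).getLen();
                if (len != pos) {
                    System.out.printf("iteration %d: getPos=%d, HDFS length=%d%n", i, pos, len);
                }
            }
        }
    }
}
```

Each printed line marks an iteration where the two values disagree, which is the mismatch in question.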
Why is this happening, and how can I synchronize getPos with hsync?