We have an application that continuously writes CSV data into a directory in HDFS. In our scenario, the data keeps arriving, but each batch is small, so the application keeps the files open and keeps appending to them. After each batch, it does a hard flush to make the data visible in the files, which also increases the file size. As a result, we avoid creating too many small files, and with CDH 5.16.1 the Impala "Refresh" command made the latest data visible immediately.
However, after the cluster was upgraded to CDH 6.3.1, the Impala "Refresh" command stopped working. When a new file was created and some data was written to it, that data could be seen after refreshing. But for data that arrived afterwards, even though I could see it directly through HDFS commands and the file size had increased, I couldn't see it through an Impala "Select". Only after the file was closed (the application was terminated) could I see the latest data through the Impala "Refresh" and "Select".
The table is an external table partitioned by a timestamp column on a monthly basis.
According to the documentation for the "Refresh" command, it should pick up deleted, added, or modified files. Is this a bug?
By the way, I can see that the "invalidate metadata" SQL works, but it adds 3-4 seconds to the next "SELECT" statement. The "SELECT" itself usually takes only a few seconds, so 3-4 seconds of additional time is a noticeable performance hit.
I also tried "Alter table xxx recover partitions" and "alter table ... drop partition... / alter table ... add partition...", but with no luck.
Apart from the "invalidate metadata" SQL, is there a good way to work around this problem?
Created 03-27-2020 02:11 PM
Does the file's modification timestamp change before you close the file? I am also curious whether this approach worked in an older version, since that would make it easier to find what changed in the code.
Created 03-30-2020 09:35 AM
No, the modification timestamp does not change. But the approach worked with CDH 5.16. After upgrading to CDH 6.3, it stopped working.
By the way, our cluster is Kerberos enabled. It was upgraded from CDH 5.16.2 to CDH 6.3.2.
Created 03-31-2020 09:18 AM
You pointed me in the right direction. I added some code to update the modification time of the files in HDFS, and the "Refresh" SQL works now.
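For anyone who wants to try the same workaround, here is a minimal sketch of one way to bump a file's modification time, using the WebHDFS `SETTIMES` operation from Python. This is not the code I actually added to our application; the host name, port, and file path are placeholders, and on a Kerberos-enabled cluster like ours the request would also need SPNEGO authentication, which is not shown here. An application that already uses the Hadoop Java client could probably call `FileSystem.setTimes(path, mtime, atime)` after each flush instead.

```python
import time
import urllib.request
from urllib.parse import quote, urlencode


def build_settimes_url(namenode_http, hdfs_path, mtime_ms):
    """Build a WebHDFS SETTIMES URL that sets only the modification time.

    namenode_http is the NameNode HTTP address, e.g. "http://host:port"
    (placeholder values; use your own cluster's address).
    The access time parameter is omitted so it keeps its default.
    """
    query = urlencode({"op": "SETTIMES", "modificationtime": mtime_ms})
    return f"{namenode_http}/webhdfs/v1{quote(hdfs_path)}?{query}"


def touch_hdfs_file(namenode_http, hdfs_path):
    """Set the file's modification time to "now" (milliseconds since epoch).

    Assumes an unsecured WebHDFS endpoint; Kerberos/SPNEGO is not handled.
    """
    url = build_settimes_url(namenode_http, hdfs_path, int(time.time() * 1000))
    request = urllib.request.Request(url, method="PUT")
    with urllib.request.urlopen(request) as response:
        return response.status
```

The idea would be to call something like `touch_hdfs_file(...)` after each hard flush and then run the "Refresh" SQL on the table, so that Impala sees a changed modification time for the still-open file.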