We often drop and recreate Hive tables with new/recalculated data under the same table names. After a while, though, I found that fetching data from these tables breaks: for example, count(*) on such a table reports ~2k rows, while exporting it to CSV via beeline returns >100k rows, most of them NULLs at the end of the file. The problem usually does not occur when the same tables are recreated under different names, but creating hundreds of tables like table1, ..., table506 just to be able to keep working with the data is hardly a good solution.
So my question: is there a fix for this? How can we be sure that dropping the old tables also drops the data?
When a managed table is dropped, its data is cleaned up along with the metadata. For an external table, however, only the metadata is removed and the data files stay in place untouched. Could you please confirm whether you are using a managed or an external table?
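For reference, a quick way to check this from pySpark is to look at the Table Type and Location fields of DESCRIBE FORMATTED. This is just a minimal sketch; the table name is taken from the snippet further down, and the exact output layout of DESCRIBE FORMATTED can differ between Hive versions:

# Inspect the table's type and storage location via the metastore.
info = sqlContext.sql("DESCRIBE FORMATTED database.example_table_fromPy").collect()
for row in info:
    line = " ".join(str(field) for field in row if field is not None)
    # "Table Type" shows MANAGED_TABLE vs EXTERNAL_TABLE; "Location" is the
    # HDFS directory whose files should disappear when a managed table is dropped.
    if "Table Type" in line or "Location" in line:
        print(line)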
As far as I know, the table is not external; it is created by joining several other tables. What worries me is that dropping the table takes about 0.5 s, so it is not obvious that the data can really be deleted that quickly (to be honest, I haven't checked yet). Moreover, there were cases when repeating the same select * .. > output.csv export from beeline returned a different number of rows! The extra rows consisted of NULLs, which is why I suspect that leftover data files are not being cleaned up properly.
Lastly, some of the tables used in constructing the join are created in pySpark in the following manner:
sqlContext.sql('DROP TABLE IF EXISTS database.example_table_fromPy')
sqlContext.sql("CREATE TABLE database.example_table_fromPy STORED AS ORC as select * from pivoted_temp")
I'm wondering whether the file layout written by pySpark (or something along those lines) might be incompatible with Hive tables?
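In case it helps to narrow this down, here is a minimal sketch for checking whether old data files survive the drop/recreate cycle. The warehouse path below is an assumption (a typical default location); the real one can be taken from the Location field of DESCRIBE FORMATTED for this table:

import subprocess

# Hypothetical path -- replace with the Location reported by DESCRIBE FORMATTED.
table_dir = "/apps/hive/warehouse/database.db/example_table_fromPy"

# Run this listing before the DROP and again after the CREATE ... STORED AS ORC;
# if ORC files from the old table are still present, the extra NULL rows in the
# beeline export are likely coming from those stale files.
print(subprocess.check_output(["hdfs", "dfs", "-ls", table_dir]))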