We have an external Hive table that is used for processing raw log file data. The files arrive hourly, but the table is partitioned by date and source host name.
At the moment we are importing the files with a simple Python script that is triggered a few times per hour. The script creates subfolders on HDFS as needed, copies new files over from temporary local storage, and adds any new partitions to Hive.
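For reference, here is a minimal sketch of what our import script does (table name, HDFS paths, and partition column names are illustrative, not our real ones):

```python
import subprocess

TABLE = "raw_logs"            # illustrative table name
HDFS_ROOT = "/data/raw_logs"  # illustrative HDFS root for the external table

def partition_path(dt, host):
    """HDFS directory backing the (date, host) partition."""
    return "%s/dt=%s/host=%s" % (HDFS_ROOT, dt, host)

def add_partition_ddl(dt, host):
    """Hive DDL to register the partition; IF NOT EXISTS keeps it idempotent."""
    return ("ALTER TABLE %s ADD IF NOT EXISTS "
            "PARTITION (dt='%s', host='%s') LOCATION '%s'"
            % (TABLE, dt, host, partition_path(dt, host)))

def import_file(local_file, dt, host):
    target = partition_path(dt, host)
    # Create the partition directory and copy the new file into it.
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", target])
    subprocess.check_call(["hdfs", "dfs", "-put", local_file, target])
    # This is the step that fails when a long-running query holds the lock.
    subprocess.check_call(["hive", "-e", add_partition_ddl(dt, host)])
```

The HDFS steps always succeed; it is only the final `ALTER TABLE ... ADD PARTITION` that hits the lock.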
However, whenever a Hive query is running against the table, the table is locked, so adding new partitions fails (if the query runs long enough). This is causing us some headaches, as we have reports set up to be generated from the log files daily, and these end up empty when there are no partitions to read from.
Is there a way to add partitions to an external table without running into locking issues? I would prefer not to resort to "msck repair table ...".
Or are we going about the import entirely the wrong way? If so, what best practices should we be following?
P.S. We are using CDH 5.3.1 and not 4.0.x as the label indicates. Not sure why you are forced to choose a label when the correct one is not available.