When we insert data from staging table into a production table using dynamic partition inserts, the files created at the partition directory are like: 0000_0.
However, say, for a process where data is loaded on a daily basis, after the first data insertion in a partition, the file names are like 0000_0_copy_1 for the second day, 0000_0_copy_2 for the third day and so on...
I want to create a filename like so: partitionName_datestamp [ex. IND_20173107] so that it helps to maintain a logical and relevant file structure for any manual intervention needs.
I am aware that we can achieve this by executing a shell script after Hive jobs. But, can we control this from within Hive?
If the files are loaded into the same partition on a daily basis then by default in Hive it will be created as 0000_0_copy_1 and it goes on further. Because everytime a mapreduce job is executed its output will be a file with default name a 0000_0. If that file name exists already then it will be created as 0000_0_copy_1. As you said we cannot handle this is Hive as its how hive behaves by default. By if you wanted to segregate it based on the date of insertion then you could create the table with partition based on the date of insertion. So by that way you will be able to identify the records which are inserted on a daily basis. But keep in mind that the column should be date data type. If its in timestamp then it will end up in creating multiple partitions/directories which makes the metadata of the table huge and it will result in performance issues.
partitioning is not the question here. its the filenames itself.
How do i fix the filename at data insertion time?