I was trying a few things to understand Hive. I could see that each INSERT statement into a Hive table creates a new file in the table's backing directory, say /usr/hive/table1.
Each subsequent insert into the table then creates a file named 0000_0_copy1, 0000_0_copy2, and so on, with the last number incrementing. What is the maximum value of this number? Does it limit the number of inserts to a Hive table?
Is it possible to merge all these files into a single file when required (after a lot of files have been created)?
There is no limit per se, but too many small files are terrible for Hadoop, so you should avoid files smaller than a couple of MB. For one thing, the NameNode doesn't like too many small files; for another, I have seen Hive queries fail when they touch too many partitions and files, because the Tez graph with all the splits simply gets too big (tens of thousands of files or more).
The first way to avoid this is to group inserts together into bigger chunks. You can also use a CTAS (CREATE TABLE ... AS SELECT) once in a while to recreate the table, which rewrites all the data into as many files as your CTAS query has mappers/reducers. However, you would then need to delete the old table and rename the new one.
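To make both ideas concrete, here is a sketch in Hive SQL. The batched INSERT syntax requires Hive 0.14+, and the table name table1_compact is just an illustrative placeholder:

```sql
-- Batch several rows into one INSERT instead of one INSERT per row
INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- Periodically compact via CTAS, then swap the tables
CREATE TABLE table1_compact AS SELECT * FROM table1;
DROP TABLE table1;
ALTER TABLE table1_compact RENAME TO table1;
```

Note that the swap is not atomic, so readers can see the table missing between the DROP and the RENAME.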
For ORC files there is a statement, ALTER TABLE ... CONCATENATE, which merges small ORC files together; I don't think an equivalent exists for delimited files, for example.
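For example (the partition column dt and its value are illustrative; on a partitioned table you concatenate one partition at a time):

```sql
-- Unpartitioned ORC table
ALTER TABLE table1 CONCATENATE;

-- Partitioned ORC table: merge the files of a single partition
ALTER TABLE table1 PARTITION (dt='2016-01-01') CONCATENATE;
```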
And finally, you could use any kind of Pig/MapReduce job to read the files from HDFS and write them back out, but you would again need to recreate the table, delete the originals, and copy the new files in, so it's as intrusive as the CTAS approach.
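If you would rather stay inside Hive than write an external Pig/MapReduce job, a related option is to rewrite the table in place with INSERT OVERWRITE and Hive's small-file merge settings enabled (the setting names are from the Hive configuration; the average-size threshold here is an illustrative value, not a recommendation):

```sql
-- Ask Hive to merge small output files at the end of the job
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;

-- Rewrite the table's data with fewer, larger files
INSERT OVERWRITE TABLE table1 SELECT * FROM table1;
```

This avoids the drop/rename dance, but it still rewrites all the data, so run it off-peak.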
But the best thing to do is to avoid the small files from the beginning.
Looking at the code, copyId is limited to 6 digits, so you could have at most 999999 copies per bucket per partition. However, this is a theoretical maximum and you should not get anywhere near that many inserts. I have seen cases where people call INSERT for every single row, which is an anti-pattern.