I have some staging tables in HDFS. To improve performance, I generate new tables with coarser partitions (year/month) to reduce the number of files and increase the file size. A couple of the tables are change logs, so I also generate new tables that store only the latest version of each record from them.
What are some options for automating this approach? We probably need a nightly job that re-creates or updates the current month's partition and regenerates or updates the tables holding the latest data.
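For the change-log tables, what I mean by "latest" is roughly the following. This is only a sketch of the idea: `changelog_staging`, `changelog_latest`, and the columns `id`, `payload`, and `updated_at` are made-up names for illustration, not my real schema.

```sql
-- Hypothetical table and column names throughout.
-- Rank the rows for each key by recency and keep only the newest one.
INSERT OVERWRITE TABLE changelog_latest
SELECT id, payload, updated_at
FROM (
  SELECT id, payload, updated_at,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM changelog_staging
) ranked
WHERE rn = 1;
```

Today this rewrites the whole "latest" table on every run, which is the part I would like to automate or make incremental.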
Hive supports UPDATE from 0.14 onward, but with one condition: it only works on tables stored in ORC format (and the table has to be set up as transactional).
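As a rough sketch of what that looks like, assuming a hypothetical table `changelog_latest_acid` with made-up columns: as far as I know, in the 0.14 through 2.x releases the table also has to be bucketed, and the cluster needs the transaction settings enabled (for example `hive.support.concurrency=true` and `hive.txn.manager` set to the `DbTxnManager` class), so check your version before relying on it.

```sql
-- Hypothetical table; the bucket count of 8 is an arbitrary example value.
CREATE TABLE changelog_latest_acid (
  id BIGINT,
  payload STRING,
  updated_at TIMESTAMP
)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level update; only allowed because the table is transactional ORC.
UPDATE changelog_latest_acid
SET payload = 'new value'
WHERE id = 42;
```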
Many people migrate to Kudu for CRUD operations. Is that something you would consider?
CREATE TABLE sample_partitioning_table (
  -- Kudu requires the primary key columns to be declared first
  year INT,
  month INT,
  day INT,
  name STRING,
  age INT,
  PRIMARY KEY (year, month, day)
)
PARTITION BY RANGE (year, month) (
  PARTITION VALUE = (2017, 1),
  PARTITION VALUE = (2017, 2),
  PARTITION VALUE = (2017, 3)
)
STORED AS KUDU;
Otherwise, you basically run an ALTER TABLE to add a new partition and set it to the new location, but the old data needs to be dropped manually. It's just a pain in the back.
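A minimal sketch of that partition approach in Hive, assuming a hypothetical table `monthly_table` partitioned by year/month; the HDFS paths and the namenode address are made up for illustration.

```sql
-- Hypothetical table name, paths, and namenode address.
-- First time the month appears: register the partition at its directory.
ALTER TABLE monthly_table ADD IF NOT EXISTS PARTITION (year = 2017, month = 3)
LOCATION 'hdfs://namenode:8020/data/monthly_table/2017_03/build_1';

-- Later rebuilds: write the new files to a fresh directory, then repoint
-- the existing partition at it.
ALTER TABLE monthly_table PARTITION (year = 2017, month = 3)
SET LOCATION 'hdfs://namenode:8020/data/monthly_table/2017_03/build_2';
```

After the repoint, the previous directory (`build_1` in this sketch) is still sitting on HDFS and has to be removed separately, for example with `hdfs dfs -rm -r`, which is the manual step I mean.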
Thanks, and sorry if I was not clear.
My staging tables are partitioned by year/month/day and contain many small files, so I create new tables, partitioned by year/month, that are built from the staging tables. For the current month, I need to update the new tables to add the new data. My understanding is that this is not just appending: the new tables have 1-3 files in each partition, and I can't simply keep appending to one file, since at some point a new file might be needed. Is it better to drop the current month's partition and regenerate it from the staging table?
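To make the question concrete, the drop-and-regenerate option I have in mind would look something like this in Hive. The names are placeholders: `staging_table` stands for a table partitioned by year/month/day, `monthly_table` for one partitioned by year/month, and `name` and `age` are example columns.

```sql
-- Hypothetical table and column names; 2017/3 stands for the current month.
-- Replaces the whole current-month partition with a fresh build from staging;
-- day is kept as an ordinary column in the monthly table.
INSERT OVERWRITE TABLE monthly_table PARTITION (year = 2017, month = 3)
SELECT name, age, day
FROM staging_table
WHERE year = 2017 AND month = 3;
```

A nightly job would run this with the current year and month filled in, so only one partition is rewritten each night and the older months are left alone.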