Explorer
Posts: 99
Registered: ‎09-14-2016

Hive regenerate tables question

Hi,

 

I have some staging tables in HDFS. To improve performance, I generate new tables with coarser partitions (year/month) to reduce the number of files and increase the file size. A couple of the tables are change logs, so I also generate new tables that store only the latest data from them.

 

What are some options for automating this approach? We probably need a nightly job that re-creates/updates the current month's partition and regenerates/updates the tables with the latest data.

 

Thanks

Shannon

Champion
Posts: 459
Registered: ‎05-16-2016

Re: Hive regenerate tables question

Hive supports UPDATE only from 0.14 onward, and with one condition: it only works on tables stored in ORC format.

Many people migrate to Kudu for CRUD operations. Is that something you would consider?

 

Kudu 

CREATE TABLE sample_partitioning_table (
  year INT, month INT, day INT,  -- key columns must be declared first
  name STRING, age INT,
  PRIMARY KEY (year, month, day)
)
PARTITION BY RANGE (year, month) (
  PARTITION VALUE = (2017, 1),
  PARTITION VALUE = (2017, 2),
  PARTITION VALUE = (2017, 3)
)
STORED AS KUDU;
 

Hive

Basically, run an ALTER TABLE to add a new partition and set its location to the new data. The old data still has to be dropped manually, though, which is a pain.
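
A rough sketch of that swap, assuming a hypothetical table `events_monthly` and made-up HDFS paths (the namenode URI and directory names are placeholders, not anything from your cluster):

```sql
-- Hypothetical table and paths, for illustration only.
-- New month: register a partition that points at the freshly written directory.
ALTER TABLE events_monthly ADD IF NOT EXISTS
  PARTITION (year=2017, month=4)
  LOCATION 'hdfs://namenode:8020/data/events_monthly/2017-04_v1';

-- Existing month: repoint the partition at a rebuilt directory.
ALTER TABLE events_monthly PARTITION (year=2017, month=3)
  SET LOCATION 'hdfs://namenode:8020/data/events_monthly/2017-03_v2';

-- Hive does not remove the previous directory (e.g. .../2017-03_v1) here;
-- delete it yourself once nothing reads from it, e.g. with `hdfs dfs -rm -r`.
```

Writing each rebuild to a new directory and only then repointing the partition means readers never see a half-written month, at the cost of the manual cleanup.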

 

Explorer
Posts: 99
Registered: ‎09-14-2016

Re: Hive regenerate tables question

Thanks, and sorry if I was not clear.

 

My staging tables are partitioned by year/month/day and contain many small files, so I create new tables partitioned by year/month that are built from the staging tables. For the current month, I need to update the new tables to add the new data. My understanding is that this is not a simple append: the new tables have 1-3 files in each partition, and I cannot just keep appending to one file, since at some point a new file might be needed. Is it better to drop the current month's partition and regenerate it from the staging table?
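
To make the drop-and-regenerate option concrete, here is roughly what I have in mind as a nightly script. This is only a sketch: the table names (`events_staging`, `events_monthly`), the columns (`id`, `payload`), and the file name `refresh_month.hql` are all made up, and the year/month values are assumed to be passed in by whatever scheduler runs it.

```sql
-- refresh_month.hql: hypothetical names throughout.
-- yr and mo are supplied by the scheduler, e.g.
--   hive --hivevar yr=2017 --hivevar mo=3 -f refresh_month.hql

-- Rebuild the whole month's partition from staging.
-- INSERT OVERWRITE replaces the partition's existing files,
-- so no separate DROP PARTITION step is needed.
INSERT OVERWRITE TABLE events_monthly
  PARTITION (year=${hivevar:yr}, month=${hivevar:mo})
SELECT id, payload, day
FROM events_staging
WHERE year = ${hivevar:yr} AND month = ${hivevar:mo};
```

Since only the current month's partition is rewritten, older months keep their compacted files untouched.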

 

 

Thanks

Shannon
