07-12-2017 01:34 AM - edited 07-12-2017 01:35 AM
I also had this issue, but succeeded by using parquet-tools. There is an option called merge, and you are able to use it like this:
hadoop jar parquet-tools merge src-folder-with-parquet target-parquet-file
It also lets you give the target file a meaningful name ))
07-12-2017 02:46 PM
@barnoba we strongly recommend *not* using parquet-tools merge unless you really know what you're doing. It is known to cause some pretty bad performance problems in some cases. The problem is that it takes the row groups from the existing files and moves them unmodified into a new file - it does *not* merge the row groups from the different files. This can actually give you the worst of both worlds - you lose parallelism because the files are big, but you keep all the performance overhead of processing many small row groups.
09-26-2017 11:33 PM
how would you suggest doing the highlighted (red) parts with zero downtime? (= table1 partition_key=1 is always available)
create table table2 like table1;
insert into table2 select * from table1 where partition_key=1;
alter table table1 drop partition (partition_key=1);
insert into table1 partition(partition_key=1) select * from table2;
drop table table2;
09-28-2017 12:45 PM
You can do this:
insert overwrite table1 partition(partition_key=1) select * from table1 where partition_key=1;
This process should mostly work as you'd expect.
However, there are a few situations where this may cause problems:
- If you run concurrent "refresh" or "invalidate metadata" commands against that table/partition while the insert is in progress, some queries may see missing or duplicate data from that partition (fix by running a refresh after the insert completes).
- Do not run concurrent "insert overwrite" statements against the same partition. You may end up with missing/duplicate data in that partition.
If you can guarantee that the above two situations are not a problem for you, then insert overwrite should work just fine.
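Putting the insert and the refresh caveat together, a minimal compaction sequence for one partition could look like the sketch below. Note the assumptions: the table and partition names are the hypothetical ones from this thread, a dynamic partition clause is used so that select * (which includes the partition column) lines up with the target columns, and refresh with a partition clause requires a reasonably recent Impala version - a plain refresh of the whole table works too.

```sql
-- rewrite the partition in place; the data is fully read and re-written,
-- so the many small files are consolidated into fewer, larger ones
-- (dynamic partition clause: select * emits partition_key as the last column)
insert overwrite table1 partition (partition_key)
select * from table1 where partition_key = 1;

-- refresh afterwards so queries on all coordinators see the new files
refresh table1 partition (partition_key = 1);
```

Avoid running any other refresh, invalidate metadata, or insert overwrite against this partition while the insert is in flight, per the two caveats above.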