Created 10-28-2015 08:42 AM
Hi all,
I have a problem. I have one table in Impala, partitioned by year, month, day, and the name of the load file. Example:
/user/hive/warehouse/table/year=2015/month=01/day=01/nameloadfile1/file.parq
/user/hive/warehouse/table/year=2015/month=01/day=01/nameloadfile1/file2.parq
/user/hive/warehouse/table/year=2015/month=01/day=01/nameloadfile1/file3.parq
The files are very small (a few MB each). I would like to combine them all into one file:
/user/hive/warehouse/table/year=2015/month=01/day=01/nameloadfile1/combineFiles.parq
Does anyone have an idea how to do that?
Best regards
Created 11-03-2015 01:51 PM
It's best to do this within Impala. One option is just to create a new table then insert the data from the old one. E.g.
create table table2 like table1;
insert into table2 select * from table1;
If you only want to combine the files from a single partition, you can copy the data to a different table, drop the old partition, then insert the data back into the original partition to produce a single compacted partition. E.g.
create table table2 like table1;
insert into table2 select * from table1 where partition_key=1;
alter table table1 drop partition (partition_key=1);
insert into table1 partition(partition_key=1) select * from table2;
drop table table2;
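To verify the result of the steps above, Impala can list the files behind a partition directly (a sketch using the same placeholder table and partition names as the example):

```sql
-- compare the file count and sizes before and after the compaction
show files in table1 partition (partition_key=1);
```

If the insert ran as a single fragment you should see one Parquet file; with multiple fragments you may still see one file per fragment.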
Created 11-21-2015 04:12 AM
Hi Tim,
I have a similar issue: within one single partition there are multiple small files. I already followed the solution you suggested, however the number of files did not reduce; in fact it increased. Here is the link to my question with the profile and other details. The block size I kept was 256 MB.
Please advise,
Thanks
Amit
Created 04-08-2016 03:22 AM
Similar problem, unsolved!
I created an unpartitioned table stored as Parquet, then did an insert overwrite into it with a select from an old partitioned table.
The new unpartitioned table still contained 5 .parq files (each only 70-80 KB).
Why?
Created 04-08-2016 05:58 PM
If the query runs with multiple fragments e.g. on 5 nodes you can get one file per fragment. If you look at the query plan (i.e. explain <the query text>) it will show the degree of parallelism for the insert.
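For example, you can check the parallelism before running the insert (a sketch using the placeholder table names from earlier in the thread):

```sql
explain insert into table2 select * from table1;
```

The plan shows how many hosts/fragment instances execute the insert; each instance writes its own output file, which is why an insert running on 5 nodes can leave 5 small files.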
Created 09-26-2017 11:33 PM
How would you suggest doing the highlighted steps with zero downtime (i.e. so that partition_key=1 of table1 is always available)?
create table table2 like table1;
insert into table2 select * from table1 where partition_key=1;
alter table table1 drop partition (partition_key=1);
insert into table1 partition(partition_key=1) select * from table2;
drop table table2;
Created 09-28-2017 12:45 PM
You can do this:
insert overwrite table1 partition(partition_key) select * from table1 where partition_key=1;
(Note the dynamic partition clause: with select *, the partition column comes last in the select list, which is what a dynamic partition insert expects. With a static clause like partition(partition_key=1), the select list must not include the partition column.)
This process should mostly work as you'd expect.
However, there are a few situations where this may cause problems:
- If you run concurrent "refresh" or "invalidate metadata" commands against that table/partition before the insert completes, some queries may see missing or duplicate data from that partition (fix via a refresh after the insert).
- Do not run concurrent "insert overwrite" statements against the same partition. You may end up with missing/duplicate data in that partition.
If you can guarantee that the above two situations are not a problem for you, then insert overwrite should work just fine.
Created 04-09-2016 10:44 AM
If you set num_nodes=1 that will force it to run on a single node.
Created 05-31-2017 03:34 AM
What will happen if I set num_nodes=1?
Created 05-31-2017 09:16 AM
num_nodes=1 forces Impala to execute the query on a single node (machine), which will then only write a single parquet file per partition.
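A minimal sketch of using this in impala-shell, assuming the placeholder table1/partition_key names from earlier in the thread:

```sql
set num_nodes=1;   -- force single-node execution: one writer, one file per partition
insert overwrite table1 partition(partition_key)
  select * from table1 where partition_key=1;
set num_nodes=0;   -- restore the default (use all available nodes)
```

Be aware that num_nodes=1 concentrates all the work on one host, so it is only suitable for occasional maintenance, not for production workloads.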
Created on 07-12-2017 01:34 AM - edited 07-12-2017 01:35 AM
I also had this issue, but succeeded by using parquet-tools. There is an option called merge, and you can run it like this:
hadoop jar parquet-tools merge src-folder-with-parquet target-parquet-file
It also lets you give the target file a meaningful name.
Created 07-12-2017 02:46 PM
@barnoba we strongly recommend *not* to use parquet-tools merge unless you really know what you're doing. It is known to cause some pretty bad performance problems in some cases. The problem is that it takes the row groups from the existing files and moves them unmodified into a new file - it does *not* merge the row groups from the different files. This can actually give you the worst of both worlds - you lose parallelism because the files are big, but you have all the performance overhead of processing many small row groups.
Created 03-06-2018 09:34 AM
@Tim Armstrong
But with this solution we will lose parallelism, no? And it may cause slow insert queries.
Created 01-10-2020 02:35 AM
As it says in the documentation:
Because this option results in increased resource utilization on a single host, it could cause problems due to contention with other Impala statements or high resource usage. Symptoms could include queries running slowly, exceeding the memory limit, or appearing to hang. Use it only in a single-user development/test environment; do not use it in a production environment or in a cluster with a high-concurrency or high-volume or performance-critical workload.