Created on 08-10-2017 07:32 PM - edited 09-16-2022 05:04 AM
Hi,
I have a big table, partitioned by year/month, total size 500+ GB, with 19 partitions so far. Each partition currently has about 7-8 files of roughly 3-4 GB each (138 files in total). I have five nodes running Impala, and even a simple SELECT is very slow. Should I reduce the number of files in each partition and increase the size of each file?
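For reference, SHOW TABLE STATS is where I'm getting the per-partition file counts and sizes from (the table name is just a placeholder):

SHOW TABLE STATS mydb.big_table;
-- output lists #Rows, #Files, Size, and Format for each year/month partition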
Thanks
Shannon
Created 08-10-2017 07:47 PM
Also, the block size is 512 MB.
Created 08-10-2017 08:08 PM
When I run a simple SELECT with WHERE id = 'xxx', I see there are 1111 jobs/tasks(?). Is that because my block size is 512 MB and the total is 500 GB, so there are 1000+ blocks to scan? If so, should I increase my block size?
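For reference, this is roughly what I'm running (table and column names are just placeholders):

EXPLAIN
SELECT *
FROM mydb.big_table
WHERE id = 'xxx';
-- the plan shows the HDFS scan node with partitions=N/M, files=..., size=...,
-- which is where I'm trying to match up the task count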
Created 08-10-2017 10:24 PM
Sometimes I prefer bucketing over partitioning because of the large number of files that get created.
Also, with bucketing you actually have control over the number of buckets.
I would suggest you test bucketing versus partitioning in your test environment.
It is also good practice to collect statistics for the table; it will help on the performance side.
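For example, in Impala that would be something like this (the table name and partition spec are just placeholders):

COMPUTE STATS mydb.big_table;
-- or, to refresh only a newly loaded partition (Impala 2.1+):
COMPUTE INCREMENTAL STATS mydb.big_table PARTITION (year=2017, month=8);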
Created 08-11-2017 04:57 AM
Thanks, I will take a look at bucketing.
Do you mean run COMPUTE STATS? Yes, I did that.
Created 08-11-2017 06:04 AM
Took a quick look at bucketing; I don't have one column that is used the most, and this table will join other tables later.
As far as controlling the number of files: since I have an idea of the data size, I can control the file size when doing the INSERT ... SELECT by setting hive.merge.size.per.task and hive.merge.smallfiles.avgsize.
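For example, something like this in the Hive session before the INSERT (the sizes and table names are just illustrative):

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=1073741824;      -- target roughly 1 GB per merged file
SET hive.merge.smallfiles.avgsize=536870912;  -- merge when average output file is under ~512 MB
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE mydb.big_table PARTITION (year, month)
SELECT c1, c2, year, month FROM mydb.staging_table;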
My questions:
1. In this case, how big can I set the block size: 1 GB, 2 GB? Will it hurt if it is set too big?
2. How big should each file be? Right now I set it to 4 GB; should I increase it?
3. I have 5 nodes; given that, is it better to have at least 5 files per partition?
Thanks
Shannon
Created 08-11-2017 09:41 AM
Hi Shannon,
Impala does not split up Parquet files across several readers when reading them. Instead, only one daemon will be assigned to each file and will read the whole file. Therefore it is recommended to have only one block per file; otherwise some of the blocks can be on remote nodes, and remote reads will slow down your queries. See this page for more information: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_perf_cookbook.html
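For example, if you rewrite the table through Impala, you can cap the Parquet file size so each file fits in a single block (sizes, table, and column names below are just illustrative):

SET PARQUET_FILE_SIZE=256m;   -- keep each output file at or below one HDFS block

INSERT OVERWRITE big_table PARTITION (year, month)
SELECT c1, c2, year, month    -- partition key columns go last
FROM big_table_staging;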
Cheers, Lars
Created 08-11-2017 09:52 AM
Thanks Lars, trying to understand how it applies here: should I try to increase the block size to 4 GB and keep each file under 4 GB (3-4 GB)?
Shannon
Created 08-11-2017 10:08 AM
I'd try to reduce the file size to 256 MB and make sure that the block size is at least that large, too. That way, with roughly 32 GB per partition, you should end up with about 32 GB / 256 MB = 128 files per partition. That should allow you to exploit parallelism across all your nodes. You can also try 512 MB per file and see if that improves things, but I suspect it won't.
Btw, we're currently working on improving ETL performance. You may want to look at the "SORT BY" clause that is included in Impala 2.9 and how it lets you write data in a way that allows Impala to skip row groups much more effectively. You can find more information in the umbrella JIRA: https://issues.apache.org/jira/browse/IMPALA-2522
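For example, a sketch of what that looks like (schema and column names are just placeholders, and SORT BY needs Impala 2.9 or later):

CREATE TABLE big_table_sorted (
  id STRING,
  c1 STRING,
  c2 BIGINT
)
PARTITIONED BY (year INT, month INT)
SORT BY (id)
STORED AS PARQUET;

-- rows written through Impala are then sorted by id within each file,
-- which tightens the min/max statistics used to skip row groups
INSERT OVERWRITE big_table_sorted PARTITION (year, month)
SELECT id, c1, c2, year, month FROM big_table;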
Created 08-11-2017 10:46 AM
But won't that be too many files in total? I had smaller files (512 MB) before and it was slow as well. We only have 5 nodes right now, so will that many files actually help?
Thanks
Shannon