I have a big table, partitioned by year/month,total size 500+G, 19 partitions so far, each partion right now about 7-8 files each about 3-4G (total 138 files) i have five nodes running impala, it is very slow for a simple select, should i reduce the number of files in each partition and increase the size of each file?
When i run a simple select with where id ='xxx', i see there are 1111 jobs/tasks(?), it is bcoz my block size is 512 and total is 500G, so 1000+ blocks to scan? If so i should increase my block size?
Somtimes I prefer bucketing over Partition due to large number of files getting created .
Also in bucketing actually you have the control over the number of buckets.
I would suggest you test the bucketing over partition in your test env
also it is a good practice to collect statistics for the table it will help in the performance side .
Took a quick look at bucketing, i dont have one column that is used the most, and this table will join other tables later.
As far as controlling the number of files, since i have an idea of the data size, i can controll the file size when doing insert -select i can set hive.merge.size.per.task and hive.merge.smallfiles.avgsize.
1, in this case, how big can i set the block size, 1G, 2G, will it hurt if set to big?
2, how big should i use for the file, right now i set to 4G, should i increase?
3, i have 5 nodes, with that for each partition it is better to have at least 5 files?
Impala does not split up Parquet files over several readers when reading them. Instead, only one daemon will be assigned for each file and will read the whole file. Therefore it is recommended to have only one block per file. Otherwise some of the blocks can be on remote nodes and remote reads will slow down your queries. See this page for more information: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_perf_cookbook.html
Thanks Lars, trying to understand how it appies here, should i try to increate the blozks size to 4G, and keep each file under 4G (3-4G)?
I'd try to reduce the file size to 256MB and make sure that the block size is at least that large, too. That way you should end up with 32GB * 4 = 128 files per partition. That should allow you to exploit parallelism across all your nodes. You can also try 512MB per file and see if that improves things, but I suspect it won't.
Btw, we're currently working on improving the ETL performance. You may want to look at the "SORT BY" clause that is included in Impala 2.9 and how it allows you to write data in a way that allows Impala to skip row groups much more effectively. You can find more information in the umbrella JIRA: https://issues.apache.org/jira/browse/IMPALA-2522
But will the total files too many? i had smaller sized files before with 512M, was slow as well. we only have 5 nodes now so too many files will not help?