
Partition, file size and number of files question

Contributor

Hi,

 

I have a big table partitioned by year/month, total size 500+ GB, 19 partitions so far. Each partition currently has about 7-8 files of about 3-4 GB each (138 files total). I have five nodes running Impala, and even a simple select is very slow. Should I reduce the number of files in each partition and increase the size of each file?

 

Thanks

Shannon


Contributor

Also, the block size is 512 MB.

Contributor

When I run a simple select with WHERE id = 'xxx', I see around 1111 jobs/tasks(?). Is that because my block size is 512 MB and the total is 500 GB, so there are 1000+ blocks to scan? If so, should I increase my block size?

Champion

Sometimes I prefer bucketing over partitioning because of the large number of files that partitioning can create.

With bucketing you also have direct control over the number of buckets.

I would suggest you test bucketing versus partitioning in your test environment; a rough sketch is below.

It is also good practice to collect statistics for the table; it will help on the performance side.
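For example, a bucketed version of the table in Hive could look something like this. It is only a sketch: the table name, columns, and bucket count are made up for illustration.

-- Hypothetical bucketed table; names, columns and bucket count are illustrative only.
CREATE TABLE events_bucketed (
  id       STRING,
  payload  STRING,
  event_ts TIMESTAMP
)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS PARQUET;

-- Older Hive versions need this so inserts honor the bucket definition.
SET hive.enforce.bucketing=true;

-- Populate it from the existing partitioned table (name assumed).
INSERT OVERWRITE TABLE events_bucketed
SELECT id, payload, event_ts
FROM events_partitioned;

-- Then collect statistics so the optimizer has row counts. In Impala:
COMPUTE STATS events_bucketed;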

Contributor

Thanks, I will take a look at bucketing.

 

Do you mean running COMPUTE STATS? Yes, I did that.

Contributor

I took a quick look at bucketing. I don't have one column that is used the most, and this table will be joined to other tables later.

 

As far as controlling the number of files goes, since I have an idea of the data size, I can control the file size when doing the INSERT ... SELECT by setting hive.merge.size.per.task and hive.merge.smallfiles.avgsize; a rough sketch of what I mean is below.
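Something like this is what I have in mind (untested; the table names, partition values, and target sizes are just placeholders):

-- Hive: merge small output files produced by the INSERT ... SELECT.
-- Table names, partition values and sizes below are placeholders.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=268435456;        -- aim for ~256 MB per merged file
SET hive.merge.smallfiles.avgsize=134217728;   -- merge when the average file size is below ~128 MB

INSERT OVERWRITE TABLE events_partitioned PARTITION (year=2017, month=6)
SELECT id, payload, event_ts
FROM events_staging
WHERE year = 2017 AND month = 6;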

 

My questions:

1. In this case, how big can I set the block size: 1 GB, 2 GB? Will it hurt if it is set too big?

2. How big should each file be? Right now I set it to 4 GB; should I increase it?

3. I have 5 nodes; given that, is it better to have at least 5 files per partition?

 

Thanks

Shannon

Super Collaborator

Hi Shannon,

 

Impala does not split up Parquet files across several readers when reading them. Instead, only one daemon is assigned to each file and reads the whole file. Therefore it is recommended to have only one block per file; otherwise some of the blocks can end up on remote nodes, and remote reads will slow down your queries. See this page for more information: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_perf_cookbook.html
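To see how your file sizes line up with the block size, you can list the files for a partition with Impala's SHOW FILES statement; the table and partition names below are just placeholders:

-- List files and their sizes for one partition (table/partition names are placeholders).
SHOW FILES IN events_partitioned PARTITION (year=2017, month=6);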

 

Cheers, Lars

Contributor

Thanks Lars. Trying to understand how it applies here: should I try to increase the block size to 4 GB and keep each file under 4 GB (3-4 GB)?

 

Shannon

Super Collaborator

I'd try to reduce the file size to 256 MB and make sure that the block size is at least that large, too. That way you should end up with roughly 32 GB per partition * 4 files per GB = 128 files per partition, which should allow you to exploit parallelism across all your nodes. You can also try 512 MB per file and see if that improves things, but I suspect it won't.
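If you rewrite the data through Impala, one way to target smaller files is the PARQUET_FILE_SIZE query option. This is only a sketch, and the table names and partition values are placeholders:

-- Impala: write ~256 MB Parquet files when inserting.
-- Table names and partition values are placeholders.
SET PARQUET_FILE_SIZE=268435456;
INSERT OVERWRITE events_partitioned PARTITION (year=2017, month=6)
SELECT id, payload, event_ts
FROM events_staging
WHERE year = 2017 AND month = 6;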

 

Btw, we're currently working on improving ETL performance. You may want to look at the "SORT BY" clause that is included in Impala 2.9 and how it lets you write data in a way that allows Impala to skip row groups much more effectively; a small sketch of the syntax is below. You can find more information in the umbrella JIRA: https://issues.apache.org/jira/browse/IMPALA-2522
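A rough sketch of the syntax (the table and column names are made up for illustration):

-- Impala 2.9+: SORT BY orders the data within each file so row groups can be skipped.
-- Table and column names are illustrative only.
CREATE TABLE events_sorted (
  id       STRING,
  payload  STRING,
  event_ts TIMESTAMP
)
PARTITIONED BY (year INT, month INT)
SORT BY (id)
STORED AS PARQUET;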

Contributor

But won't that be too many files in total? I had smaller files (512 MB) before and it was slow as well. We only have 5 nodes right now, so won't too many files fail to help?

 

Thanks

Shannon