Created 06-22-2017 05:51 AM
Hi,
We do "insert into 'target_table' select a,b,c from x where .." kind of queries for a nightly load. This insert goes in a new partition of the target_table. Now the concern is : this inserts load hardly any data ( I would say less than 128 MB per day) but 1200 files. Each file in few KiloBytes. This is slowing down the performance. How can we make sure, this load does not generate lot of small files?
I have already set hive.merge.mapfiles and hive.merge.mapredfiles to true in the custom/advanced hive-site.xml, but the load job still produces 1200 small files. I know where 1200 comes from: it is the maximum number of reducers/containers configured in one of the hive-site settings. (I do not think a cluster-wide change is a good idea, since it could affect other jobs that use those containers when they are free.) What other settings could ensure the Hive insert does not take 1200 slots and generate lots of small files?
I also have another question, which is partly contrary to the above (this one is relatively less important):
When I reload this table by creating another table with a SELECT on the target table, the newly created table does not contain many small files. What could be the reason?
Created 06-22-2017 06:52 AM
If the source data is less than the block size on a daily basis, then each day's load creates files smaller than the block size, so many small files accumulate over time. Once the data is loaded into the table, the merge properties have nothing to do with it when you load into the same table again; they only have an impact when the source has many small files at the time the job is triggered.
As for your other question: since the first table has 1200 small files and hive.merge.mapredfiles is set to true, the mappers can read multiple files and combine them when those files are smaller than the block size. So once the MapReduce job completes, it merges as many of the files as possible and pushes the result into the new Hive table.
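As a rough illustration (the new table name is a placeholder), a reload like this gives the merge step a chance to combine the small files:

CREATE TABLE target_table_compacted AS
SELECT * FROM target_table;

The mappers read the 1200 small files, and the merge step at the end of the job writes far fewer, larger files into the new table.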
Hope it helps!!
Created 06-22-2017 02:29 PM
Is your target table partitioned? If so, have you tried hive.optimize.sort.dynamic.partition?
Providing the target table DDL may be useful.
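A rough sketch of how that could be applied, assuming a dynamic-partition insert (load_date is a placeholder partition column, not from the original post):

SET hive.optimize.sort.dynamic.partition=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;  -- needed for a fully dynamic partition insert

INSERT INTO TABLE target_table PARTITION (load_date)
SELECT a, b, c, load_date FROM x WHERE ...;

Sorting by the partition key means each reducer keeps only one partition's file open at a time, which tends to cut down the file count.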
Created 06-22-2017 11:38 PM
@Smart Solutions You should probably try the settings below if you are using the Tez engine in Hive.
set hive.merge.tezfiles=true; set hive.merge.smallfiles.avgsize=128000000; set hive.merge.size.per.task=128000000;
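For instance, a nightly session could look like this (the INSERT itself is a placeholder modeled on the original question):

SET hive.merge.tezfiles=true;                 -- merge small files at the end of a Tez DAG
SET hive.merge.smallfiles.avgsize=128000000;  -- trigger a merge when average file size is below ~128 MB
SET hive.merge.size.per.task=128000000;       -- aim for ~128 MB per merged file

INSERT INTO TABLE target_table PARTITION (load_date='2017-06-22')
SELECT a, b, c FROM x WHERE ...;

Because these are session-level SET commands, they only affect this job, not the whole cluster.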
Created 06-27-2017 09:08 AM
This works for us. Thanks!
Created 06-23-2017 06:37 PM
You can control the number of mappers launched in Tez by setting the parameter tez.grouping.split-count. Let's say you have a 1 GB table: if you set tez.grouping.split-count=10, each mapper will process 100 MB of data, resulting in 10 mappers.
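A sketch reusing the insert pattern from the question (table and column names are placeholders):

SET tez.grouping.split-count=10;  -- ask Tez to group the input into roughly 10 splits, i.e. ~10 mappers

INSERT INTO TABLE target_table PARTITION (load_date='2017-06-22')
SELECT a, b, c FROM x WHERE ...;

For a map-only insert, fewer mappers directly means fewer output files.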
Created 06-24-2017 05:17 PM
Maybe consider using clustering/bucketing? It allows you to specify a fixed number of files per partition. See the answer on: How many buckets does your table need.
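A minimal sketch, assuming hypothetical column names and 8 buckets per partition:

CREATE TABLE target_table_bucketed (a INT, b STRING, c STRING)
PARTITIONED BY (load_date STRING)
CLUSTERED BY (a) INTO 8 BUCKETS;

SET hive.enforce.bucketing=true;  -- needed on older Hive releases; bucketing is always enforced in Hive 2.x

INSERT INTO TABLE target_table_bucketed PARTITION (load_date='2017-06-22')
SELECT a, b, c FROM x WHERE ...;

Each partition then ends up with at most 8 files, one per bucket.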
Created 06-24-2017 05:27 PM
My links aren't showing up, but you can find discussion on this site about bucketing.
Created 06-24-2017 05:33 PM
"When I reload this table by creating another table with a SELECT on the target table, the newly created table does not contain many small files. What could be the reason?"
There are many factors that determine how many files are output; in MapReduce terms, how many reducers are used. Depending on the workload type, each job may have been allowed a different number of reducers. Beyond that: did you use the same engine to create both tables (Spark and Hive, or just Hive in both cases)? Did you run both jobs under the same user, with the same config? So yes, lots of things could affect it.
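For example, in classic MapReduce mode these standard Hive settings influence the reducer count, and therefore the number of output files (the values here are only illustrative):

SET hive.exec.reducers.bytes.per.reducer=128000000;  -- aim for ~128 MB of input per reducer
SET hive.exec.reducers.max=100;                      -- cap the number of reducers
SET mapreduce.job.reduces=10;                        -- or force an exact reducer count

If the two jobs ran with different values for any of these, they could easily produce very different file counts.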