
Controlling Number of small files while inserting into Hive table

Expert Contributor

Hi,

We run "insert into 'target_table' select a,b,c from x where .." kinds of queries for a nightly load. Each insert goes into a new partition of the target_table. The concern is: this insert loads hardly any data (I would say less than 128 MB per day) but writes about 1200 files, each only a few kilobytes. This is slowing down performance. How can we make sure this load does not generate a lot of small files?

I have already set hive.merge.mapfiles and hive.merge.mapredfiles to true in the custom/advanced hive-site.xml, but the load job still writes the data as 1200 small files. I know where the 1200 comes from: it is the maximum number of reducers/containers available, configured in one of the hive-site settings. (I do not think it is a good idea to change that cluster-wide, as it can affect other jobs that could use those containers when they are free.) What other settings or approaches would keep the Hive insert from taking 1200 slots and generating lots of small files?
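
For reference, the nightly statement looks roughly like this (load_date is a hypothetical partition column standing in for the real one, and the real filter is more involved):

-- hive.merge.mapfiles and hive.merge.mapredfiles are already set to true in hive-site.xml
insert into target_table partition (load_date='2024-01-01')
select a, b, c from x where load_date = '2024-01-01';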

I also have another question, which is partly contrary to the above (this one is relatively less important):

When I reload this table by creating another table with a select on the target table, the newly created table does not contain many small files. What could be the reason?

1 ACCEPTED SOLUTION

Contributor

@Smart Solutions You should probably try the settings below if you are using the Tez engine in Hive.

set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=128000000;
set hive.merge.size.per.task=128000000;
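
As a sketch, these would run in the same session as the nightly insert (the 128000000 values are bytes, roughly 128 MB; table, column, and partition names below are placeholders carried over from the question):

set hive.merge.tezfiles=true;                 -- merge the small files produced at the end of a Tez job
set hive.merge.smallfiles.avgsize=128000000;  -- if the average output file size is below this, run an extra merge step
set hive.merge.size.per.task=128000000;       -- target size of the merged files
insert into target_table partition (load_date='2024-01-01')
select a, b, c from x where load_date = '2024-01-01';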


8 REPLIES


@Smart Solutions

If the source data for a day is less than the block size, the files created for that day will each be smaller than the block size, so many small files accumulate on a daily basis. Once the data is loaded into the table, the merge properties no longer apply to it when you load into the same table again; they only have an impact when the source has many small files at the time the job is triggered.

And the answer to the other question: since the first table has 1200 small files and hive.merge.mapredfiles is set to true, the mappers can read and combine as many of those files as possible, provided each is smaller than the block size. So once the MapReduce job completes, it merges those files and pushes the result into the Hive table.
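
As a rough sketch of the reload described in the question (target_table is the name used above; the new table name is made up):

-- per the explanation above, the merge step combines the small inputs into fewer, larger output files
create table target_table_reloaded as
select * from target_table;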

Hope it helps!!

Super Collaborator

Is your target table partitioned? If so, have you tried hive.optimize.sort.dynamic.partition?

Providing the target table DDL may be useful.
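
A sketch of how that setting could be applied to a dynamic-partition insert (load_date is a hypothetical partition column; nonstrict mode is only needed when no static partition value is given):

set hive.optimize.sort.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow a fully dynamic partition spec
insert into target_table partition (load_date)
select a, b, c, load_date from x;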

Contributor

@Smart Solutions You should probably try the settings below if you are using the Tez engine in Hive.

set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=128000000;
set hive.merge.size.per.task=128000000;

Expert Contributor

This works for us. Thanks.

Contributor

@Smart Solutions

You can control the number of mappers launched in Tez by setting the parameter tez.grouping.split-count. Let's say your source table x is about 1 GB: if you set tez.grouping.split-count=10, each mapper will process about 100 MB of data, resulting in 10 mappers.
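
For example (the split count and the names are illustrative), with roughly 1 GB of source data this caps the insert at 10 mappers, which for a map-only insert into a single partition typically means about 10 output files:

set tez.grouping.split-count=10;
insert into target_table partition (load_date='2024-01-01')
select a, b, c from x where load_date = '2024-01-01';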

Expert Contributor

Maybe consider using clustering/bucketing? It allows you to specify a fixed number of files per partition. The answer then comes down to how many buckets your table needs.
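
A minimal sketch of a bucketed version of the target table (column names, types, and the bucket count of 8 are all illustrative; how many buckets the table really needs is the question to answer first):

create table target_table_bucketed (a string, b string, c string)
partitioned by (load_date string)
clustered by (a) into 8 buckets
stored as orc;
-- on older Hive versions, also run: set hive.enforce.bucketing=true;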

Expert Contributor

My links aren't showing up, but you can find discussions about bucketing on this site.

Expert Contributor

When I reload this table by creating another table with a select on the target table, the newly created table does not contain many small files. What could be the reason?

There are a lot of factors that go into how many files are output, which in MapReduce terms comes down to how many reducers are used. It depends on things like how many reducers are allowed for each type of workload. Did you use the same engine to create both tables (Spark and Hive, or just Hive in both cases)? Did you run both jobs under the same user, with the same config? So yes, lots of things could affect it.