Controlling Number of small files while inserting into Hive table
Labels: Apache Hive, Apache Tez
Created 06-22-2017 05:51 AM
Hi,
We do "insert into 'target_table' select a,b,c from x where .." kind of queries for a nightly load. This insert goes in a new partition of the target_table. Now the concern is : this inserts load hardly any data ( I would say less than 128 MB per day) but 1200 files. Each file in few KiloBytes. This is slowing down the performance. How can we make sure, this load does not generate lot of small files?
I have already set : hive.merge.mapfiles and hive.merge.mapredfiles to true in custom/advanced hive-site.xml. But still the load job loads data with 1200 small files. I know why 1200 is, this is the value of maximum number of reducers/containers available in one of the hive-sites. (I do not think its a good idea to do cluster wide setting, as this can affect other jobs which can use cluster when it has free containers) What could be other way/settings, so that the hive insert do not take 1200 slots and generate lots of small files?
I also have another question which is partly contrary to above : (This is relatively less important)
When I reload this table by creating another table by doing select on target table, this newly created table does not contain too many small files. What could be the reason?
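For reference, a rough sketch of what the nightly load looks like with the merge properties set per session instead of cluster-wide (table and column names are placeholders, not our real DDL):
set hive.merge.mapfiles=true;      -- merge small files produced by map-only jobs
set hive.merge.mapredfiles=true;   -- merge small files produced by map-reduce jobs
insert into target_table
select a, b, c from x where ...;   -- lands in a new daily partition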
Created 06-22-2017 06:52 AM
If the source data is less than the block size on a daily basis, then one small file (smaller than the block size) is created per day, so many small files accumulate over time. Once the data is loaded into the table, the merge properties have nothing to do with it; they only have an impact when the source has many small files at the time the job is triggered.
As for the other question: since the first table has 1200 small files and hive.merge.mapredfiles is set to true, the mappers can read as many of those files as possible and combine them when their sizes are below the block size. So once the MapReduce job completes, it merges as many files as it can and pushes the result into the Hive table.
Hope it helps!!
Created 06-22-2017 02:29 PM
Is your target table partitioned? If so, have you tried hive.optimize.sort.dynamic.partition?
Providing the target table DDL may be useful.
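If the load is a dynamic-partition insert, a minimal sketch would look something like this (the column and partition names are placeholders, and whether your load is static or dynamic isn't stated in the question):
set hive.optimize.sort.dynamic.partition=true;   -- sort rows by partition key so each reducer writes one file per partition
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow fully dynamic partition inserts
insert into target_table partition (load_date)
select a, b, c, load_date from x where ...;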
Created 06-22-2017 11:38 PM
@Smart Solutions You should probably try the settings below if you are using the Tez engine in Hive.
set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=128000000;
set hive.merge.size.per.task=128000000;
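Roughly, applied per session right before the nightly insert rather than cluster-wide (table and column names are placeholders from the question, and the 128000000 values are just the ~128 MB target mentioned there):
set hive.merge.tezfiles=true;                 -- run a merge pass on the Tez job output
set hive.merge.smallfiles.avgsize=128000000;  -- trigger the merge when the average output file size falls below ~128 MB
set hive.merge.size.per.task=128000000;       -- aim for merged files of roughly 128 MB
insert into target_table
select a, b, c from x where ...;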
Created 06-27-2017 09:08 AM
This works for us. Thanks.
Created 06-23-2017 06:37 PM
You can control the number of mappers launched in Tez by setting the parameter tez.grouping.split-count. Let's say your table x is 1 GB in size; you can set tez.grouping.split-count=10, so each mapper will process 100 MB of data, resulting in 10 mappers.
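Following that 1 GB example, roughly (the split count is purely illustrative and would be tuned to your actual input size):
set tez.grouping.split-count=10;   -- ~1 GB of input / 10 splits ≈ 100 MB per mapper
insert into target_table
select a, b, c from x where ...;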
Created 06-24-2017 05:17 PM
Maybe consider using clustering/bucketing? It allows you to specify a fixed number of files per partition. See the answer on: How many buckets does your table need.
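A rough sketch of a bucketed target table; the bucket count, column types, bucketing column, and partition column are all illustrative placeholders, not the actual DDL from the question:
create table target_table_bucketed (
  a string,
  b string,
  c string
)
partitioned by (load_date string)
clustered by (a) into 8 buckets    -- fixes the file count per partition at 8
stored as orc;

set hive.enforce.bucketing=true;   -- needed on older Hive releases; always enforced in Hive 2.x
insert into target_table_bucketed partition (load_date='2017-06-22')
select a, b, c from x where ...;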
Created 06-24-2017 05:27 PM
My links aren't showing up, but you can find discussion on this site about bucketing.
Created 06-24-2017 05:33 PM
When I reload this table by creating another table from a select on the target table, the newly created table does not contain that many small files. What could be the reason?
There are lots of factors that go into how many files are output; in MapReduce terms, it comes down to how many reducers are used. Depending on the configuration, a different number of reducers may be allowed for each type of workload. Beyond that: did you use the same engine to create both tables (Spark and Hive, or just Hive in both cases)? Did you run both jobs under the same user, with the same config? So yes, lots of things could affect it.
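For instance, these are the usual knobs that end up deciding the reducer count (and hence the file count) for a Hive job; if the two loads ran with different values for any of them, the output file counts would differ (the values shown here are only illustrative):
set hive.exec.reducers.bytes.per.reducer=256000000;  -- data volume handled by each reducer
set hive.exec.reducers.max=1009;                      -- upper bound on reducers per job
set mapred.reduce.tasks=10;                           -- force an exact reducer count, overriding the estimate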
