Control output file size without adding the merge property in Hive
- Labels: Apache Hive
Created 02-26-2018 07:21 AM
Hi,
I am trying to reduce the number of small files, but adding the merge property affects performance, since a separate job is triggered for the merge. Is there any way to control the size of the output files produced by the mappers or reducers?
Thanks in advance!
Mithun
Created 02-27-2018 08:00 AM
Hello Mithun,
Having a merge step is definitely the more foolproof approach. Otherwise you will need to know more about your data and its distribution, and tune the settings yourself. A first step would be hive.merge.smallfiles.avgsize, which adds the extra merge step only when the average output file size falls below that threshold.
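As a minimal sketch, the merge-related properties could be set per session like this (the byte values are illustrative, not recommendations):

```sql
-- Merge small output files from map-only and map-reduce jobs
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;

-- Only trigger the extra merge job when the average output file
-- is smaller than this threshold (here: 128 MB)
SET hive.merge.smallfiles.avgsize = 134217728;

-- Target size of the merged files (here: 256 MB)
SET hive.merge.size.per.task = 268435456;
```

With smallfiles.avgsize set, the merge job is skipped whenever the query already produces reasonably sized files, so the performance cost is only paid when it is actually needed.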
You can also set the number of reducers yourself, either statically or dynamically based on the volume of incoming data; if you know your workload, this lets you calculate the output file size roughly.
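Assuming a known input volume, both approaches could look roughly like this (values are illustrative):

```sql
-- Dynamic: let Hive derive the reducer count from input volume.
-- e.g. ~10 GB of reduce input / 256 MB per reducer ≈ 40 reducers,
-- which means roughly 40 output files of ~256 MB each.
SET hive.exec.reducers.bytes.per.reducer = 268435456;

-- Static: pin the reducer count directly for a known workload.
SET mapreduce.job.reduces = 40;
```

The dynamic setting scales with the data, while the static one gives a fixed file count but risks skewed or oversized files if the input volume changes.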
It seems like a trade-off between a more generic approach with a merge step, and a more granular approach in which you know your workload.
Hope this helps!
