Split ORC file

Contributor

Hello,

Is there a way to split a 2 GB ORC file into 50 MB files?

We have many ORC files (larger than 1 GB) in HDFS. We are planning to move those files to S3 and point a Hive external table at S3. Performance has been significantly affected by copying the larger files. If I split those files into multiple files of 50 MB or less and copy them to S3, then the performance is comparable to HDFS. (To test this, I created another table stored as ORC and inserted the existing table's data into it, which created multiple files, but that is not a viable solution because I have many tables, some of them with multiple partitions.)

Is it possible to split the ORC files into multiple files?

2 REPLIES

Super Guru
@bhavik shah

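Are you willing to write a MapReduce job? You can use mapreduce.input.fileinputformat.split.maxsize to reduce the split size of your ORC file. The split size is calculated as

max(mapreduce.input.fileinputformat.split.minsize, min(mapreduce.input.fileinputformat.split.maxsize, dfs.blocksize))

Per this formula, raising mapreduce.input.fileinputformat.split.minsize can only make splits larger, so the 50 MB cap has to come from the maxsize setting. Set mapreduce.input.fileinputformat.split.maxsize to 50 MB, keep mapreduce.input.fileinputformat.split.minsize at or below that value, and then send the output of each mapper to S3.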
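To see how the formula behaves, here is a small sketch in Python that just evaluates it. It does not talk to Hadoop. The 128 MB block size and the "unset" values for minsize and maxsize are assumptions for illustration, so check what your cluster actually uses.

```python
import math

MB = 1024 * 1024

def compute_split_size(min_size, max_size, block_size):
    """Evaluate the formula above: max(minsize, min(maxsize, blocksize))."""
    return max(min_size, min(max_size, block_size))

# Assumed values for illustration only, not read from any cluster.
BLOCK_SIZE = 128 * MB      # assumed dfs.blocksize
UNSET_MAX = 2**63 - 1      # assumed "no limit" value for split.maxsize
UNSET_MIN = 1              # assumed effectively-unset split.minsize

# Case 1: only minsize is set to 50 MB; the block size still wins.
only_min = compute_split_size(50 * MB, UNSET_MAX, BLOCK_SIZE)

# Case 2: maxsize is set to 50 MB; the split size drops to 50 MB.
only_max = compute_split_size(UNSET_MIN, 50 * MB, BLOCK_SIZE)

print(only_min // MB)  # 128
print(only_max // MB)  # 50

# Rough number of splits (and so mappers) for one 2 GB file at 50 MB each.
file_size = 2 * 1024 * MB
approx_splits = math.ceil(file_size / only_max)
print(approx_splits)
```

Under these assumptions, the 50 MB cap only takes effect through maxsize, and a 2 GB file ends up as roughly 41 splits. The exact count on a real job can differ slightly.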

Writing a MapReduce job is the right way to do it. If you don't want to write a MapReduce job and would rather use Hive, then you will have to create a new table with the same data and use INSERT ... SELECT to populate the new table:

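-- Merge small output files after map-reduce and map-only jobs.
-- Sizes are in bytes (51200000 is about 51 MB).
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=51200000;
set hive.merge.size.per.task=51200000;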

Contributor

Thank you for the response. I did it by creating a temporary Hive table.