Created 04-17-2017 04:09 AM
Hello,
is there a way to split a 2 GB ORC file into 50 MB files?
We have many ORC files (larger than 1 GB) in HDFS. We are planning to move those files to S3 and point a Hive external table at S3. Copying the larger files has significantly hurt performance. If I split those files into multiple files of 50 MB or less and copy them to S3, then the performance is comparable to HDFS. (To test this, I created another table stored as ORC and inserted the existing table's data, which produced multiple files, but that is not a viable solution because I have tables with multiple partitions, and many tables.)
Is it possible to split the ORC files into multiple files?
Created 04-17-2017 04:45 AM
Are you willing to write a MapReduce job? You can use mapreduce.input.fileinputformat.split.maxsize to reduce the split size of your ORC files. The split size is calculated as
max(mapreduce.input.fileinputformat.split.minsize, min(mapreduce.input.fileinputformat.split.maxsize, dfs.blocksize))
Set mapreduce.input.fileinputformat.split.maxsize to 50 MB and then send the output of each mapper to S3.
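For reference, the formula above can be sketched as a small function (a minimal illustration, not Hadoop code; the property names are passed as plain arguments, values in bytes):

```python
def split_size(min_size, max_size, block_size):
    """MapReduce input split size:
    max(split.minsize, min(split.maxsize, dfs.blocksize))."""
    return max(min_size, min(max_size, block_size))

# With a default 128 MB HDFS block, lowering split.maxsize to 50 MB
# caps each split at 50 MB; minsize only raises the floor.
print(split_size(1, 50 * 1024 * 1024, 128 * 1024 * 1024))  # 52428800
```

Note that raising split.minsize alone cannot shrink a split below the block size, which is why maxsize is the knob to turn here.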
Writing a MapReduce job is the right way to do it. If you don't want to write MapReduce and would rather use Hive, you will have to create a new table with the same data and use an INSERT ... SELECT to populate it:
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=51200000;
set hive.merge.size.per.task=51200000;
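With those settings in place, the temp-table approach mentioned in the question looks roughly like this (a sketch only; the table names are placeholders, and the ~50 MB output file sizes come from the merge settings above):

```sql
-- Hypothetical names; adjust columns to match your schema.
CREATE TABLE mytable_split STORED AS ORC
  AS SELECT * FROM mytable;  -- merge settings keep files near 50 MB
```

For partitioned tables, you would instead create the new table with matching PARTITIONED BY columns and populate it with a dynamic-partition INSERT ... SELECT.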
Created 04-19-2017 08:23 PM
Thank you for the response. I did it by creating a temporary Hive table.