Support Questions

bshah1 · 04-17-2017

Hello,

Is there a way to split a 2 GB ORC file into 50 MB files?

We have many ORC files (each larger than 1 GB) in HDFS. We are planning to move those files to S3 and configure a Hive external table on S3. Copying the larger files as they are has significantly hurt performance. If I split those files into multiple files of 50 MB or less and copy them to S3, then the performance is comparable to HDFS. To test this, I created another table stored as ORC and inserted the existing table's data into it, which created multiple files. That is not a viable solution, though, as I have many tables, some with multiple partitions.

Is it possible to split the ORC files into multiple files?

mqureshi · 04-17-2017

@bhavik shah

Are you willing to write a MapReduce job? You can use mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize to control the split size of your ORC file. The split size is calculated as

max(mapreduce.input.fileinputformat.split.minsize, min(mapreduce.input.fileinputformat.split.maxsize, dfs.blocksize))

Since min(maxsize, dfs.blocksize) caps the split, raising minsize alone cannot make splits smaller than a block. Set mapreduce.input.fileinputformat.split.maxsize to 50 MB (with minsize at or below that) and then send the output of each mapper to S3.
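To see how the formula above behaves, here is a small sketch in Python. It only mirrors the arithmetic; the real computation happens inside Hadoop. The 128 MB block size and the function name compute_split_size are assumptions for illustration, so substitute the values from your own cluster.

```python
MB = 1024 * 1024


def compute_split_size(min_size, max_size, block_size):
    """Split size per the formula in this thread: max(minsize, min(maxsize, blocksize))."""
    return max(min_size, min(max_size, block_size))


# Assumed example value: a 128 MB dfs.blocksize.
block = 128 * MB

# Raising only minsize to 50 MB (maxsize left very large): split stays at the block size.
only_min = compute_split_size(50 * MB, 2**63 - 1, block)

# Capping maxsize at 50 MB (minsize left small): split drops to 50 MB.
capped = compute_split_size(1, 50 * MB, block)

print(only_min // MB, capped // MB)  # prints "128 50"
```

In other words, with these assumed values it is the maxsize cap, not the minsize floor, that brings the split size below the block size.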

Writing a MapReduce job is the right way to do it. If you don't want to write a MapReduce job and would rather use Hive, you will have to create a new table with the same data and use INSERT ... SELECT to populate the new table:

set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=51200000;
set hive.merge.size.per.task=51200000;
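
For the Hive route, since the question mentions many tables, one option is to generate the statements per table. Below is a minimal sketch in Python that only builds the HiveQL text; it does not run anything. The function name build_rewrite_script, the table names, and the partition columns are hypothetical, and the two hive.exec.dynamic.partition settings are an addition for partitioned tables, not part of the answer above.

```python
def build_rewrite_script(source_table, target_table, partition_cols=(), target_bytes=51200000):
    """Return HiveQL text that copies source_table into a new table using the merge settings."""
    lines = [
        "set hive.merge.mapredfiles=true;",
        "set hive.merge.mapfiles=true;",
        f"set hive.merge.smallfiles.avgsize={target_bytes};",
        f"set hive.merge.size.per.task={target_bytes};",
    ]
    partition_clause = ""
    if partition_cols:
        # Assumption: partitioned tables are loaded with dynamic partitioning.
        lines += [
            "set hive.exec.dynamic.partition=true;",
            "set hive.exec.dynamic.partition.mode=nonstrict;",
        ]
        partition_clause = f" PARTITION ({', '.join(partition_cols)})"
    lines += [
        # CREATE TABLE ... LIKE copies the schema, partitioning and storage format.
        f"CREATE TABLE {target_table} LIKE {source_table};",
        f"INSERT OVERWRITE TABLE {target_table}{partition_clause} SELECT * FROM {source_table};",
    ]
    return "\n".join(lines)


# Hypothetical table and partition names, for illustration only.
print(build_rewrite_script("sales", "sales_small_files", ["year", "month"]))
```

Review the generated text before running it with your Hive client, and check the resulting file sizes on a small table first.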

bshah1 · 04-19-2017

Thank you for the response. I did it by creating a temporary Hive table.

Cloudera Community
