- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
split orc file
- Labels:
-
Apache Hadoop
Created ‎04-17-2017 04:09 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
is there a way to split 2 GB ORC file in to 50MB files?
We have many ORC files (larger than 1GB) in HDFS. We are planning to move those files to S3 and configure Hive external table to S3. The performance has been significantly affected by copying the larger files. If I split those files in to multiple files of 50MB or less and copy to S3 than the performance is comparable to HDFS (to test I created another table stored as ORC and insert the existing table data which created multiple files but that is not a viable solution as I have tables with multiple partition and many tables).
Is it possible to split the ORC files in to multiple files?
Created ‎04-17-2017 04:45 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Are you willing to write a map reduce job? You can use mapreduce.input.fileinputformat.split.minsize to reduce the split size of your ORC file. The split is calculated using
max(mapreduce.input.fileinputformat.split.minsize, min(mapreduce.input.fileinputformat.split.maxsize
,
dfs.blocksize))
Set your mapreduce.input.fileinputformat.split.minsize to 50 MB and then send the output of each mapper to S3.
Writing a mapreduce will be right way to do it. If you don't want to write a map reduce and instead use hive, then you will have to create a new table with same data and use Insert Select to populate new table:
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=51200000;
set hive.merge.size.per.task=51200000;
Created ‎04-19-2017 08:23 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for the response. I did it using creating the temp hive table
