Support Questions

Find answers, ask questions, and share your expertise

How to merge multiple part files while creating hive ORC files using "insert overwrite directory"

avatar
New Contributor

HI

I' am using INSERT OVERWRITE DIRECTORY to create a ORC file as shown below.

insert overwrite directory '/hdfs/temp_table/' stored as ORC

select COL_1 ,COL_2, COL_3...FROM TMP_TABLE;

I' am getting multiple part files in the results when i run this query , which i want to reduce to 1 single ORC file.

How can i merge all the small partfiles into a single file?

Thanks

3 REPLIES 3

avatar

Hi @saravanan p

one way of doing it is modify the job to run it in single reducer so that the output will be a single file.

Use this property to change the reducer to one. set mapred.reduce.tasks=1;

By default the no of files inserted in a hive table depends on the size of a file, size of map job, size of reducer job. Based on the size no of files inserted in a hive table varies.

max(mapreduce.input.fileinputformat.split.minsize, min(mapreduce.input.fileinputformat.split.maxsize, dfs.block.size))

If you have reducers running, then you should also look at

hive.exec.max.created.files, mapred.reduce.tasks, hive.exec.reducers.bytes.per.reducer

Hope it helps!

avatar
Super Collaborator
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;

reducing the tasks may impact the overall the performance(however alter also run mr and consume resources.)

post to your insert you can run a alter table statement.

more on the same can be found at ORC documentation

https://orc.apache.org/docs/hive-ddl.html

avatar
New Contributor

Hi @saravanan p

Wondering if this might be of interest to you: I was able to concatenate an external table which originally had large number of small orc files into small number of huge orc files. I suppose it might work on internal tables as well

Our HDP is: 2.5.3