Support Questions

saravanan_p1 · ‎05-17-2017

HI

I' am using INSERT OVERWRITE DIRECTORY to create a ORC file as shown below.

insert overwrite directory '/hdfs/temp_table/' stored as ORC

select COL_1 ,COL_2, COL_3...FROM TMP_TABLE;

I' am getting multiple part files in the results when i run this query , which i want to reduce to 1 single ORC file.

How can i merge all the small partfiles into a single file?

Thanks

balavignesh_nag · ‎05-17-2017

Hi @saravanan p

one way of doing it is modify the job to run it in single reducer so that the output will be a single file.

Use this property to change the reducer to one. set mapred.reduce.tasks=1;

By default the no of files inserted in a hive table depends on the size of a file, size of map job, size of reducer job. Based on the size no of files inserted in a hive table varies.

max(mapreduce.input.fileinputformat.split.minsize, min(mapreduce.input.fileinputformat.split.maxsize, dfs.block.size))

If you have reducers running, then you should also look at

hive.exec.max.created.files, mapred.reduce.tasks, hive.exec.reducers.bytes.per.reducer

Hope it helps!

bkosaraju · ‎07-25-2017

ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;

reducing the tasks may impact the overall the performance(however alter also run mr and consume resources.)

post to your insert you can run a alter table statement.

more on the same can be found at ORC documentation

https://orc.apache.org/docs/hive-ddl.html

harshaskm · ‎08-17-2017

Hi @saravanan p

Wondering if this might be of interest to you: I was able to concatenate an external table which originally had large number of small orc files into small number of huge orc files. I suppose it might work on internal tables as well

Our HDP is: 2.5.3

Cloudera Community

Support Questions

How to merge multiple part files while creating hive ORC files using "insert overwrite directory"