I' am using INSERT OVERWRITE DIRECTORY to create a ORC file as shown below.
insert overwrite directory '/hdfs/temp_table/' stored as ORC
select COL_1 ,COL_2, COL_3...FROM TMP_TABLE;
I' am getting multiple part files in the results when i run this query , which i want to reduce to 1 single ORC file.
How can i merge all the small partfiles into a single file?
Hi @saravanan p
one way of doing it is modify the job to run it in single reducer so that the output will be a single file.
Use this property to change the reducer to one. set mapred.reduce.tasks=1;
By default the no of files inserted in a hive table depends on the size of a file, size of map job, size of reducer job. Based on the size no of files inserted in a hive table varies.
max(mapreduce.input.fileinputformat.split.minsize, min(mapreduce.input.fileinputformat.split.maxsize, dfs.block.size))
If you have reducers running, then you should also look at
hive.exec.max.created.files, mapred.reduce.tasks, hive.exec.reducers.bytes.per.reducer
Hope it helps!
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
reducing the tasks may impact the overall the performance(however alter also run mr and consume resources.)
post to your insert you can run a alter table statement.
more on the same can be found at ORC documentation
Hi @saravanan p
Wondering if this might be of interest to you: I was able to concatenate an external table which originally had large number of small orc files into small number of huge orc files. I suppose it might work on internal tables as well
Our HDP is: 2.5.3