How to merge ORC files in NiFi?


Hi,

I'm working with NiFi and I need to merge ORC files. I get the files from an S3 bucket.

The flow goes like this:

ListS3 -> FetchS3Object -> MergeContent (output as Avro) -> ConvertAvroToORC -> PutS3Object

(I also tried MergeContent alone, without outputting Avro and converting to ORC, but it didn't work: the merged file was not valid.)

I get an error at the MergeContent processor:

ERROR MergeContent[id=XXXX] Failed to process bundle of 2 files due to org.apache.nifi.processor.exception.ProcessException: IO thrown from MergeContent[id=XXXX]: java.io.IOException: Not a data file.

@gal itzhak
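List s3 -> Fetch s3 objects -> merge content(output as avro) -> convert Avro to orc -> put s3 object

The flow above is the right one when the incoming files are Avro. MergeContent does not support ORC as a merge format, so we still need to merge all the Avro files into one and then feed it into the ConvertAvroToORC processor. The "Not a data file" error is what the Avro reader raises when its input is not an Avro file, which is consistent with ORC files from S3 being passed to an Avro merge.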
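
Since the Avro merge format expects Avro input, it can help to check what the fetched S3 objects actually contain before they reach MergeContent. A minimal sketch in Python, outside NiFi: Avro container files begin with the bytes `Obj` followed by 0x01, and ORC files begin with `ORC`. The helper name `sniff_format` is hypothetical and only used for illustration.

```python
AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file
ORC_MAGIC = b"ORC"       # first three bytes of an ORC file


def sniff_format(head: bytes) -> str:
    """Guess the file format from the leading bytes of a file."""
    if head.startswith(AVRO_MAGIC):
        return "avro"
    if head.startswith(ORC_MAGIC):
        return "orc"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first four bytes of a local copy and classify it."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(4))
```

If a downloaded object is reported as "orc", MergeContent with the Avro merge format will not be able to read it.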

Supported Merge Formats in NiFi:

[Screenshot: Merge Format options of the MergeContent processor]

Alternatively, you can use the MergeRecord processor, which reads the incoming flowfiles and writes the merged flowfile with the configured Record Writer, merging the records according to the processor's configuration.


How do we merge small ORC files then?


Small ORC files still have to be merged through Hive or Spark.

1. Compacting small files using Concatenate:

Since you are storing the ORC files in S3, you can merge them there, provided you have a Hive table on top of the S3 ORC files.

Use ALTER TABLE ... CONCATENATE to merge small ORC files by issuing the command on the table or partition. The files are merged at the stripe level without reserialization.

ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;

Refer to this link for more details on CONCATENATE.
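
When a table has many partitions, the CONCATENATE statement is usually issued once per partition. As a rough sketch, the statements can be generated with a small script and then run through whatever Hive client you use. The function name `concatenate_statements` and the partition column `dt` are assumptions for illustration, and every partition value is quoted as a string literal, which you would need to adjust for non-string partition columns.

```python
def concatenate_statements(table, partitions):
    """Render one ALTER TABLE ... CONCATENATE statement per partition spec.

    `partitions` is a list of dicts mapping partition column to value.
    """
    statements = []
    for spec in partitions:
        # Values are quoted as strings; adjust for numeric partition columns.
        clause = ", ".join(f"{col}='{val}'" for col, val in spec.items())
        statements.append(f"ALTER TABLE {table} PARTITION ({clause}) CONCATENATE;")
    return statements


if __name__ == "__main__":
    for statement in concatenate_statements(
        "istari", [{"dt": "2018-06-01"}, {"dt": "2018-06-02"}]
    ):
        print(statement)
```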

(or)

2. Compacting small files without using Concatenate:

Step 1:
Assume your final ORC table has thousands of small ORC files. Create a temporary table by selecting everything from the final table:

hive> create table <db_name>.<temp_table_name> stored as orc as select * from <db_name>.<final_table>;

Step 2:
Now that the temp table holds all the data from the final table, overwrite the final table by selecting from the temp table. You can use ORDER BY, SORT BY, or DISTRIBUTE BY clauses so that the new files in the final table are evenly distributed.

hive> insert overwrite table <db_name>.<final_table> select * from <db_name>.<temp_table_name> order by/sort by <some column>;

In addition, you can set any Hive session properties you need before overwriting the final table.

With this approach, make sure no other application writes to the final table until the overwrite job completes. Because the final table is overwritten from the temp table, any data written to it in the meantime will be lost.
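
The two steps, together with session properties, can be put into one script. The sketch below only renders the HiveQL statements as strings. The name `compaction_script` is hypothetical, and the default `hive.merge.*` properties are my assumption of commonly used small-file merge settings, not properties named above, so replace them with whatever suits your cluster.

```python
def compaction_script(db, final_table, temp_table, sort_column, settings=None):
    """Render the two-step compaction as a list of HiveQL statements."""
    if settings is None:
        # Assumed defaults: ask Hive to merge small output files after the job.
        settings = {
            "hive.merge.tezfiles": "true",
            "hive.merge.mapfiles": "true",
            "hive.merge.mapredfiles": "true",
        }
    statements = [f"SET {key}={value};" for key, value in settings.items()]
    # Step 1: copy the final table into a temporary ORC table.
    statements.append(
        f"CREATE TABLE {db}.{temp_table} STORED AS ORC "
        f"AS SELECT * FROM {db}.{final_table};"
    )
    # Step 2: overwrite the final table from the copy, sorted by one column.
    statements.append(
        f"INSERT OVERWRITE TABLE {db}.{final_table} "
        f"SELECT * FROM {db}.{temp_table} SORT BY {sort_column};"
    )
    return statements
```

Calling `compaction_script("mydb", "events", "events_tmp", "event_time")` returns the SET lines followed by the create and overwrite statements, in the order they should run.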

Refer to this, this and this link for more details on how to compact small files.
