I'm working with NiFi and I need to merge ORC files that I fetch from an S3 bucket.
The flow goes like this:
ListS3 -> FetchS3Object -> MergeContent (output as Avro) -> ConvertAvroToORC -> PutS3Object.
(I also tried MergeContent alone, without outputting as Avro and converting to ORC, but it didn't work - the merged file was not valid.)
I get an error at the MergeContent processor:
ERROR MergeContent[id=XXXX] Failed to process bundle of 2 files due to org.apache.nifi.processor.exception.ProcessException: IO thrown from MergeContent[id=XXXX]: java.io.IOException: Not a data file.
ListS3 -> FetchS3Object -> MergeContent (output as Avro) -> ConvertAvroToORC -> PutS3Object
The above approach is correct. The MergeContent processor does not support ORC as a merge format, so we still need to merge all the Avro files into one and then feed it into the ConvertAvroToORC processor.
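As a sketch, MergeContent in this flow could be configured roughly as follows (the property names are from NiFi's MergeContent processor; the bin thresholds are illustrative assumptions, tune them to your file sizes):

```
Merge Strategy             : Bin-Packing Algorithm
Merge Format               : Avro
Minimum Number of Entries  : 2
Maximum Number of Entries  : 1000
Max Bin Age                : 5 min
```

Note that with Merge Format set to Avro, MergeContent parses each incoming FlowFile as an Avro container; if a non-Avro file (e.g. a raw ORC file fetched from S3) reaches the processor, the Avro reader fails with exactly the "java.io.IOException: Not a data file." error shown above.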
Supported merge formats in NiFi's MergeContent include Binary Concatenation, TAR, ZIP, FlowFile Stream (v2/v3), FlowFile Tar, and Avro - ORC is not among them.
How to merge small ORC files, then?
Merging small ORC files still needs to be done through Hive or Spark.
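For the Spark route, a minimal compaction job can be sketched like this (assuming a SparkSession already configured with S3 credentials; the bucket paths and the single-output-file choice are illustrative assumptions):

```python
# Sketch: compact many small ORC files into fewer, larger ones with Spark.
# Paths are placeholders; a real job needs S3 credentials configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-compaction").getOrCreate()

# Read all the small ORC files under the input prefix as one DataFrame.
df = spark.read.orc("s3a://my-bucket/orc-input/")

# coalesce(1) rewrites the data as a single ORC file; use a larger number
# (or repartition) if one file per output directory is too coarse.
df.coalesce(1).write.mode("overwrite").orc("s3a://my-bucket/orc-compacted/")

spark.stop()
```

Unlike Hive's CONCATENATE, this reserializes the data, but it works even without a Hive table on top of the files.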
1. Compacting small files using Concatenate:
Since you are storing the ORC files in S3, you can merge them provided you have a Hive table on top of the S3 ORC files.
Use ALTER TABLE ... CONCATENATE to merge small ORC files together by issuing a CONCATENATE command on the table or partition. The files will be merged at the stripe level without reserialization:
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
Refer to this link for more details regarding CONCATENATE.
2. Compacting small files without using Concatenate:
Let's assume your final ORC table has thousands of small ORC files. Create a temporary table by selecting from the final table:
hive> create table <db.name>.<temp_table_name> stored as orc as select * from <db_name>.<final_table>;
Now that we have created the temp table by selecting all the data from the final table, overwrite the final table by selecting the temp table's data. You can use ORDER BY / SORT BY / DISTRIBUTE BY clauses to create new files in the final table with an even distribution:
hive> insert overwrite table <db_name>.<final_table> select * from <db.name>.<temp_table_name> order by/sort by <some column>;
In addition, you can set the relevant Hive session properties before overwriting the final table.
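For example, the following session properties control small-file merging for the overwrite job (a sketch; the size values are illustrative assumptions in bytes and depend on your execution engine and cluster):

```
hive> SET hive.merge.tezfiles=true;
hive> SET hive.merge.smallfiles.avgsize=128000000;
hive> SET hive.merge.size.per.task=256000000;
```

With these set, Hive launches an extra merge step after the insert so the final table ends up with fewer, larger files.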
While following this approach, make sure no other applications are writing data into the final table until the overwrite job completes; since we overwrite from the temp table, any data written to the final table in the meantime will be lost.
If the answer helped to resolve your issue, click the Accept button below to accept the answer. That would be a great help to Community users looking for a solution to these kinds of issues.