I think the question above says it all, but to add more background, we have pig scripts landing data in ORC files hourly and are ending up with numerous small files in a partition directory. alter table partition concatenate appears to run (we end up with different file names in the partition folder), but doesn't seem to be actually concatenating into fewer files. I'm wondering if the problem is the fact that these are external tables.... but the fact that files within the partition are getting renamed suggests it's not.
@Vincent Romeo concatenate orc files on external tables works. did you try using mr vs tez ? if you launch your concatenate in Hive, could you put some debug on (hive --hiveconf hive.root.logger=INFO,console) and let's have a look at the logs?
We have a requirement to concatenate huge number of small ORC files into small number of big ORC files, through this post I was encouraged to try it on our external table and it just worked fine.
This is a test that we ran on our dev environment, we created 100 small ORC files and then used the concatenate command it turned them into 5 big files (next we'll try to figure out how to influence the number of files/size)
Hey @Harsha Mohan, I am trying to do the same it is concatenating upto 2 files and after several runs it is converting to 1 big file. Do you any other approach that can convert to 1 big file in single run?
We've just been trying the same thing and find that Hive does not support CONCATENATE on external tables:
ALTER TABLE DD_EXT_ORC CONCATENATE; FAILED: SemanticException org.apache.hadoop.hive.ql.parse.SemanticException: Concatenate/Merge can only be performed on managed tables
Maybe it's a limitation that was introduced in more recent versions of Hive. I tested using HDP-18.104.22.168 which has Hive 1.2.1000.
There's a workaround mentioned here (https://stackoverflow.com/questions/47794614/how-to-merge-orc-files-for-external-tables) which works when I try this:
ALTER TABLE DD_EXT_ORC SET TBLPROPERTIES('EXTERNAL'='FALSE'); ALTER TABLE DD_EXT_ORC CONCATENATE; ALTER TABLE DD_EXT_ORC SET TBLPROPERTIES('EXTERNAL'='TRUE');
It's a little unsatisfactory (needs a TRY/FINALLY to ensure the final line is executed but don't think Hive-SQL has exception handling) so I would love to hear from anybody who has a better solution... Perhaps there's a hidden Hive parameter allowing CONCATENATE on external tables.
Note that later versions of Hive (3.0+) appear not to support this workaround. To be marked as internal a table must be managed, hence transactional=true, but once a table is marked as transactional it cannot be reverted. See https://community.hortonworks.com/questions/129969/can-we-change-transactionalfalse-from-transaction...
My first statement(
ALTER TABLE DD_EXT_ORC SET TBLPROPERTIES('EXTERNAL'='FALSE');
) fails with the following error :
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Table test failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.