
Help understanding corrupt ORC file in Hive


@Ryan Chapin

Looking for some suggestions. We have a query that was hanging indefinitely in Hive on Tez, running on HDP 2.4.0. After some debugging, we narrowed it down to a single ORC file in a Hive partition that contained that file plus about 10 others. If we move this one file out of the partition and rerun the query, it completes. If we include ONLY that one file in the partition, the query hangs. Even a simple "select * from table" hangs, and the query never gets beyond the first map task. I then discovered the ORC file dump feature of Hive and ran the following on this file:

hive --orcfiledump --skip-dump --recover -d hdfs://vmwhaddev01:8020/tmp/000002_0 > orc.dump

This command never returns to the command line; it hangs just as the Hive query does. I tested the same command on a known good file, and the dump completed successfully and returned control to the command line as expected. If I tail orc.dump during the hang, the last line of the file looks complete.

So, I am wondering whether this line is the end of a stripe and the next stripe is corrupt. The ORC reader seems to get into an infinite loop at this point. The dump output grows to about 382 MB and then stops increasing; from then on, top shows the Hive dump process continuously using about 99.9% CPU until I CTRL+C the command. Here's the last line, which is complete:


I am trying to determine whether I have uncovered a bug or whether the data that I am inserting into the ORC table somehow resulted in this condition. Either way, it seems like a bug if you can insert data that causes an ORC file to become corrupted. The ingest pipeline for this data is as follows.

  1. I convert raw CSV files into Avro and land them in an HDFS directory. There could be multiple Avro schemas in play here as there are multiple versions of these CSV files in flight. The Avro schemas are designed such that I can include all versions in the same Hive table. Typically, newer versions of these stats files add more columns of stats.
  2. Once a day, I move all the Avro files that have accumulated to a temp directory and create an external table over the files in that directory.
  3. I run a query that selects * from the external table and inserts all the results into another Hive managed table that is in ORC format, effectively using Hive to perform the Avro to ORC conversion. This query also performs a join with some data from one other table to enrich the data landing in the ORC table. This table is partitioned by year/month/day.
  4. Because the resulting ORC files are relatively small for HDFS, I perform one final step after the ORC insert query completes: I run a Hive query against the newly created partition to compact the ORC files. Typically, the reduce part generates around 70 ORC files. I run a query like the following for the year, month, and day of the partition just created, which typically compacts all 70 ORC files into about 5 much larger ones, each about 2-3 HDFS blocks (128 MB per block) in size.
alter table table_name partition (year=2016, month=3, day=16) concatenate;
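
For a daily job, the partition values in that statement have to be derived from the date being processed. A minimal sketch of how that could look in POSIX shell follows; the helper name `build_concat_sql` and its arguments are hypothetical and are not part of the pipeline described above.

```shell
# Hypothetical helper: print the CONCATENATE statement for one daily partition.
# Arguments: table name, then a date in YYYY-MM-DD form.
build_concat_sql() {
  table="$1"; date="$2"
  year="${date%%-*}"      # text before the first dash, e.g. 2016
  rest="${date#*-}"       # text after the first dash, e.g. 03-16
  month="${rest%%-*}"     # e.g. 03
  day="${rest#*-}"        # e.g. 16
  # Strip one leading zero so the values match partitions such as month=3, day=8.
  month="${month#0}"; day="${day#0}"
  printf 'alter table %s partition (year=%s, month=%s, day=%s) concatenate;\n' \
    "$table" "$year" "$month" "$day"
}

# Print the statement for the partition used in the example above.
build_concat_sql table_name 2016-03-16
```

The printed statement can then be passed to Hive, for example with `hive -e`.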

This is the first such issue we've seen in over two months of ingesting such files in this manner.

  • Does anyone have any ideas of where to look further to possibly understand the root cause of this problem?
  • Maybe the concatenate operation happened to cause the file corruption in this case? Anyone heard of such a thing?
  • Should I file a bug report and provide this corrupt ORC file for some forensic analysis? I don't really want to start trying to hex dump and decode ORC to figure out what happened.


    Just to add some additional information to this: we also ran the same Hive query using MR as the Hive execution engine, and it behaved the same way. Perhaps it is a problem with the ORC-related serialization classes?


    I should have been more specific in my last comment; that was the result of trying to do too many things at once and not taking the time to properly craft a comment. It seems obvious that this is an issue with the ORC SerDe code, but specifically it seems to be related to the code that reads each of the records in a given column.

    It /seems/ that the metadata for the stripes is valid. With only the 'corrupt' file in place, doing a

    SELECT COUNT(1) FROM vsat_lmtd WHERE year=2016 AND month=3 AND day=8;

    results in: 1810465 records

    Dumping the metadata about the same file with the command

    hive --orcfiledump <path-to-file>

    indicates the same number of records for the file:

    File Statistics:
       Column 0: count: 1810465 hasNull: false

    Grepping through the output of the aforementioned command indicates that the column for which we are having the problem /seems/ to have the same number of records, per stripe, as every other column in each stripe. Also, the overall average number of bytes per record differs by only a few percentage points across the files in this same partition, so I am assuming that the number of records reflected in the stripe metadata is an accurate account of what is actually in the file.

    Does anyone here know how to parse an ORC file and write the data in each stripe out to its own file? Doing so might help us isolate the problem to a specific record or records.


    How big is this specific ORC file, and can it be shared with us?

    Can you also check whether this is hanging in one of the mappers (the one reading this ORC file) or before it gets to the application/mapper in YARN?
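
    Since the hang also reproduces with the standalone dump command, outside of YARN, one option while testing is to run that command under a time limit so that a bad file is reported instead of blocking the terminal. The sketch below assumes GNU coreutils `timeout` is available; the wrapper name `run_bounded` and the 600-second limit in the usage comment are arbitrary choices, and the hive flags are copied from the original post.

```shell
# Run a command under a time limit and report when the limit is hit.
# GNU coreutils `timeout` exits with status 124 if it had to stop the command.
run_bounded() {
  limit="$1"; shift
  timeout "$limit" "$@"
  status=$?
  if [ "$status" -eq 124 ]; then
    # Report on stderr so redirected stdout (the dump itself) stays clean.
    echo "TIMED OUT after ${limit}s: $*" >&2
  fi
  return "$status"
}

# Usage (not run here; needs a Hive client and the file from the original post):
# run_bounded 600 hive --orcfiledump --skip-dump --recover -d hdfs://vmwhaddev01:8020/tmp/000002_0 > orc.dump
```

    Depending on how the launcher script starts the JVM, the Java process may outlive the wrapper and need to be killed separately.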