Created 03-19-2016 07:52 PM
Looking for some suggestions. We have a query that was hanging indefinitely in Hive on Tez; we are running HDP 2.4.0. After some debugging, we narrowed it down to a single ORC file in a Hive partition that contained that file plus about 10 others. If we move this one file out of the partition and rerun the query, it completes. If we include ONLY that one file in the partition, the query hangs; even a simple "select * from table" hangs, and the query never gets beyond the first map task. I then discovered the ORC file dump feature of Hive and ran the following against this file:
hive --orcfiledump --skip-dump --recover -d hdfs://vmwhaddev01:8020/tmp/000002_0 > orc.dump
This command also hangs and never returns to the command line, similar to what Hive does. I tested the same command on a known-good file, and that dump completes successfully and returns control to the command line as expected. So even the dump test hangs on this file. If I tail orc.dump during the hang, the last line of the file looks complete.
So I am wondering: is this last line the end of a stripe, with the next stripe corrupt? The ORC reader seems to get into an infinite loop at that point. The dump writes about 382 MB of data before it hangs; once the output file stops growing, top shows the Hive dump process continuously using about 99.9% CPU until I Ctrl+C the command. Here's the last line, which is complete:
{"_col0":"DSN001000325021","_col1":10784199,"_col2":1457431200,"_col3":1457434800,"_col4":20209,"_col5":0,"_col6":60,"_col7":10,"_col8":1456222331,"_col9":120,"_col10":117,"_col11":114,"_col12":0,"_col13":0,"_col14":0,"_col15":0,"_col16":0,"_col17":0,"_col18":120,"_col19":121,"_col20":123,"_col21":124,"_col22":125,"_col23":0,"_col24":0,"_col25":15296815,"_col26":2,"_col27":0,"_col28":163528,"_col29":88,"_col30":1498,"_col31":29082,"_col32":874908,"_col33":51565,"_col34":104138,"_col35":149,"_col36":6,"_col37":0,"_col38":0,"_col39":1508,"_col40":0,"_col41":2248,"_col42":46961,"_col43":624,"_col44":1732,"_col45":0,"_col46":0,"_col47":0,"_col48":0,"_col49":41,"_col50":159,"_col51":12,"_col52":30,"_col53":0,"_col54":0,"_col55":0,"_col56":0,"_col57":0,"_col58":0,"_col59":0,"_col60":0,"_col61":1398668,"_col62":5916915,"_col63":5916855,"_col64":5986115,"_col65":249,"_col66":66,"_col67":547,"_col68":76,"_col69":132618,"_col70":17398,"_col71":140325,"_col72":19012,"_col73":0,"_col74":0,"_col75":0,"_col76":0,"_col77":"TUC04HNSIGW63B002Adv","_col78":1456805959,"_col79":0,"_col80":158,"_col81":136,"_col82":0,"_col83":0,"_col84":1,"_col85":0,"_col86":12,"_col87":12,"_col88":0,"_col89":0,"_col90":0,"_col91":0,"_col92":12,"_col93":12,"_col94":0,"_col95":0,"_col96":0,"_col97":0,"_col98":18,"_col99":14,"_col100":0,"_col101":0,"_col102":0,"_col103":0,"_col104":12,"_col105":12,"_col106":0,"_col107":0,"_col108":0,"_col109":0,"_col110":51565,"_col111":37383,"_col112":0,"_col113":402,"_col114":0,"_col115":449,"_col116":3126,"_col117":46256,"_col118":28682,"_col119":4,"_col120":0,"_col121":0,"_col122":0,"_col123":0,"_col124":0,"_col125":0,"_col126":0,"_col127":0,"_col128":0,"_col129":0,"_col130":0,"_col131":0,"_col132":0,"_col133":0,"_col134":0,"_col135":0,"_col136":0,"_col137":0,"_col138":0,"_col139":0,"_col140":0,"_col141":0,"_col142":0,"_col143":"20.2.1-3569","_col144":0,"_col145":0,"_col146":0,"_col147":0,"_col148":0,"_col149":0,"_col150":0,"_col151":0,"_col152":0,"_col153":0,"_col154":0,"_col155":0,"_col156":0,"_col157":0,"_col158":0,"_col159":0,"_col160":0,"_col161":0,"_col162":0,"_col163":0,"_col164":0,"_col165":0,"_col166":0,"_col167":0,"_col168":0,"_col169":0,"_col170":0,"_col171":0,"_col172":0,"_col173":4,"_col174":"12-AUG-15","_col175":12,"_col176":"HT1100","_col177":"1","_col178":"B4WB16S2","_col179":"CORE_DSN_PROD_HT1100_50K","_col180":"SW_DSN_PROD_HT1100_50K","_col181":"3.2.0.24","_col182":"1000"}
I am trying to determine whether I have uncovered a bug or whether the data I am inserting into the ORC table somehow resulted in this condition. Either way, it seems like a bug if inserting data can cause an ORC file to become corrupted. The ingest pipeline for this data includes the following concatenation step:
alter table table_name partition (year=2016, month=3, day=16) concatenate;
This is the first such issue we've seen in over two months of ingesting such files in this manner.
Created 03-19-2016 10:22 PM
Just to add some additional information: we also ran the same Hive query using MR as the execution engine, and it behaved the same way. Perhaps it is a problem with the ORC-related serialization classes?
Created 03-20-2016 05:12 PM
I should have been more specific in my last comment; that was the result of trying to do too many things at once and not taking the time to craft the comment properly. It seems clear that this is an issue with the ORC SerDe code, and specifically with the code that reads the records of a given column.
It /seems/ that the metadata for the stripes is valid. With only the 'corrupt' file in place, doing a
SELECT COUNT(1) FROM vsat_lmtd WHERE year=2016 AND month=3 AND day=8;
results in: 1810465 records
Dumping the metadata for the same file with the command
hive --orcfiledump <path-to-file>
indicates the same number of records for the file:
File Statistics: Column 0: count: 1810465 hasNull: false
Grepping through the output of that command indicates that the column where we are having the problem /seems/ to have the same number of records, per stripe, as every other column in each stripe. Also, the overall average number of bytes per record differs by only a few percent across the files in this partition, so I am assuming the record counts in the stripe metadata are an accurate account of what is actually in the file.
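For what it's worth, the same per-stripe, per-column counts can be checked programmatically instead of by grepping the dump. A minimal sketch, assuming the Hive 1.2 (HDP 2.4) hive-exec jar on the classpath; the class name is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.ColumnStatistics;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.StripeStatistics;

// Flags any column whose per-stripe value count differs from the stripe's
// row count (column 0, the root struct).
public class StripeColumnCounts {
  public static void main(String[] args) throws Exception {
    Reader reader = OrcFile.createReader(new Path(args[0]),
        OrcFile.readerOptions(new Configuration()));
    int stripeIx = 0;
    for (StripeStatistics stats : reader.getMetadata().getStripeStatistics()) {
      ColumnStatistics[] cols = stats.getColumnStatistics();
      long rows = cols[0].getNumberOfValues();
      for (int c = 1; c < cols.length; c++) {
        // Note: counts exclude nulls, so a mismatch is only meaningful
        // for columns with no nulls (hasNull: false above).
        if (cols[c].getNumberOfValues() != rows) {
          System.out.printf("stripe %d: column %d has %d values, expected %d%n",
              stripeIx, c, cols[c].getNumberOfValues(), rows);
        }
      }
      stripeIx++;
    }
  }
}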
Does anyone here know how to parse an ORC file and separate the data in each stripe into its own file? Doing so might help us isolate the problem to a specific record or records.
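I don't know of a stock tool that splits stripes out into separate files, but since each stripe is self-contained, the reader can at least be restricted to one stripe's byte range at a time to see which stripe it never comes back from. A minimal sketch, again assuming the Hive 1.2 (HDP 2.4) API, untested against a corrupt file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;
import org.apache.hadoop.hive.ql.io.orc.StripeInformation;

// Reads an ORC file one stripe at a time so a hang can be attributed
// to a specific stripe.
public class ReadByStripe {
  public static void main(String[] args) throws Exception {
    Reader reader = OrcFile.createReader(new Path(args[0]),
        OrcFile.readerOptions(new Configuration()));
    int stripeIx = 0;
    for (StripeInformation stripe : reader.getStripes()) {
      // Restrict the row reader to this stripe's byte range only.
      RecordReader rows = reader.rowsOptions(
          new Reader.Options().range(stripe.getOffset(), stripe.getLength()));
      long count = 0;
      Object row = null;
      try {
        while (rows.hasNext()) { // a bad stripe should hang or throw here
          row = rows.next(row);
          count++;
        }
      } finally {
        rows.close();
      }
      System.out.printf("stripe %d: read %d of %d rows%n",
          stripeIx++, count, stripe.getNumberOfRows());
    }
  }
}

If it spins at ~100% CPU inside one stripe, a jstack of the process while it is stuck should show where in the reader it is looping.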
Created 05-02-2016 09:00 PM
How big is this specific ORC file, and can it be shared with us?
Can you also check whether this is hanging in one of the mappers (the one reading this ORC file) or before you get into the application/mapper in YARN?