I have a customer who has hundreds of tab-separated data files dumped into a single HDFS directory, with a Hive schema defined over them. The plan is to create an ORC-based table (with partitioning, bucketing, etc.) and INSERT INTO TABLE from the TSV-based table. We began the process but ran into the following error:
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3236)
	at org.apache.hadoop.io.Text.setCapacity(Text.java:266)
	at org.apache.hadoop.io.Text.append(Text.java:236)
	at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:243)
	at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
	at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:141)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:325)
Caused by: java.io.IOException: Too many bytes before newline: 2147483648
	at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
	at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
	at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
	at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:136)
	at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
	at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader
We have determined that at least one of the many TSV files in the directory does not conform to the newline record delimiter specified in the schema, and this is preventing the Hive job from completing. If we can figure out which file Hive is working on, we can remove it from the directory and get back to business. Is there a way to get Hive to identify which file it's working on when this error is encountered? We're going to bump up the log level to see if the file info is revealed, but we're looking for alternatives in case that doesn't work.
Hi @Ryan Templeton.
Some suggestions (all loosely described). I don't want to seem trivial with some of these, but other people may end up with the same issue in the future.
These may all fail with the same error, but they may be worth a try.
The examples below assume that the data is partitioned by YYYY/MM/DD. This makes it easier to write a sample query.
I'm also assuming that all the input TSV files have a similar number of lines, which may make it easier to find the outlier file(s).
1. Do row counts work on the table (without even trying the INSERT INTO)? I assume you've tried a row count selecting by the partition clause and you get the same error? If it does work, is there one YYYY/MM/DD result that has zero (or far fewer) rows?
select YR, MNTH, DY, count(*) from <tbl> group by YR, MNTH, DY;
2. If #1 doesn't work, have you tried running the same query limited to a specific year? If a year works, then the files for that YYYY are golden. If a year fails, drill down to YYYY/MM and then YYYY/MM/DD, successively eliminating the good partitions/files until you get to the bad file. (This all depends on partition elimination working for the query.)
select YR, count(*) from <tbl> where YR = '2012' group by YR; -- then try '2013', '2014', '2015', '2016'
3. You could try running the query from #1 using the virtual column INPUT__FILE__NAME.
select INPUT__FILE__NAME, count(*) from <tbl> group by INPUT__FILE__NAME;
4. Write a Pig script that counts the number of newlines in each file. The invalid file might be the one returning zero rows.
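The newline-count idea in #4 can also be sketched outside the cluster. Here is a rough Python version (my own illustration, not a Pig script; it assumes the files have been copied to a local directory, e.g. with `hdfs dfs -get`):

```python
import os

def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes in a file by streaming fixed-size chunks,
    so even a multi-GB file never fills the heap."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count

def newline_report(directory):
    """Return {filename: newline_count} for every regular file in a directory."""
    return {
        name: count_newlines(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
        if os.path.isfile(os.path.join(directory, name))
    }
```

A file reporting zero (or drastically fewer) newlines than its siblings is the likely culprit.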
I hope this helps.
My problem is that anything Hive-query-based isn't viable. As soon as Hive starts reading the files to parse the data according to the table schema, the offending (corrupted) file causes the RecordReader to read until the heap completely fills up, resulting in the OOM error. I ended up writing an app that reads only the first N bytes of each file looking for the row delimiter. It ran through the whole directory in just a few seconds and located the file that was causing the error. Here is a link to the code.
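A minimal sketch of that approach (my own Python approximation, not the linked code): scan only the first N bytes of each file, and flag any file in which the row delimiter never appears. It assumes local copies of the files and a newline delimiter.

```python
import os

def find_files_missing_delimiter(directory, delimiter=b"\n", max_bytes=1 << 20):
    """Return the files whose first max_bytes contain no row delimiter.

    Reading only a bounded prefix keeps this fast and memory-safe even
    when a corrupted file has no delimiter for gigabytes.
    """
    suspects = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            head = f.read(max_bytes)
        if delimiter not in head:
            suspects.append(name)
    return suspects
```

Any healthy TSV file should show a newline well inside the first megabyte, so whatever this returns is worth pulling out of the directory and inspecting.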
I had this problem and it turned out to be a malformed file/table/CSV. This issue occurred in both Spark SQL and Hive; see: http://stackoverflow.com/questions/41213879/spark-map-filter-throws-java-io-ioexception-too-many-byt...