I have an external partitioned table, and the partition is based on four CSV files of less than 2 GB each. One of the files causes the problem as soon as it is added to the partition directory.
Splitting that file into two equal parts (by row count, 4071510/2) with the command-line "split" command makes the problem go away. I am completely unable to figure out what the problem is. Here's the stack trace:
Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1461850869883_0002_6_00, diagnostics=[Task failed, taskId=task_1461850869883_0002_6_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.lang.IndexOutOfBoundsException
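For reference, the workaround described above can be sketched as follows. This is a hedged illustration, not the exact commands from the question: the filenames are assumptions, and a small 10-line stand-in file is used in place of the real 4,071,510-row CSV.

```shell
#!/bin/sh
# Stand-in for the real ~2 GB CSV: a 10-line sample file (name is an assumption).
seq 1 10 > sample.csv

# Split into chunks of 5 lines each (for the real file this would be
# `split -l 2035755 bigfile.csv ...`, i.e. 4071510 / 2 rows per half).
# Default alphabetic suffixes produce sample_part_aa and sample_part_ab.
split -l 5 sample.csv sample_part_

# Each chunk now holds half the rows; together they reproduce the original file.
wc -l sample_part_aa sample_part_ab
```

Both halves can then be copied into the same partition directory in place of the original file, which is what worked in the question.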
Hive doesn't have a limit on the maximum row count. You said it's an external partitioned table; did you add the partitions using …?
No, I added the partition using:
ALTER TABLE mytable ADD PARTITION (partitioncolumn="2016-04-30") LOCATION '/user/data/partitioncolumn=2016-04-30'
By the way, splitting the file in two and adding both halves to the same partition folder works. Whenever I put the complete file in the folder, I get this exception.
Worked like a charm! Could you also add a short comment explaining what was happening and how this setting fixed it?