Created 02-11-2019 02:34 PM
Hello,
I'm running into an error when I try to read my ORC files from Hive (external table), from Pig, or with hive --orcfiledump.
These files are generated by Flink using the ORC Java API with vectorized column batches (VectorizedRowBatch).
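For context, here is roughly how the files are written. This is only a minimal standalone sketch of the ORC Java API with VectorizedRowBatch, not my actual Flink job; the schema and path are placeholders (I put a timestamp column in because the failing frame is TimestampTreeReader):

import java.sql.Timestamp;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.TimestampColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder schema; my real schema differs.
        TypeDescription schema = TypeDescription.fromString("struct<id:bigint,ts:timestamp>");
        Configuration conf = new Configuration();
        Writer writer = OrcFile.createWriter(
                new Path("hdfs:///tmp/example.orc"),            // placeholder path
                OrcFile.writerOptions(conf).setSchema(schema));

        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector id = (LongColumnVector) batch.cols[0];
        TimestampColumnVector ts = (TimestampColumnVector) batch.cols[1];

        for (long r = 0; r < 10_000; r++) {
            int row = batch.size++;
            id.vector[row] = r;
            ts.set(row, new Timestamp(System.currentTimeMillis()));
            if (batch.size == batch.getMaxSize()) {
                writer.addRowBatch(batch);  // flush a full batch
                batch.reset();
            }
        }
        if (batch.size != 0) {
            writer.addRowBatch(batch);      // flush the final partial batch
        }
        writer.close();                     // the footer is only written here
    }
}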
If I create these files locally (/tmp/...) and then push them to HDFS, I can read their content from Pig or through an external table in Hive.
If instead I point the writer directly at HDFS, I get this error:
Failure while running task: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:54)
    at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:302)
    at org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$TimestampTreeReader.next(TreeReaderFactory.java:1105)
    at org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2079)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1082)
    at org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat$OrcRecordReader.nextKeyValue(OrcNewInputFormat.java:108)
The same error occurs if I copy those files back to the local filesystem and read them there.
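To rule out Hive and Pig, I also check the files with a minimal standalone read. This sketch uses the core org.apache.orc reader rather than the Hive one in the stack trace above, and the path is a placeholder, but a corrupt file should fail here too:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class OrcReadCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Reader reader = OrcFile.createReader(
                new Path("hdfs:///tmp/example.orc"),  // placeholder path
                OrcFile.readerOptions(conf));
        RecordReader rows = reader.rows();
        VectorizedRowBatch batch = reader.getSchema().createRowBatch();
        long total = 0;
        while (rows.nextBatch(batch)) {  // decoding happens here, batch by batch
            total += batch.size;
        }
        rows.close();
        System.out.println("read " + total + " rows");
    }
}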
Created 02-17-2019 06:11 PM
In fact, the problem is related to the ORC Java API when parallelism is enabled (multi-threaded writes).
I use Flink, and when I set a parallelism > 1 on the sink that generates the ORC files, the resulting data is unreadable.
I've seen some tickets about this issue, for example: https://jira.apache.org/jira/browse/ORC-361
For the moment I'm running with a parallelism of 1, but I need to fix this issue in order to scale my ingest pipeline.
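One workaround I'm considering, assuming the root cause is the JVM-wide MemoryManager that the default writer options share between all threads (which is the kind of thread-safety problem the ORC tickets discuss), is to give each writer its own memory manager. A sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;
import org.apache.orc.impl.MemoryManagerImpl;

public class PerWriterMemory {
    public static Writer open(Configuration conf, Path path, TypeDescription schema)
            throws java.io.IOException {
        return OrcFile.createWriter(path,
                OrcFile.writerOptions(conf)
                        .setSchema(schema)
                        // Private memory manager for this writer, instead of the
                        // shared static one used by default.
                        .memory(new MemoryManagerImpl(conf)));
    }
}

Each parallel sink subtask would also have to write to its own file path; as far as I know the ORC Writer itself is not thread-safe, so two threads must never share one Writer instance.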
All help is welcome.
Thx