Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Rewriting Avro MapReduce to Parquet MapReduce

Hello All,

We have a Java MapReduce application that reads binary files, does some data processing, and converts the result to Avro data. Currently we have two Avro schemas and use the AvroMultipleOutputs class to write to multiple locations based on the schema. After some research, we found that it would be beneficial to store the data as Parquet. What is the best way to do this? Should I change the native MapReduce job to convert from Avro to Parquet, or is there some other utility that I can use?

Thanks,

Nishan

1 ACCEPTED SOLUTION

I tried using AvroParquetOutputFormat with the MultipleOutputs class and was able to generate Parquet files for one schema type. For the other schema type I am running into the error below. Any help is appreciated.

java.lang.ArrayIndexOutOfBoundsException: 2820
at org.apache.parquet.io.api.Binary.hashCode(Binary.java:489)
at org.apache.parquet.io.api.Binary.access$100(Binary.java:34)
at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.hashCode(Binary.java:382)
at org.apache.parquet.it.unimi.dsi.fastutil.objects.Object2IntLinkedOpenHashMap.getInt(Object2IntLinkedOpenHashMap.java:587)
at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.writeBytes(DictionaryValuesWriter.java:235)
at org.apache.parquet.column.values.fallback.FallbackValuesWriter.writeBytes(FallbackValuesWriter.java:162)
at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:203)
at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:347)
at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:257)
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)
at org.apache.parquet.avro.AvroWriteSupport.writeRecord(AvroWriteSupport.java:149)
at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:262)
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)
at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:142)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
at org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat$LazyRecordWriter.write(LazyOutputFormat.java:115)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:457)
at com.visa.dps.mapreduce.logger.LoggerMapper.map(LoggerMapper.java:271)
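
One thing that may be worth checking, as an assumption that this thread does not confirm: the trace fails while hashing a ByteBuffer-backed value for the dictionary writer. That can happen if a bytes field is populated with a ByteBuffer that is a view into a larger or reused array, so that its offset and length no longer agree with the backing array when the writer reads it. A minimal sketch of a defensive copy, using only java.nio, is below. The class name ByteBufferCopy and the helper compactCopy are hypothetical names made up for illustration; they are not part of Parquet or Avro.

```java
import java.nio.ByteBuffer;

public class ByteBufferCopy {

    /**
     * Hypothetical helper: returns a new heap ByteBuffer that holds only the
     * remaining bytes of src, backed by its own exactly-sized array.
     * The position and limit of src are left unchanged.
     */
    public static ByteBuffer compactCopy(ByteBuffer src) {
        ByteBuffer view = src.duplicate();          // independent position/limit, shared content
        byte[] bytes = new byte[view.remaining()];  // exactly as many bytes as are readable
        view.get(bytes);                            // copy the readable bytes out of the view
        return ByteBuffer.wrap(bytes);              // arrayOffset 0, capacity == length
    }

    public static void main(String[] args) {
        // Simulate a 20-byte value that lives inside a larger shared buffer.
        byte[] shared = new byte[4096];
        ByteBuffer slice = ByteBuffer.wrap(shared, 2800, 20).slice();

        ByteBuffer copy = compactCopy(slice);

        // Print array offset, remaining bytes, and backing array length.
        System.out.println("slice: " + slice.arrayOffset() + " " + slice.remaining() + " " + slice.array().length);
        System.out.println("copy:  " + copy.arrayOffset() + " " + copy.remaining() + " " + copy.array().length);
    }
}
```

If this turns out to be the cause, the copy would be made in the mapper just before the value is put into the Avro record, so that every record owns its own bytes. Whether that applies to LoggerMapper cannot be told from the trace alone.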
