<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re writing Avro map reduce to Parquet map reduce in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Re-writing-Avro-map-reduce-to-Parquet-map-reduce/m-p/34765#M11405</link>
    <description>&lt;P&gt;Hello All,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We have a Java MapReduce application that reads in binary files, does some data processing, and converts the data to Avro. Currently we have two Avro schemas and use the AvroMultipleOutputs class to write to multiple locations based on the schema. After doing some research, we found that it would be beneficial to store the data as Parquet. What is the best way to do this? Should I change the native MapReduce job to convert from Avro to Parquet, or is there some other utility I can use?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Nishan&lt;/P&gt;</description>
    <pubDate>Fri, 04 Dec 2015 18:16:35 GMT</pubDate>
    <dc:creator>Nishan</dc:creator>
    <dc:date>2015-12-04T18:16:35Z</dc:date>
    <item>
      <title>Re writing Avro map reduce to Parquet map reduce</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Re-writing-Avro-map-reduce-to-Parquet-map-reduce/m-p/34765#M11405</link>
      <description>&lt;P&gt;Hello All,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We have a Java MapReduce application that reads in binary files, does some data processing, and converts the data to Avro. Currently we have two Avro schemas and use the AvroMultipleOutputs class to write to multiple locations based on the schema. After doing some research, we found that it would be beneficial to store the data as Parquet. What is the best way to do this? Should I change the native MapReduce job to convert from Avro to Parquet, or is there some other utility I can use?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Nishan&lt;/P&gt;</description>
      <pubDate>Fri, 04 Dec 2015 18:16:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Re-writing-Avro-map-reduce-to-Parquet-map-reduce/m-p/34765#M11405</guid>
      <dc:creator>Nishan</dc:creator>
      <dc:date>2015-12-04T18:16:35Z</dc:date>
    </item>
    <item>
      <title>Re: Re writing Avro map reduce to Parquet map reduce</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Re-writing-Avro-map-reduce-to-Parquet-map-reduce/m-p/34780#M11406</link>
      <description>&lt;P&gt;I tried using AvroParquetOutputFormat and the MultipleOutputs class and was able to generate Parquet files for one schema type. For the other schema type I am running into the error below. Any help is appreciated.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;java.lang.ArrayIndexOutOfBoundsException: 2820&lt;BR /&gt;at org.apache.parquet.io.api.Binary.hashCode(Binary.java:489)&lt;BR /&gt;at org.apache.parquet.io.api.Binary.access$100(Binary.java:34)&lt;BR /&gt;at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.hashCode(Binary.java:382)&lt;BR /&gt;at org.apache.parquet.it.unimi.dsi.fastutil.objects.Object2IntLinkedOpenHashMap.getInt(Object2IntLinkedOpenHashMap.java:587)&lt;BR /&gt;at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.writeBytes(DictionaryValuesWriter.java:235)&lt;BR /&gt;at org.apache.parquet.column.values.fallback.FallbackValuesWriter.writeBytes(FallbackValuesWriter.java:162)&lt;BR /&gt;at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:203)&lt;BR /&gt;at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:347)&lt;BR /&gt;at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:257)&lt;BR /&gt;at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)&lt;BR /&gt;at org.apache.parquet.avro.AvroWriteSupport.writeRecord(AvroWriteSupport.java:149)&lt;BR /&gt;at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:262)&lt;BR /&gt;at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)&lt;BR /&gt;at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:142)&lt;BR /&gt;at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)&lt;BR /&gt;at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)&lt;BR 
/&gt;at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)&lt;BR /&gt;at org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat$LazyRecordWriter.write(LazyOutputFormat.java:115)&lt;BR /&gt;at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:457)&lt;BR /&gt;at com.visa.dps.mapreduce.logger.LoggerMapper.map(LoggerMapper.java:271)&lt;/P&gt;</description>
      <pubDate>Fri, 04 Dec 2015 20:46:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Re-writing-Avro-map-reduce-to-Parquet-map-reduce/m-p/34780#M11406</guid>
      <dc:creator>Nishan</dc:creator>
      <dc:date>2015-12-04T20:46:28Z</dc:date>
    </item>
  </channel>
</rss>

