Reply
Highlighted
Champion Alumni
Posts: 161
Registered: ‎02-11-2014
Accepted Solution

Re writing Avro map reduce to Parquet map reduce

Hello All,

 

We have a  java map reduce application which reads in binary files does some  data processing and converts to avro data.Currently we have two avro schemas and use Avromultipleoutputs class to write to multiple locations based on the schema.After we did  some  research we found that it would be   beneficial if we could store the data as parquet.What is the best way to do this?Should I change the native map reduce  to convert from avro to parquet or is there some other utility that I can use?.

 

Thanks,

Nishan

Champion Alumni
Posts: 161
Registered: ‎02-11-2014

Re: Re writing Avro map reduce to Parquet map reduce

I tried using AvroParquetOutputFormat and MultipleOutputs class and was able to generate parquet files for a specific schema type.For the other schema type I am running into the below error.Any help is appreciated?

 

java.lang.ArrayIndexOutOfBoundsException: 2820
at org.apache.parquet.io.api.Binary.hashCode(Binary.java:489)
at org.apache.parquet.io.api.Binary.access$100(Binary.java:34)
at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.hashCode(Binary.java:382)
at org.apache.parquet.it.unimi.dsi.fastutil.objects.Object2IntLinkedOpenHashMap.getInt(Object2IntLinkedOpenHashMap.java:587)
at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.writeBytes(DictionaryValuesWriter.java:235)
at org.apache.parquet.column.values.fallback.FallbackValuesWriter.writeBytes(FallbackValuesWriter.java:162)
at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:203)
at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:347)
at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:257)
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)
at org.apache.parquet.avro.AvroWriteSupport.writeRecord(AvroWriteSupport.java:149)
at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:262)
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)
at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:142)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
at org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat$LazyRecordWriter.write(LazyOutputFormat.java:115)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:457)
at com.visa.dps.mapreduce.logger.LoggerMapper.map(LoggerMapper.java:271)

Announcements
The Kite SDK is a collection of docs, sample code, APIs, and tools to make Hadoop application development faster. Learn more at http://kitesdk.org.