Archives of Support Questions (Read Only)

Nishan · 12-04-2015

Hello All,

We have a Java MapReduce application that reads binary files, does some data processing, and converts the result to Avro data. Currently we have two Avro schemas and use the AvroMultipleOutputs class to write to multiple locations based on the schema. After some research, we found that it would be beneficial to store the data as Parquet. What is the best way to do this? Should I change the native MapReduce job to convert from Avro to Parquet, or is there some other utility I can use?

Thanks,

Nishan

Nishan · 12-04-2015

I tried using the AvroParquetOutputFormat and MultipleOutputs classes and was able to generate Parquet files for one schema type. For the other schema type I am running into the error below. Any help is appreciated.

java.lang.ArrayIndexOutOfBoundsException: 2820
at org.apache.parquet.io.api.Binary.hashCode(Binary.java:489)
at org.apache.parquet.io.api.Binary.access$100(Binary.java:34)
at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.hashCode(Binary.java:382)
at org.apache.parquet.it.unimi.dsi.fastutil.objects.Object2IntLinkedOpenHashMap.getInt(Object2IntLinkedOpenHashMap.java:587)
at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.writeBytes(DictionaryValuesWriter.java:235)
at org.apache.parquet.column.values.fallback.FallbackValuesWriter.writeBytes(FallbackValuesWriter.java:162)
at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:203)
at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:347)
at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:257)
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)
at org.apache.parquet.avro.AvroWriteSupport.writeRecord(AvroWriteSupport.java:149)
at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:262)
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)
at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:142)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
at org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat$LazyRecordWriter.write(LazyOutputFormat.java:115)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:457)
at com.visa.dps.mapreduce.logger.LoggerMapper.map(LoggerMapper.java:271)
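
One possible line of investigation, offered as an assumption since nothing in this thread confirms the cause: the trace ends in hashCode of a ByteBuffer-backed Binary, which would be consistent with an index calculation running past the buffer's backing array. That can happen when an Avro bytes field holds a slice of a larger array or a buffer that is reused between records. A cheap experiment is to hand the writer a fresh, exactly sized copy of each bytes value. The sketch below uses only java.nio; the class and method names (ByteBufferCopies, compactCopy) are hypothetical and are not part of Parquet, Avro, or Hadoop.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper, not part of any library mentioned in this thread.
final class ByteBufferCopies {
    private ByteBufferCopies() {}

    /**
     * Returns a new heap buffer holding exactly the readable bytes of src.
     * The result has arrayOffset() == 0, position() == 0 and a backing array
     * whose length equals remaining(). The position and limit of src are
     * left untouched because the read goes through a duplicate view.
     */
    static ByteBuffer compactCopy(ByteBuffer src) {
        ByteBuffer view = src.duplicate();   // shares content, independent position/limit
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);                     // copies only the readable range
        return ByteBuffer.wrap(bytes);
    }

    public static void main(String[] args) {
        // Simulate a small value living inside a large shared array.
        byte[] shared = new byte[4096];
        Arrays.fill(shared, (byte) 7);
        ByteBuffer slice = ByteBuffer.wrap(shared, 2800, 100).slice();

        ByteBuffer copy = compactCopy(slice);
        System.out.println(slice.arrayOffset() + " -> " + copy.arrayOffset()); // 2800 -> 0
        System.out.println(copy.remaining() + " of " + copy.array().length);   // 100 of 100
    }
}
```

In the mapper, this would mean wrapping the value before it goes into the record, for example record.put("payload", ByteBufferCopies.compactCopy(buf)), where "payload" is a placeholder field name. If the exception disappears, the buffer layout was the trigger. If it persists, the copy rules that out and the schema wiring for the second named output is the next thing to check.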

Cloudera Community

Re writing Avro map reduce to Parquet map reduce