Rewriting an Avro MapReduce job to a Parquet MapReduce job
Created ‎12-04-2015 10:16 AM
Hello All,
We have a Java MapReduce application that reads binary files, does some data processing, and converts the results to Avro. We currently have two Avro schemas and use the AvroMultipleOutputs class to write to multiple locations based on the schema. After some research we found it would be beneficial to store the data as Parquet instead. What is the best way to do this? Should I change the native MapReduce job to convert from Avro to Parquet, or is there some other utility I can use?
Thanks,
Nishan
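For context, the pipeline described above (a map-only job routing decoded records to one of two Avro named outputs) looks roughly like the sketch below. Class, output, and helper names are illustrative, not taken from the original application:

```java
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: decode each binary record and route it to one of two
// Avro named outputs depending on which schema it matches.
public class BinaryToAvroMapper
        extends Mapper<LongWritable, BytesWritable, AvroKey<GenericRecord>, NullWritable> {

    private AvroMultipleOutputs outputs;

    @Override
    protected void setup(Context context) {
        outputs = new AvroMultipleOutputs(context);
    }

    @Override
    protected void map(LongWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        GenericRecord record = decode(value.copyBytes());
        String namedOutput = matchesSchemaA(record) ? "schemaA" : "schemaB";
        outputs.write(namedOutput, new AvroKey<>(record), NullWritable.get());
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        outputs.close();  // flush and close every named output
    }

    // Placeholders for the application-specific binary decoding logic.
    private GenericRecord decode(byte[] raw) { return null; }
    private boolean matchesSchemaA(GenericRecord record) { return true; }
}
```

The named outputs ("schemaA", "schemaB" here) would be registered in the driver with `AvroMultipleOutputs.addNamedOutput`, one per schema.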
Created ‎12-04-2015 12:46 PM
I tried using AvroParquetOutputFormat with the MultipleOutputs class and was able to generate Parquet files for one schema type. For the other schema type I am running into the error below. Any help is appreciated.
java.lang.ArrayIndexOutOfBoundsException: 2820
at org.apache.parquet.io.api.Binary.hashCode(Binary.java:489)
at org.apache.parquet.io.api.Binary.access$100(Binary.java:34)
at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.hashCode(Binary.java:382)
at org.apache.parquet.it.unimi.dsi.fastutil.objects.Object2IntLinkedOpenHashMap.getInt(Object2IntLinkedOpenHashMap.java:587)
at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.writeBytes(DictionaryValuesWriter.java:235)
at org.apache.parquet.column.values.fallback.FallbackValuesWriter.writeBytes(FallbackValuesWriter.java:162)
at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:203)
at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:347)
at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:257)
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)
at org.apache.parquet.avro.AvroWriteSupport.writeRecord(AvroWriteSupport.java:149)
at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:262)
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)
at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:142)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
at org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat$LazyRecordWriter.write(LazyOutputFormat.java:115)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:457)
at com.visa.dps.mapreduce.logger.LoggerMapper.map(LoggerMapper.java:271)
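A sketch of the driver-side configuration for the approach described here, with one caveat that may be relevant to the error above: `AvroParquetOutputFormat.setSchema()` stores a single write schema in the job configuration, shared by every named output, so records of a second schema end up written against the first schema's layout. Output and variable names are illustrative:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class ParquetDriverSketch {

    public static void configureOutputs(Job job, Schema schemaA, Schema schemaB) {
        // Each named output uses the Parquet output format; keys are Void,
        // values are Avro records.
        MultipleOutputs.addNamedOutput(job, "schemaA",
                AvroParquetOutputFormat.class, Void.class, GenericRecord.class);
        MultipleOutputs.addNamedOutput(job, "schemaB",
                AvroParquetOutputFormat.class, Void.class, GenericRecord.class);

        // Caveat: this sets ONE write schema for the whole job. A named
        // output receiving records of the other schema can then fail with
        // low-level write errors, consistent with only one schema type
        // producing files successfully.
        AvroParquetOutputFormat.setSchema(job, schemaA);

        // Matches the LazyOutputFormat frame in the stack trace: suppresses
        // empty part files for named outputs a task never writes to.
        LazyOutputFormat.setOutputFormatClass(job, AvroParquetOutputFormat.class);
    }
}
```

One common workaround under this limitation is to run one job (or one output directory plus a separate job) per schema, so each job carries a single write schema.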
