gzip compression for hdfs sink gives Not an Avro data file


Hello,

Below is my configuration, which works fine for uncompressed data:

agent.sinks.test.type = hdfs
agent.sinks.test.hdfs.useLocalTimeStamp = true
agent.sinks.test.hdfs.path = s3n://AccessKeys@test/%{topic}/utc=%s
agent.sinks.test.hdfs.roundUnit = minute
agent.sinks.test.hdfs.round = true
agent.sinks.test.hdfs.roundValue = 10
agent.sinks.test.hdfs.fileSuffix = .avro
agent.sinks.test.serializer = com.test.flume.sink.serializer.GenericRecordAvroEventSerializer$Builder
agent.sinks.test.hdfs.fileType = DataStream
agent.sinks.test.hdfs.maxOpenFiles=100
agent.sinks.test.hdfs.appendTimeout = 5000
agent.sinks.test.hdfs.callTimeout = 4000
agent.sinks.test.hdfs.rollInterval = 60
agent.sinks.test.hdfs.rollSize = 0 
agent.sinks.test.hdfs.rollCount = 1000
agent.sinks.test.hdfs.batchSize = 1000
agent.sinks.test.hdfs.threadsPoolSize=100

 

I am trying to add gzip compression to it as follows:

agent.sinks.test.type = hdfs
agent.sinks.test.hdfs.useLocalTimeStamp = true
agent.sinks.test.hdfs.path = s3n://AccessKeys@test/%{topic}/utc=%s
agent.sinks.test.hdfs.roundUnit = minute
agent.sinks.test.hdfs.round = true
agent.sinks.test.hdfs.roundValue = 10
agent.sinks.test.hdfs.fileSuffix = .avro
agent.sinks.test.serializer = com.test.flume.sink.serializer.GenericRecordAvroEventSerializer$Builder
agent.sinks.test.hdfs.fileType = CompressedStream
agent.sinks.test.hdfs.codeC = gzip
agent.sinks.test.hdfs.maxOpenFiles=100
agent.sinks.test.hdfs.appendTimeout = 10000
agent.sinks.test.hdfs.callTimeout = 4000
agent.sinks.test.hdfs.rollInterval = 60
agent.sinks.test.hdfs.rollSize = 0
agent.sinks.test.hdfs.rollCount = 1000
agent.sinks.test.hdfs.batchSize = 1000
agent.sinks.test.hdfs.threadsPoolSize=100

 

All of the above data is stored in S3, and when I try to read it from Hive I get the error below.

Exception in thread "main" java.io.IOException: Not an Avro data file

 

Can you please let me know why my configuration is not working?
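
One way to narrow down the error above might be to look at the first bytes of one of the files written to S3, after downloading it locally. The sketch below is only a diagnostic aid, not a confirmed explanation: it relies on the standard magic numbers (`Obj\x01` for an Avro object container file, `\x1f\x8b` for a gzip stream), while the function names and the file name in the usage note are hypothetical.

```python
import gzip
import io

AVRO_MAGIC = b"Obj\x01"   # first four bytes of an Avro object container file
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream


def sniff_format(head: bytes) -> str:
    """Classify a byte string by its leading magic number."""
    if head.startswith(AVRO_MAGIC):
        return "avro"
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    return "unknown"


def describe(data: bytes) -> str:
    """Report the outer format and, for gzip, the format of the payload inside."""
    outer = sniff_format(data)
    if outer != "gzip":
        return outer
    # Decompress only the first few bytes to see what the gzip stream wraps.
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as gz:
        inner = sniff_format(gz.read(len(AVRO_MAGIC)))
    return f"gzip({inner})"
```

For example, `describe(open("downloaded-file.avro", "rb").read())` returning `gzip(avro)` would suggest that the whole Avro container was gzip-wrapped, so a reader that expects the Avro magic at byte 0 would not find it. A result of `avro` would point the investigation elsewhere.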