
Best practices for storing lots of small protobuf encoded data in HDFS


Hi,

What is the preferred way of storing protobuf-encoded data in HDFS? Currently I see two possible solutions:

a) Sequence files: store the serialized/encoded binary data, i.e., the "byte[]", as the value of each record in a sequence file.

b) Parquet: Parquet provides protobuf/Parquet converters. My assumption is that, when using those converters, the binary data must first be deserialized into an object representation, and that object must then be passed to the protobuf/Parquet converter to store it in Parquet. I assume this costs more in performance than solution a). As I have to process a large number of small protobuf-encoded data chunks (streamed vehicle data provided by Kafka), performance and memory costs are important aspects.

c) Are there further alternatives?

To sum up: I'm looking for a way to store many small protobuf-encoded data chunks (e.g., vehicle sensor data) in HDFS while leaving the raw data untouched as far as possible. However, it must still be possible to process the data afterwards using MapReduce or Spark.
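To make option a) more concrete, here is a rough sketch of the underlying idea: many small encoded messages are appended to one container as key/value records, and the payload bytes are never decoded. It is only an illustration in plain Python. The record layout (8-byte key, 4-byte length, raw payload) and the names write_records and read_records are assumptions made up for this example, and this is not the actual SequenceFile format, which additionally has a file header, sync markers, and optional compression. In Hadoop itself the equivalent would go through SequenceFile.Writer with BytesWritable values.

```python
import io
import struct

# Hypothetical record layout for this sketch (NOT the real SequenceFile format):
# 8-byte signed big-endian key, 4-byte unsigned big-endian length, raw payload.
HEADER = ">qI"
HEADER_SIZE = struct.calcsize(HEADER)


def write_records(stream, records):
    """Append (key, payload) pairs to a binary stream without touching the payload."""
    for key, payload in records:
        stream.write(struct.pack(HEADER, key, len(payload)))
        stream.write(payload)


def read_records(stream):
    """Yield (key, payload) pairs back from a stream written by write_records."""
    while True:
        header = stream.read(HEADER_SIZE)
        if not header:
            return  # clean end of stream
        if len(header) < HEADER_SIZE:
            raise EOFError("truncated record header")
        key, length = struct.unpack(HEADER, header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated record payload")
        yield key, payload


# Example: keys are timestamps in milliseconds, values are opaque encoded bytes.
messages = [
    (1700000000000, b"\x08\x96\x01"),
    (1700000000050, b"\x08\x2a"),
]
container = io.BytesIO()
write_records(container, messages)
container.seek(0)
restored = list(read_records(container))
```

The point of the layout is that a later MapReduce or Spark job can decode each value with the generated protobuf class on read, so the ingest path from Kafka only copies bytes.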

 

Best, Thomas


Re: Best practices for storing lots of small protobuf encoded data in HDFS


I think you are on the right track with Parquet. Depending on how you will access the data later, Avro may be a better fit. Here is a Kite SDK link [1] comparing Avro and Parquet, including a presentation on this topic by Ryan Blue and Dennis Dawson of Cloudera.

 

If you have a need to randomly access the data, HBase would be something to look into as well.

 

[1] http://kitesdk.org/docs/1.0.0/Parquet-vs-Avro-Format.html

 

 
