What is the preferred way of storing protobuf-encoded data in HDFS? Currently I see two possible solutions:
a) Sequence files: store the serialized/encoded binary data, i.e., the raw bytes, as the value of each sequence-file record.
b) Parquet: Parquet provides protobuf/Parquet converters. My assumption is that when using those converters, the binary data must first be deserialized into an object representation, and that object must then be passed to the protobuf/Parquet converter to store it in Parquet. I assume doing so will incur higher performance costs compared to solution a). Since I have to process a high volume of small protobuf-encoded data chunks (streamed vehicle data provided via Kafka), performance and memory costs are important aspects.
c) Are there further alternatives?
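To make the contrast between a) and b) concrete, here is a minimal sketch of both write paths. This is illustrative only: `VehicleEvent` stands in for a generated protobuf message class, the HDFS paths are placeholders, and it assumes `hadoop-common`, `protobuf-java`, and (for option b) `parquet-protobuf` on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.parquet.proto.ProtoParquetWriter;

public class ProtoStorageSketch {

  // Option a) append the raw encoded bytes untouched, keyed by timestamp.
  static void writeSequenceFile(Configuration conf, byte[] encoded, long ts)
      throws Exception {
    Path path = new Path("hdfs:///data/vehicle-events.seq");
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
      writer.append(new LongWritable(ts), new BytesWritable(encoded));
    }
  }

  // Option b) decode each message first, then hand it to the converter,
  // which maps the proto schema onto Parquet's columnar layout.
  static void writeParquet(byte[] encoded) throws Exception {
    Path path = new Path("hdfs:///data/vehicle-events.parquet");
    try (ProtoParquetWriter<VehicleEvent> writer =
        new ProtoParquetWriter<>(path, VehicleEvent.class)) {
      writer.write(VehicleEvent.parseFrom(encoded)); // the extra decode step
    }
  }
}
```

The extra `parseFrom` in option b) is the per-record cost the question asks about; in exchange, Parquet's columnar layout usually pays off later through column pruning and predicate pushdown at query time.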
To sum up: I'm looking for a solution to store many small protobuf-encoded data chunks (i.e., vehicle sensor data) in HDFS while leaving the raw data as untouched as possible. However, it must be ensured that the data can be processed afterwards using MapReduce or Spark.
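For the "must be processable afterwards" requirement, reading the sequence file back in Spark is straightforward; a hedged sketch, again assuming a hypothetical generated `VehicleEvent` class and placeholder path:

```java
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadVehicleEvents {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("ReadVehicleEvents");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<VehicleEvent> events =
          sc.sequenceFile("hdfs:///data/vehicle-events.seq",
                  LongWritable.class, BytesWritable.class)
            // copyBytes() matters: Hadoop reuses Writable buffers, and
            // BytesWritable.getBytes() is padded beyond getLength().
            .map(pair -> VehicleEvent.parseFrom(pair._2().copyBytes()));
      System.out.println("decoded records: " + events.count());
    }
  }
}
```

Note that with this layout the decode cost is only deferred, not avoided: every job that needs field-level access pays `parseFrom` per record, whereas Parquet pays it once at write time.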
I think you are on the right track with Parquet. Depending on how you will later access the data, Avro may be a better fit. Here is a Kite SDK link about Avro vs. Parquet, including a presentation by Ryan Blue and Dennis Dawson of Cloudera on this topic.
If you need random access to individual records, HBase would be worth looking into as well.