We're currently designing a real-time data pipeline that must process hundreds of thousands of messages per second, encoded with ASN.1 BER and arriving over multiple TCP sockets (sustained ~200 MB/s), and produce the decoded data to Kafka topics as well as to Hive. We're thinking of leveraging NiFi for this workload.
The objective is to minimize the number of content-transformation operations and to use a format that supports schemas (or some sort of grammar), is compact (space complexity), offers fast encode/decode operations (time complexity), is supported by Hive, and can be easily consumed by our external Kafka consumers. I'm not too worried about schema evolution, because our schemas are very stable over time.
Having said that, one of my main concerns is the processing effort of decoding the ASN.1 (for which we'll need to develop a custom processor) and then serializing the data to a compact, performant format that is suitable both for Kafka, for our external consumers, and for Hive, for analytics.
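To make the decode cost concrete, this is roughly the tag/length/value walk any BER decoder performs for every field. A minimal stdlib-Python sketch only, not our processor: it skips multi-byte tags, indefinite lengths, and constructed types, and the example bytes are invented:

```python
def parse_tlv(buf: bytes, pos: int = 0):
    """Parse one BER TLV at buf[pos]; return (tag, value_bytes, next_pos).

    Simplification: ignores multi-byte tags, indefinite lengths,
    and constructed types.
    """
    tag = buf[pos]
    pos += 1
    length = buf[pos]
    pos += 1
    if length & 0x80:                 # long form: low 7 bits = number of length octets
        n = length & 0x7F
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    return tag, buf[pos:pos + length], pos + length

# e.g. a universal INTEGER with value 5: tag 0x02, length 1, one content byte
tag, value, end = parse_tlv(bytes([0x02, 0x01, 0x05]))
```

At ~200 MB/s this walk happens for every field of every message, which is one reason decoders generated by an ASN.1 compiler tend to beat generic ones.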
The most obvious options that come to mind:
Hive can operate directly on Avro-serialized files stored in HDFS (via its AvroSerDe).
Although Kafka is content-agnostic, our external Kafka consumers would benefit from Avro-encoded data along with a schema registry facility.
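On the producer side that integration would look something like the standard Confluent configuration (assuming Confluent Schema Registry; host names are placeholders):

```properties
bootstrap.servers=kafka1:9092,kafka2:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
# Registers/looks up the writer schema and prepends its id to each message
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
schema.registry.url=http://schema-registry:8081
```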
Avro's binary format is reasonably compact and can additionally be compressed.
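Part of why Avro binary is compact: it carries no field tags at all, ints/longs use variable-length zig-zag encoding so small values take one byte, and strings are just a length prefix plus UTF-8. A stdlib-Python sketch of the spec's long and string encodings:

```python
def encode_long(n: int) -> bytes:
    """Avro binary encoding of a long: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)          # zig-zag maps small magnitudes to small codes
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)   # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """Avro string: length prefix + UTF-8 bytes, no field tag."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data
```

So a record is essentially its field values back to back, which is also why a reader always needs the writer's schema (hence the registry).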
The drawback (and what scares me) is that a number of benchmarks report that Avro encode/decode operations are quite expensive compared with other serialization frameworks.
We'll definitely do some benchmarking ourselves, but I'm wondering whether the community knows of another design that would produce a leaner pipeline and still meet our requirements.
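For the benchmarking itself, even a crude `timeit` harness gives comparable messages/sec numbers across candidate encoders. A sketch under stated assumptions: `json` stands in as a stdlib baseline, and the record fields are invented; the real Avro/Protobuf encode functions would be swapped in:

```python
import json
import timeit

# Invented example record; replace with a representative decoded message
record = {"imsi": "123456789012345", "cell_id": 42, "ts": 1700000000000}

def bench(encode, n: int = 100_000) -> float:
    """Rough messages/sec for one encode function over n iterations."""
    secs = timeit.timeit(lambda: encode(record), number=n)
    return n / secs

# json is only a baseline; measure each candidate encoder the same way
baseline = bench(lambda r: json.dumps(r).encode("utf-8"))
```

Measuring on representative payloads matters more than any published benchmark, since encoder performance depends heavily on the schema shape.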