I am building a data pipeline between Apache Kafka and Apache Hive, with HDFS as the underlying distributed data storage. I would like to know the best practices and documentation for this pipeline.
I am using a Hortonworks HDP cluster.
If you are using NiFi, it is very easy to get the data from Kafka and store it in Hive.
1. To consume the data, use the ConsumeKafka processor.
2. To store the data directly in a Hive table, use the PutHiveStreaming processor.
The PutHiveStreaming processor expects incoming data in Avro format, and the target table must be transactional. So, depending on the format of the data coming out of the Kafka consumer, use the ConvertRecord processor to convert the source data into Avro, then feed the Avro data into PutHiveStreaming.
So the flow is ConsumeKafka -> ConvertRecord (converts the outgoing flowfile into Avro format) -> PutHiveStreaming.
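To make the two requirements above concrete (Avro on the way in, a transactional table on the Hive side), here is a minimal Python sketch that derives both an Avro schema for the ConvertRecord writer and a matching Hive DDL from one field list. The table name, record name, columns, and bucket count are all hypothetical. The DDL assumes the usual Hive Streaming requirements as I understand them: the table is stored as ORC, marked transactional, and (on Hive versions before 3) bucketed. PutHiveStreaming reads the Avro records and writes them through the Hive Streaming API, so the ORC files are produced on the Hive side.

```python
import json

# Hypothetical record layout: (column name, Avro type, Hive type).
FIELDS = [
    ("event_id", "long", "BIGINT"),
    ("user_name", "string", "STRING"),
    ("amount", "double", "DOUBLE"),
]


def avro_schema(record_name, fields):
    """Build an Avro record schema (as a dict) for the Avro record writer."""
    return {
        "type": "record",
        "name": record_name,
        "fields": [
            {"name": name, "type": avro_type} for name, avro_type, _ in fields
        ],
    }


def hive_streaming_ddl(table, fields, bucket_column, buckets=4):
    """Build DDL for a bucketed, ORC-backed, transactional table."""
    columns = ",\n  ".join(
        f"{name} {hive_type}" for name, _, hive_type in fields
    )
    return (
        f"CREATE TABLE {table} (\n  {columns}\n)\n"
        f"CLUSTERED BY ({bucket_column}) INTO {buckets} BUCKETS\n"
        "STORED AS ORC\n"
        "TBLPROPERTIES ('transactional'='true')"
    )


if __name__ == "__main__":
    # Paste the schema into the record writer and run the DDL in Hive.
    print(json.dumps(avro_schema("kafka_event", FIELDS), indent=2))
    print(hive_streaming_ddl("kafka_events", FIELDS, "event_id"))
```

Keeping both definitions in one place is just a convenience so the Avro field names and the Hive column names cannot drift apart; you can of course write the schema and the DDL by hand.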
Another way would be to use Kafka Connect, which also has Hive integration built in.
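For the Kafka Connect route, a rough sketch of a connector definition is shown here. It assumes the Confluent HDFS sink connector (one common choice, not necessarily the only one), and the connector name, topic, NameNode URL, and metastore URI are hypothetical placeholders. As far as I know, with Hive integration turned on this connector creates an external Hive table over the files it writes to HDFS, rather than writing into a transactional table.

```python
import json


def hdfs_sink_config(name, topic, hdfs_url, metastore_uri, database="default"):
    """Build a connector definition for the Kafka Connect REST API."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "hdfs.url": hdfs_url,
            # Number of records written per file before it is committed.
            "flush.size": "1000",
            # Register the written data as a table in the Hive metastore.
            "hive.integration": "true",
            "hive.metastore.uris": metastore_uri,
            "hive.database": database,
            # Hive integration needs a schema compatibility mode to be set.
            "schema.compatibility": "BACKWARD",
        },
    }


if __name__ == "__main__":
    config = hdfs_sink_config(
        "hive-sink",                 # hypothetical connector name
        "events",                    # hypothetical topic
        "hdfs://namenode:8020",      # hypothetical NameNode address
        "thrift://metastore:9083",   # hypothetical metastore address
    )
    print(json.dumps(config, indent=2))
```

The printed JSON is the kind of payload you would POST to the `/connectors` endpoint of a Kafka Connect worker.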
If this answer helped resolve your issue, click the Accept button below to accept it. That helps community users find solutions to these kinds of issues quickly.
Hive Streaming tables need to be ORC, right?
Do the Avro records automatically get converted?