What are the best practices to get data from Apache Kafka to Apache Hive with HDFS as underlying distributed storage system?

New Contributor

Hi All,

I am building a data pipeline between Apache Kafka and Apache Hive, with HDFS as the underlying distributed storage. I would like to know the best practices and documentation for this pipeline.

I am using a Hortonworks HDP cluster.


Super Guru
@Vamshi Reddy

If you are using NiFi, it is very easy to get the data from Kafka and store it in Hive.

1. To consume the data, use the ConsumeKafka processor.
2. To store the data directly in a Hive table, use the PutHiveStreaming processor.

The PutHiveStreaming processor expects the incoming data in Avro format, and the target table needs to be transactional. So, depending on the format of the data consumed from Kafka, use the ConvertRecord processor to convert the source data into Avro format, then feed the Avro data into the PutHiveStreaming processor.
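As a rough illustration of the "transactional" requirement, here is a minimal Python sketch that only builds the DDL text for such a table. It assumes the Hive Streaming requirements that apply on HDP 2.x era Hive (ORC storage, bucketing, and `'transactional'='true'`). The table name `kafka_events`, the columns, and the bucket count are hypothetical placeholders, not values from this thread.

```python
def hive_streaming_ddl(table, columns, bucket_column, num_buckets=4):
    """Build a CREATE TABLE statement for a table intended for Hive Streaming.

    Assumption: the target Hive version requires streaming tables to be
    bucketed, stored as ORC, and flagged transactional.
    """
    column_names = [name for name, _ in columns]
    if bucket_column not in column_names:
        raise ValueError(f"bucket column {bucket_column!r} is not a table column")

    column_lines = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {column_lines}\n)\n"
        f"CLUSTERED BY (`{bucket_column}`) INTO {num_buckets} BUCKETS\n"
        "STORED AS ORC\n"
        "TBLPROPERTIES ('transactional'='true')"
    )


# Hypothetical table and columns, for illustration only.
ddl = hive_streaming_ddl(
    "kafka_events",
    [("id", "INT"), ("payload", "STRING")],
    bucket_column="id",
)
print(ddl)
```

The printed statement can be run in Beeline or the Hive CLI before starting the flow. Check the exact requirements against the Hive version on your cluster.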


2. ConvertRecord // convert the outgoing flowfile into Avro format
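ConvertRecord's record writer needs an Avro schema to write against. One way to prepare it is sketched below in Python, assuming the schema is pasted into the writer's schema text setting. The record name, namespace, and fields are hypothetical; the sketch also assumes you keep the field names aligned with the Hive column names.

```python
import json

# Hypothetical Avro schema for the converted records; adjust the fields
# to match the actual Kafka message contents and the Hive table columns.
avro_schema = {
    "type": "record",
    "name": "KafkaEvent",
    "namespace": "example.pipeline",
    "fields": [
        {"name": "id", "type": "int"},
        # Nullable string: a union with "null" listed first and a null default.
        {"name": "payload", "type": ["null", "string"], "default": None},
    ],
}

# Serialize to JSON text that can be pasted into the record writer.
schema_text = json.dumps(avro_schema, indent=2)
print(schema_text)
```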

Refer to this link for Hive transactional tables and this link for ConvertRecord processor usage.

Another way would be to use Kafka Connect, which also has Hive integration built in.


If this answer helped resolve your issue, click the Accept button below to accept it. That helps community users find solutions to these kinds of issues quickly.

Super Collaborator

Hive Streaming tables need to be ORC, right?

Do the Avro records automatically get converted?
