Support Questions
Find answers, ask questions, and share your expertise

What are the best practices to get data from Apache Kafka to Apache Hive with HDFS as underlying distributed storage system?


New Contributor

Hi All,

I am in the process of building a data pipeline between Apache Kafka and Apache Hive, with HDFS as the underlying distributed data storage. I would like to know the best practices and documentation for the above data pipeline.

I am using a Hortonworks HDP cluster.

2 REPLIES

Re: What are the best practices to get data from Apache Kafka to Apache Hive with HDFS as underlying distributed storage system?

Super Guru
@Vamshi Reddy

If you are using NiFi, it is very easy to get the data from Kafka and store it into Hive.

1. To consume the data, use the ConsumeKafka processor.
2. To store the data directly into a Hive table, use the PutHiveStreaming processor.

The PutHiveStreaming processor expects the incoming data in Avro format, and the target table needs to have transactions enabled. So, based on the format of the data coming out of ConsumeKafka, use a ConvertRecord processor to convert the source data into Avro, then feed the Avro data into PutHiveStreaming.

Flow:

1. ConsumeKafka
2. ConvertRecord // convert the outgoing flowfile into Avro format
3. PutHiveStreaming
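In NiFi the three processors above are wired together on the canvas, but the shape of the flow can be sketched in plain Python. This is only an illustration of the consume → convert → batched-write pattern; the function and field names are hypothetical stand-ins, not NiFi APIs.

```python
import json

# Hypothetical stand-ins: in the real flow, ConsumeKafka reads from a Kafka
# topic, ConvertRecord serializes to Avro, and PutHiveStreaming commits
# records to a transactional Hive table in small transactions.

def consume_kafka(messages):
    """Stand-in for ConsumeKafka: yields raw message bytes."""
    for m in messages:
        yield m

def convert_record(raw):
    """Stand-in for ConvertRecord: parse JSON into a structured record
    (the real processor would re-serialize this as Avro)."""
    obj = json.loads(raw)
    return (obj["id"], obj["name"])

def put_hive_streaming(records, batch_size=2):
    """Stand-in for PutHiveStreaming: commit records in batches,
    the way Hive Streaming groups writes into transactions."""
    batch, committed = [], []
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:
            committed.append(list(batch))
            batch.clear()
    if batch:
        committed.append(list(batch))
    return committed

raw_messages = [b'{"id": 1, "name": "a"}',
                b'{"id": 2, "name": "b"}',
                b'{"id": 3, "name": "c"}']
batches = put_hive_streaming(map(convert_record, consume_kafka(raw_messages)))
print(batches)  # [[(1, 'a'), (2, 'b')], [(3, 'c')]]
```

The point of the sketch is that the conversion step sits between the consumer and the writer, and the writer commits in batches rather than per record.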

Refer to this link for hive transactional tables and this link for ConvertRecord processor usage.
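Since PutHiveStreaming needs a transactional table, the DDL would look roughly like the sketch below. Table, column, and bucket choices are placeholders; Hive Streaming requires the table to be bucketed, stored as ORC, and have transactions enabled.

```sql
-- Illustrative only: table name, columns, and bucket count are hypothetical.
CREATE TABLE kafka_events (
  id INT,
  name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```

The cluster also needs ACID support turned on (e.g. hive.support.concurrency=true and hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager); the Hive transactional tables link covers those settings.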

Another way would be using Kafka Connect, which also has Hive integration built in.
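For the Kafka Connect route, a sink connector config would look roughly like this. This assumes the Confluent HDFS sink connector (which has a Hive integration option); all hostnames, topic names, and values are placeholders.

```properties
# Illustrative sketch of a Kafka Connect HDFS sink with Hive integration.
# Assumes the Confluent HDFS sink connector; all values are placeholders.
name=hdfs-hive-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=kafka_events
hdfs.url=hdfs://namenode:8020
flush.size=1000
hive.integration=true
hive.metastore.uris=thrift://metastore-host:9083
hive.database=default
schema.compatibility=BACKWARD
```

With hive.integration enabled, the connector creates and updates the Hive table metadata as it writes files to HDFS, so you do not need a separate load step.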

-

If the answer helped to resolve your issue, click the Accept button below to accept it. That helps community users find solutions quickly for these kinds of issues.

Re: What are the best practices to get data from Apache Kafka to Apache Hive with HDFS as underlying distributed storage system?

Super Collaborator

Hive Streaming tables need to be ORC, right?

Do the Avro records automatically get converted?