
How to insert Parquet files into Kafka and pass them to HDFS/Hive


New Contributor

I'm designing the architecture for our company's data infrastructure. It will receive payment and user-behavior events from an HTTP producer (Node.js).

I plan to use Kafka, Hive, and Spark, with the Avro file format for the Kafka messages, passing them via Kafka Connect to HDFS and Hive. But I've read that Avro is not the preferred file format for processing with Spark and that I should use Parquet instead.

The question is: how can I generate Parquet and pass it to/from Kafka? I'm confused by the many choices here. Any advice or resources are welcome. :)

1 ACCEPTED SOLUTION

Re: How to insert Parquet files into Kafka and pass them to HDFS/Hive

Super Collaborator

Hi @Mehdi Hosseinzadeh,

From a requirements perspective, the following is a simple approach that stays in line with the technologies you proposed.

  1. Read the data from HTTP using a Spark Streaming job and write it into Kafka.
  2. Read and process the data from the Kafka topic, as batches or as a stream, and save it to HDFS as Parquet/Avro/ORC (see the sketch after this list).
  3. Build external tables in Hive on top of the data processed in step 2, so that the data is available as soon as it lands in HDFS (a DDL sketch follows below).
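As a concrete illustration of step 2, below is a minimal sketch using PySpark Structured Streaming. Note that Parquet is a columnar file format, so the usual pattern is to keep a lightweight row format (JSON or Avro) in the Kafka messages and only convert to Parquet when landing the data in HDFS. The broker address, topic name, event schema, and HDFS paths here are illustrative assumptions, and the job needs the Spark-Kafka connector package on the classpath.

```python
# Sketch: consume JSON events from Kafka and land them in HDFS as Parquet.
# Broker, topic, schema, and paths are assumptions; adjust to your environment.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("kafka-to-hdfs-parquet").getOrCreate()

# Assumed payload produced by the Node.js HTTP service.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker
       .option("subscribe", "payment_events")              # assumed topic
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers the payload as bytes in the `value` column; parse it with the schema above.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Write micro-batches to HDFS as Parquet; the Hive external table in step 3 sits on this path.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events/parquet")           # assumed landing path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```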

Accessing the data from external tables has been discussed here.
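For step 3, here is a hedged sketch of the external table DDL, issued through spark.sql so it stays in the same language as the sketch above (plain HiveQL in beeline works just as well). The table, column, and path names mirror the assumptions in the streaming sketch; because the table is EXTERNAL, Hive only points at the directory, so new Parquet files written by the streaming job become queryable as soon as they land.

```python
# Sketch: external Hive table over the Parquet landing path (names are assumptions).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("create-events-table")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events_parquet (
        user_id    STRING,
        event_type STRING,
        amount     DOUBLE,
        event_time TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/events/parquet'
""")

# Queries pick up whatever files the streaming job has written so far.
spark.sql("SELECT event_type, COUNT(*) AS cnt FROM events_parquet GROUP BY event_type").show()
```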
