<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to insert parquet file to Kafka and pass them to HDFS/Hive in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-insert-parquet-file-to-Kafka-and-pass-them-to-HDFS/m-p/178341#M61400</link>
    <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/18391/mehdihosseinzadeh86.html" nodeid="18391"&gt;@Mehdi Hosseinzadeh&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;From a requirements perspective, the following is a simple approach in line with the technologies you proposed.&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Read the data from HTTP using a Spark Streaming job and write it into &lt;A href="https://docs.cloud.databricks.com/docs/latest/databricks_guide/07%20Spark%20Streaming/09%20Write%20Output%20To%20Kafka.html"&gt;Kafka&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Read &amp;amp; process the data from the Kafka topic as batches/streams and save it into HDFS as Parquet/Avro/ORC, etc.&lt;/LI&gt;&lt;LI&gt;Build external tables in Hive (on top of the data processed in step 2) so that the data is available as soon as it is placed in HDFS&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Accessing data from external tables has been discussed &lt;A href="https://community.hortonworks.com/questions/5833/create-hive-table-to-read-parquet-files-from-parqu.html"&gt;here&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 22 May 2017 10:27:53 GMT</pubDate>
    <dc:creator>bkosaraju</dc:creator>
    <dc:date>2017-05-22T10:27:53Z</dc:date>
    <item>
      <title>How to insert parquet file to Kafka and pass them to HDFS/Hive</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-insert-parquet-file-to-Kafka-and-pass-them-to-HDFS/m-p/178340#M61399</link>
      <description>&lt;P&gt;I'm designing an architecture for our company's data infrastructure. This infrastructure will receive payment and user-behavior events from an HTTP producer (Node.js).&lt;/P&gt;&lt;P&gt;I've planned to use Kafka, Hive, and Spark, using the Avro file format for Kafka and passing the events to HDFS and Hive via Kafka Connect. But I've read that Avro is not the preferred file format for processing with Spark and that I should use Parquet instead.&lt;/P&gt;&lt;P&gt;Now the question is: how can I generate Parquet and pass it to/from Kafka? I'm confused by the many choices here. Any advice/resource is welcome. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 18 May 2017 23:39:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-insert-parquet-file-to-Kafka-and-pass-them-to-HDFS/m-p/178340#M61399</guid>
      <dc:creator>Mehdi_hosseinza</dc:creator>
      <dc:date>2017-05-18T23:39:17Z</dc:date>
    </item>
    <item>
      <title>Re: How to insert parquet file to Kafka and pass them to HDFS/Hive</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-insert-parquet-file-to-Kafka-and-pass-them-to-HDFS/m-p/178341#M61400</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/18391/mehdihosseinzadeh86.html" nodeid="18391"&gt;@Mehdi Hosseinzadeh&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;From a requirements perspective, the following is a simple approach in line with the technologies you proposed.&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Read the data from HTTP using a Spark Streaming job and write it into &lt;A href="https://docs.cloud.databricks.com/docs/latest/databricks_guide/07%20Spark%20Streaming/09%20Write%20Output%20To%20Kafka.html"&gt;Kafka&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Read &amp;amp; process the data from the Kafka topic as batches/streams and save it into HDFS as Parquet/Avro/ORC, etc.&lt;/LI&gt;&lt;LI&gt;Build external tables in Hive (on top of the data processed in step 2) so that the data is available as soon as it is placed in HDFS&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Accessing data from external tables has been discussed &lt;A href="https://community.hortonworks.com/questions/5833/create-hive-table-to-read-parquet-files-from-parqu.html"&gt;here&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 May 2017 10:27:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-insert-parquet-file-to-Kafka-and-pass-them-to-HDFS/m-p/178341#M61400</guid>
      <dc:creator>bkosaraju</dc:creator>
      <dc:date>2017-05-22T10:27:53Z</dc:date>
    </item>
  </channel>
</rss>

