Support Questions

How to insert Parquet files into Kafka and pass them to HDFS/Hive

New Contributor

I'm designing the architecture for our company's data infrastructure. It will receive payment and user-behaviour events from an HTTP producer (Node.js).

My plan is to use Kafka, Hive, and Spark, with Avro as the message format in Kafka, and to move the data into HDFS and Hive via Kafka Connect. However, I've read that Avro is not the preferred file format for processing with Spark and that I should use Parquet instead.

So the question is: how can I generate Parquet and pass it to/from Kafka? I'm confused by the many choices here. Any advice or resources are welcome. 🙂

1 ACCEPTED SOLUTION

Super Collaborator

Hi @Mehdi Hosseinzadeh,

From a requirements perspective, the following is a simple approach that stays in line with the technologies you proposed.

  1. Read the data from HTTP with a Spark Streaming job and write it into Kafka.
  2. Read and process the data from the Kafka topic as batches/streams and save it into HDFS as Parquet / Avro / ORC, etc. (see the sketch after this list).
  3. Build external tables in Hive on top of the data processed in step 2, so the data becomes available as soon as it lands in HDFS.
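
If it helps, here is a minimal sketch of step 2 using Spark Structured Streaming with the spark-sql-kafka-0-10 connector on the classpath. The topic name, broker addresses, and HDFS paths are placeholders for your cluster, and the Kafka value is simply cast to a string; in practice you would plug in your Avro/JSON deserialization and schema instead:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object KafkaToParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-to-parquet")
          .getOrCreate()

        // Read the payment / user-behaviour events from the Kafka topic as a stream.
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "payment-events")
          .option("startingOffsets", "latest")
          .load()
          // Kafka delivers key/value as binary; cast the value to a string here,
          // or replace this with your Avro/JSON decoding.
          .select(col("value").cast("string").as("event"))

        // Write the stream to HDFS as Parquet; the checkpoint location lets the
        // job recover its Kafka offsets after a restart.
        val query = events.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/events/parquet")
          .option("checkpointLocation", "hdfs:///checkpoints/kafka-to-parquet")
          .start()

        query.awaitTermination()
      }
    }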

Accessing the data from external tables has been discussed here
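
As a sketch of step 3, assuming the streaming job above writes string-valued events under hdfs:///data/events/parquet, the external table can be created through Spark SQL with Hive support enabled. The table and column names below are placeholders for your real event schema:

    import org.apache.spark.sql.SparkSession

    object CreateExternalTable {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("create-external-table")
          .enableHiveSupport()   // requires Hive configuration on the classpath
          .getOrCreate()

        // An EXTERNAL table only points at the HDFS directory, so new Parquet
        // files written by the streaming job become queryable without reloads.
        spark.sql(
          """
            |CREATE EXTERNAL TABLE IF NOT EXISTS events_parquet (
            |  event STRING
            |)
            |STORED AS PARQUET
            |LOCATION 'hdfs:///data/events/parquet'
          """.stripMargin)

        // The same data is now visible from Hive (beeline) and Spark SQL alike.
        spark.sql("SELECT * FROM events_parquet LIMIT 10").show()
      }
    }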

