Reply
Expert Contributor
Posts: 139
Registered: ‎07-21-2014

Kafka->HDFS pipeline

Is there any documentation on how to use Kafka to write to HDFS? I'm aware of Camus but not sure how to set it up in the CDH environment. It would also be great if you could explain how to consume from Kafka (JSON or other formats) and write to HDFS in Parquet format.

 

Thanks!

Cloudera Employee
Posts: 3
Registered: ‎10-19-2014

Re: Kafka->HDFS pipeline

Hi Buntu,

 

One way of getting data from Kafka to HDFS/HBase is via Flume, i.e. Kafka --> Flume --> HDFS/HBase.

 

You can use Flume's Kafka source to read from Kafka (https://issues.apache.org/jira/browse/FLUME-2250), and then use Flume sinks (e.g., the HDFS sink) to write to HDFS/HBase. Flume's Kafka source is available in CDH 5.2.
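As a rough sketch of what such an agent looks like, here is a minimal Flume configuration wiring a Kafka source through a memory channel to an HDFS sink. The agent name (`tier1`), ZooKeeper address, topic name, and HDFS path are all placeholder assumptions, not values from this thread; property names follow the CDH 5.2-era Kafka source, so check the Flume User Guide for your exact version:

```properties
# Kafka source -> memory channel -> HDFS sink (all names/paths are examples)
tier1.sources  = kafka-source-1
tier1.channels = channel-1
tier1.sinks    = hdfs-sink-1

# Kafka source: reads messages from a Kafka topic via ZooKeeper
tier1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.kafka-source-1.zookeeperConnect = zk01.example.com:2181
tier1.sources.kafka-source-1.topic = my-topic
tier1.sources.kafka-source-1.channels = channel-1

# In-memory channel buffering events between source and sink
tier1.channels.channel-1.type = memory
tier1.channels.channel-1.capacity = 10000

# HDFS sink: writes events as plain files, rolling every 5 minutes
tier1.sinks.hdfs-sink-1.type = hdfs
tier1.sinks.hdfs-sink-1.channel = channel-1
tier1.sinks.hdfs-sink-1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
tier1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
tier1.sinks.hdfs-sink-1.hdfs.rollInterval = 300
```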

 

Flume cannot write directly to Parquet. You can do the conversion to Parquet using the Kite SDK or by following the instructions here
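For the Kite SDK route, one possible approach (a sketch, not from this thread; dataset names, file names, and paths are placeholders, and flag details may vary by Kite version) is to use the `kite-dataset` CLI to create a Parquet-backed dataset and import records into it:

```shell
# 1. Generate an Avro schema for your records (here inferred from a CSV sample)
kite-dataset csv-schema sample.csv --class Event -o event.avsc

# 2. Create a dataset in HDFS whose storage format is Parquet
kite-dataset create dataset:hdfs:/user/me/events \
  --schema event.avsc --format parquet

# 3. Import data; Kite writes it out as Parquet files under the dataset path
kite-dataset csv-import events.csv dataset:hdfs:/user/me/events
```

Flume can also be pointed at a Kite dataset via its DatasetSink, which is another way to land data in a Kite-managed layout.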

 

Cloudera Employee
Posts: 20
Registered: ‎07-08-2013

Re: Kafka->HDFS pipeline

You can find step-by-step instructions on how to configure Flume to read from Kafka in Gwen's "Flume or Kafka? Try both!" blog post.

Expert Contributor
Posts: 139
Registered: ‎07-21-2014

Re: Kafka->HDFS pipeline

Thanks for the info.

 

I couldn't find how to convert to Parquet format using the Kite SDK; any pointers would be helpful.