Question on scheduling a Kafka consumer client in a Hadoop cluster:
I have written a Kafka consumer client that reads messages from a topic and writes them to a local file. I want to schedule this consumer so that it runs continuously on the Hadoop cluster, reading from the topic as messages are published. Can someone explain what the standard way of doing this in a Hadoop cluster is? I have the following approach in mind, but I am not sure it is the usual way. Please let me know your thoughts or suggestions.
(The sample client writes to a file in the local filesystem, but that's just for testing. When I schedule it, I plan to write to an HDFS file and process it later; after some time I plan to write to HBase directly from the Kafka consumer.)
I am thinking of creating an Oozie workflow that calls the consumer client via a java action, and submitting that workflow as many times as the number of consumers I want. I will also change the consumer to write to an HDFS file instead of a local file. (The HDFS filename will have the partition number appended so that two consumers don't try to write to the same file.)
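A minimal sketch of the workflow I have in mind — the workflow name, main class, and arguments are placeholders, not real code:

```xml
<workflow-app name="kafka-consumer-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="consume"/>
    <action name="consume">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- hypothetical consumer main class and arguments -->
            <main-class>com.example.KafkaHdfsConsumer</main-class>
            <arg>${topic}</arg>
            <arg>${outputDir}</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Consumer failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

I would then submit this same workflow N times, once per consumer instance I want running.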
If I follow this approach, the Kafka clients run on YARN, right? So do I have to do anything specific for consumer rebalancing, or will that work properly as usual? I am just subscribing the consumer to topics, not assigning it specific partitions. Please let me know.
And generally, do I have to code the Java client any differently to run it through Oozie? The entire Java client will be launched in a single mapper in my case, correct?
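For reference, the partition-suffixed HDFS filename I mentioned could be built like this — the class name and path layout are just illustrative, not my actual code:

```java
// Sketch: build a per-partition output path so two consumers
// never write to the same HDFS file. Directory layout is made up.
public class KafkaOutputPaths {

    /** Returns a path of the form <baseDir>/<topic>-part-<NNNNN>. */
    public static String outputPath(String baseDir, String topic, int partition) {
        return String.format("%s/%s-part-%05d", baseDir, topic, partition);
    }

    public static void main(String[] args) {
        // e.g. /data/kafka/events-part-00003
        System.out.println(outputPath("/data/kafka", "events", 3));
    }
}
```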
You don't have to code the Java client differently when it runs through Oozie. You could use either the Oozie shell or java action. Oozie will launch the consumer on an arbitrary (single) node as a map-only job. This is usually not a problem, since the Java consumer jar is self-contained and distributable.
The fact that the consumer is run through Oozie/YARN will not make your custom consumer code behave any differently. The balancing of multiple topic consumers in a consumer group is built into the Kafka client, so you don't have to worry about that.
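To illustrate the built-in balancing: consumers that share the same `group.id` divide the topic's partitions among themselves, and Kafka rebalances automatically when instances come and go. A minimal sketch of the relevant configuration — the broker address and group name are made up:

```java
import java.util.Properties;

// Sketch: the consumer-group settings that drive Kafka's automatic
// partition balancing. Values here are illustrative placeholders.
public class ConsumerGroupConfig {

    public static Properties groupProps(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
        props.put("group.id", groupId); // same id => consumers share partitions
        props.put("enable.auto.commit", "true");
        return props;
    }

    // With these properties, a KafkaConsumer that calls
    // consumer.subscribe(...) gets partitions assigned by the group
    // coordinator; consumer.assign(...) would bypass the group and pin
    // partitions manually. Since you subscribe to topics, you get the
    // automatic behavior.
    public static void main(String[] args) {
        System.out.println(groupProps("hdfs-writers").getProperty("group.id"));
    }
}
```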
Thanks for your comments. Could you also let me know whether this is the usual way Kafka consumers are run in Hadoop? If not, how are consumers/producers usually scheduled in a Hadoop cluster?
I don't see any benefit in scheduling it through Oozie/YARN. If it is intended to run continuously, there will hardly be any job dependencies, so it makes sense to run Kafka consumers straight from the shell. Remember that Kafka was developed completely separately from Hadoop, so even today there is little integration between them.
Something is going on with Apache Slider, though: slider_and_kafka could be something for you. It brings long-running applications, Kafka included, under the YARN umbrella. One benefit is that the resource consumption of the Kafka consumer no longer needs to be static, since YARN adds elasticity.