Support Questions
Find answers, ask questions, and share your expertise

Can Kafka be used for events processing?

Super Collaborator

Hi guys,

I wanted to know how well Kafka works for capturing events at various stages of internal data processing. That information could then be used for auditing or reporting purposes. Suppose data consumption has started and I want to know the number of input records processed and the number of records loaded into Hive. In Hive, some enrichment takes place, and I want to know how many records were enriched.

The plan is to eventually load these events into HBase. Message volumes would be very low at this point. I just want to decouple these tasks from the other framework jobs. Please let me know if you have ever come across the idea of using pub/sub in this kind of scenario.


Rising Star

You can use the consumer group information (committed offsets) from Kafka to tell you how much data has been processed. This information is reliable enough to be used for reporting purposes.
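As a rough sketch of how committed offsets translate into a progress/lag report: the topic layout and offset numbers below are made-up illustrations, and the arithmetic stands in for what the broker would report (e.g. via the `kafka-consumer-groups.sh` CLI or the AdminClient API).

```python
# Sketch: deriving "records processed" and remaining lag per partition
# from Kafka consumer-group offsets. In a real deployment these numbers
# come from the broker; here they are hard-coded to show the arithmetic.

def progress_report(committed_offset, log_end_offset, starting_offset=0):
    """Records processed so far and remaining lag for one partition."""
    processed = committed_offset - starting_offset
    lag = log_end_offset - committed_offset
    return {"processed": processed, "lag": lag}

# Hypothetical per-partition offsets for an audit topic.
partitions = {
    0: {"committed": 1500, "end": 1520},
    1: {"committed": 980, "end": 980},
}

report = {p: progress_report(o["committed"], o["end"])
          for p, o in partitions.items()}
total_processed = sum(r["processed"] for r in report.values())
print(report, total_processed)
```

Summing `processed` across partitions gives the consumer group's total throughput so far, which is the figure you would surface in a report.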

Please accept this answer if it helped you.

Super Collaborator

Thanks @Ambud Sharma for your reply. The actual data will not be processed by or sent to the Kafka brokers; it stays in Hive. I want to capture some specific information for each event, such as:

1. when did the job start

2. did it fail due to an exception

3. if not, how many records were processed

I want to send this data to Kafka. I could use simple log4j logging along with Splunk, but I would like to stay within the HDP stack. Please let me know what you think.
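One way to structure such an audit event is a small JSON message covering the three questions above. The field names here are illustrative rather than any standard schema, and the actual publish call (e.g. via kafka-python's `KafkaProducer`) is left as a comment so the sketch stays self-contained:

```python
import json
from datetime import datetime, timezone

def build_audit_event(job_id, started_at, failed,
                      error=None, records_processed=None):
    """Assemble a job-audit event: start time, failure cause, record count."""
    return {
        "job_id": job_id,
        "started_at": started_at.isoformat(),
        "failed": failed,
        "error": error,                      # exception message, if any
        "records_processed": records_processed,
    }

# Hypothetical job name and counts, for illustration only.
event = build_audit_event(
    job_id="hive-enrichment-42",
    started_at=datetime(2017, 6, 1, 12, 0, tzinfo=timezone.utc),
    failed=False,
    records_processed=123456,
)
payload = json.dumps(event)

# With kafka-python one would then publish it, for example:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="broker1:9092")
# producer.send("job-audit", payload.encode("utf-8"))
print(payload)
```

Keeping the event small and self-describing like this makes it easy to land in HBase later, with `job_id` as a natural row-key component.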

Rising Star

You can use a log4j appender that writes directly to Kafka.
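Kafka ships such an appender (`org.apache.kafka.log4jappender.KafkaLog4jAppender`, in the `kafka-log4j-appender` artifact). A minimal `log4j.properties` sketch might look like the following; the broker list and topic name are placeholders for your environment:

```properties
# Route application log events to a Kafka topic via KafkaLog4jAppender.
# brokerList and topic below are placeholders.
log4j.rootLogger=INFO, kafka

log4j.appender.kafka=org.apache.kafka.log4jappender.KafkaLog4jAppender
log4j.appender.kafka.brokerList=broker1:9092
log4j.appender.kafka.topic=job-audit
log4j.appender.kafka.layout=org.apache.log4j.PatternLayout
log4j.appender.kafka.layout.ConversionPattern=%d %-5p %c - %m%n
```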

Another option could be to use an Atlas hook.

Super Collaborator

Hi @Rafael Coss, any comments on this?
