I wanted to know how well Kafka works for capturing events at various stages of internal data processing. That information could be used for auditing or reporting purposes. Suppose data consumption has started and I want to know the number of input records processed and the number of records loaded into Hive. Some enrichment also happens in Hive, and I want to know how many records were enriched.
The plan is to load these events into HBase eventually. Message volumes would be very low at this point. I just want to decouple these tasks from the other framework jobs. Please let me know if you have come across the idea of using pub/sub in this kind of scenario.
You can use the consumer group information (committed offsets) from Kafka to tell you how much data has been processed. This information is reliable enough to be used for reporting purposes.
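To illustrate the offset idea, here is a minimal sketch of the arithmetic only. It assumes you have already collected per-partition offsets for the consumer group, for example from the CURRENT-OFFSET and LOG-END-OFFSET columns printed by `kafka-consumer-groups.sh --describe`. The function names and the sample numbers are made up for illustration.

```python
def records_processed(start_offsets, committed_offsets):
    """Sum, per partition, how far the committed offset advanced since the run began."""
    total = 0
    for partition, committed in committed_offsets.items():
        start = start_offsets.get(partition, 0)
        total += max(committed - start, 0)
    return total


def consumer_lag(committed_offsets, end_offsets):
    """Per partition, how many offsets the group still has to consume."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


# Made-up sample values: partition number -> offset.
start = {0: 100, 1: 250}      # committed offsets when the run began
committed = {0: 180, 1: 300}  # committed offsets now
end = {0: 200, 1: 300}        # log end offsets now

print(records_processed(start, committed))  # 130
print(consumer_lag(committed, end))         # {0: 20, 1: 0}
```

Keep in mind that a committed offset only says the message was consumed and committed, not that downstream processing succeeded. On compacted or transactional topics the offset difference can also be larger than the actual record count.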
Please accept this answer if it helped you.
Thanks @Ambud Sharma for your reply. The actual data will not be processed through or sent to the Kafka brokers. It will stay in Hive only. I want to capture some specific information about each job run, such as:
1. when did the job start
2. did it fail because of an exception
3. if not, how many records were processed
I want to send this information to Kafka. I could use simple log4j logging along with Splunk, but I would like to stay within the HDP stack. Please let me know what you think about this.
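For reference, the three items above could be packed into one small JSON message per job run. This is only a sketch under my own assumptions: the topic name `job-audit-events`, the field names, the job name and the timestamps are all hypothetical, and `send` is a stand-in for whatever producer call you end up using, so the example runs without a broker.

```python
import json

AUDIT_TOPIC = "job-audit-events"  # hypothetical topic name


def build_audit_event(job_name, started_at, records_processed=None, error=None):
    """Describe one job run: start time, failure (if any), record count."""
    failed = error is not None
    return {
        "job": job_name,
        "started_at": started_at,
        "status": "FAILED" if failed else "SUCCEEDED",
        "error": str(error) if failed else None,
        # The record count is only reported for runs that did not fail.
        "records_processed": None if failed else records_processed,
    }


def publish(event, send):
    """Serialize the event and pass it to send(topic, key_bytes, value_bytes)."""
    key = event["job"].encode("utf-8")
    value = json.dumps(event, sort_keys=True).encode("utf-8")
    send(AUDIT_TOPIC, key, value)


# In-memory stand-in for a real producer, so the sketch runs without Kafka.
sent = []


def fake_send(topic, key, value):
    sent.append((topic, key, value))


# Made-up job name and timestamps, for illustration only.
publish(build_audit_event("hive_enrichment", "2017-01-01T00:00:00Z",
                          records_processed=1200), fake_send)
publish(build_audit_event("hive_enrichment", "2017-01-01T01:00:00Z",
                          error=RuntimeError("table not found")), fake_send)

for topic, key, value in sent:
    print(topic, value.decode("utf-8"))
```

With the kafka-python client, `send` could wrap something like `producer.send(topic, key=key, value=value)` on a `KafkaProducer`. Keying by job name keeps all events for one job in the same partition, so they stay in order for whatever later loads them into HBase.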