I am trying to enrich syslog data at scale with NiFi. I would like to look up the source IP address in each syslog message against a table to retrieve a hostname.
The lookup could either be a DNS query or a mapping file that I maintain. The key requirement is that the solution scales to millions of messages a day. A DNS query for every single syslog message does not seem feasible unless the results are cached, and caching doesn't seem to be part of the processor.
I also considered using ReplaceTextWithMapping and maintaining a table myself. But this doesn't seem to work: it does not match correctly even with a simple regex, and it seems you can only replace the text you match directly. You cannot easily insert into another part of the message.
Is there an approach I am missing based on my lack of understanding?
Hi @Omid Krabbe
You could point the NiFi processor at a caching DNS server that forwards to the upstream nameservers you want to use.
This would allow you to benefit from caching without needing to implement it yourself. If your volume is high enough, you could even run dnsmasq on each node and have each NiFi instance use its local dnsmasq.
Another alternative would be to use the PutDistributedMapCache and FetchDistributedMapCache processors to cache the lookups.
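To illustrate the caching idea independently of NiFi, here is a minimal Python sketch of a look-aside cache in front of reverse DNS. It is not NiFi code and does not use the DistributedMapCache API; the class name `HostnameCache`, the `reverse_dns` helper, and the one-hour TTL are all assumptions made for the example. In a flow, the fetch step would play the role of the dictionary read and the put step the role of the dictionary write.

```python
import socket
import time
from typing import Callable, Dict, Optional, Tuple


def reverse_dns(ip: str) -> Optional[str]:
    """Resolve an IP to a hostname; return None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are subclasses
        return None


class HostnameCache:
    """Hypothetical look-aside cache: check the cache, resolve only on a miss."""

    def __init__(
        self,
        resolver: Callable[[str], Optional[str]] = reverse_dns,
        ttl_seconds: float = 3600.0,  # assumed TTL, tune to your environment
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolver = resolver
        self._ttl = ttl_seconds
        self._clock = clock
        # ip -> (expiry time, hostname or None for a failed lookup)
        self._entries: Dict[str, Tuple[float, Optional[str]]] = {}

    def lookup(self, ip: str) -> Optional[str]:
        now = self._clock()
        entry = self._entries.get(ip)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit, no DNS query
        hostname = self._resolver(ip)  # cache miss or expired entry
        # Failed lookups are cached too, so unknown IPs are not re-queried
        # on every message.
        self._entries[ip] = (now + self._ttl, hostname)
        return hostname
```

With this pattern the number of DNS queries depends on the number of distinct source IPs per TTL window rather than on the number of messages. The dictionary here is unbounded, so a real deployment would probably also want a size limit or eviction policy.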
Thanks for the answer, Bryan. How big would a cluster have to be to handle, say, 200 million events? I was just thinking that if it were possible to maintain some sort of in-memory cache on each instance, it would be dramatically more efficient.
In addition to the processors pointed out by @brosander, there is also the QueryDNS processor which may be of use.