Created 06-27-2017 04:30 PM
We need to parse a real-time feed of XML documents.
Using Storm, what is the best approach to process a real-time XML feed in an input parser bolt? JAXB?
I'm looking for best practices, both what works and what doesn't, with pros and cons.
Kinesis > Input XML Parser Bolt > Other Bolts
Created 06-27-2017 07:31 PM
@Jon Maestas. How big are the messages?
Created 06-27-2017 07:35 PM
Hi @Scott Shaw,
The XML documents are 50-75 KB each (about 700 elements).
Created 06-27-2017 07:41 PM
We receive around 40 messages per sec in this topology.
Created 06-27-2017 07:40 PM
@Jon Maestas and also what is the arrival rate?
Created 06-27-2017 08:16 PM
We receive around 40 messages per sec.
Created 06-27-2017 08:02 PM
JAXB might be a little heavy in this case, since it maps parsed XML constructs to Java objects and fields that you may not even need (it's hard to tell without a sample and without knowing what data you are extracting).
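You may want to try a more lightweight parser like SAX or DOM. SAX streams the document as events and keeps memory use low; DOM builds the whole tree in memory but still skips the object binding.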
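To illustrate the lightweight route, here is a minimal SAX sketch that pulls out only the elements you care about and ignores the rest. The class name `SaxFieldExtractor`, the method `extract`, and the element names in the usage note are hypothetical; it also assumes the wanted elements are leaf elements with text content, which may not match your feed.

```java
import java.io.ByteArrayInputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not part of Storm or of the original question.
class SaxFieldExtractor {

    /** Returns the trimmed text of the first occurrence of each wanted leaf element. */
    static Map<String, String> extract(byte[] xml, Set<String> wanted) {
        Map<String, String> out = new HashMap<>();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml), new DefaultHandler() {
                // Non-null only while we are inside a wanted element.
                private StringBuilder text;

                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    boolean capture = wanted.contains(qName) && !out.containsKey(qName);
                    text = capture ? new StringBuilder() : null;
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    // May be called several times for one text node, so append.
                    if (text != null) text.append(ch, start, length);
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if (text != null) {
                        out.put(qName, text.toString().trim());
                        text = null;
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("XML parse failed", e);
        }
        return out;
    }
}
```

Calling `SaxFieldExtractor.extract(payload, Set.of("id", "status"))` would return a map with just those two fields, without building objects for the other ~700 elements.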
The message size and arrival rate you mention should be trivial for Storm to handle. But be aware that those "Other Bolts" in your topology will have an impact on the overall throughput of the topology depending on what they do, and what other systems they interact with.
I would suggest testing your XML parser bolt in isolation to measure performance (i.e. remove the other bolts from the topology for performance testing). That will give you a better idea of the performance of the XML parser.
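As a complementary first step, you could also time the parsing code on its own, outside Storm, before running the bolt in a stripped-down topology. The sketch below is a rough harness, not a rigorous benchmark; `ParserBenchmark` and `messagesPerSecond` are hypothetical names, and the `parse` argument stands in for whatever parsing function you end up with.

```java
import java.util.function.Consumer;

// Hypothetical timing harness: a rough measurement, not a rigorous benchmark.
class ParserBenchmark {

    /**
     * Parses the same sample message repeatedly and returns the observed
     * single-threaded rate in messages per second.
     */
    static double messagesPerSecond(Consumer<byte[]> parse, byte[] sample,
                                    int warmup, int iterations) {
        // Warm-up runs give the JIT a chance to compile the hot path first.
        for (int i = 0; i < warmup; i++) {
            parse.accept(sample);
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            parse.accept(sample);
        }
        // Guard against a zero elapsed time on very fast runs.
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        return iterations * 1_000_000_000.0 / elapsedNanos;
    }
}
```

If the rate for a representative 50-75 KB sample comes out far above the 40 messages per second you mentioned, the parser itself is unlikely to be the bottleneck, and attention can shift to the downstream bolts.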
Created 06-28-2017 12:33 AM
Also, if the parser you are using is thread-safe and reusable, set it up in the bolt's prepare() method. Some parsers, while expensive to set up, are fully thread-safe and can be shared across multiple threads.
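Here is a rough sketch of the set-up-once pattern, written as a plain class so it runs without Storm on the classpath. `XmlParsingStep`, its `prepare()`, and `rootName()` are hypothetical stand-ins: in a real bolt the first body would go in the bolt's prepare method and the second would be called from execute. It uses the JDK's `DocumentBuilder`, which is reusable but not guaranteed to be thread-safe, so the sketch keeps one per instance rather than sharing it across threads.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical stand-in for a bolt; it does not extend any Storm class.
class XmlParsingStep {

    // Created once in prepare() and reused for every message.
    // transient so the field is skipped if the instance is serialized.
    private transient DocumentBuilder builder;

    /** One-time, comparatively expensive setup. */
    void prepare() {
        try {
            builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create XML parser", e);
        }
    }

    /** Per-message work: parses the payload and returns the root element name. */
    String rootName(byte[] xml) {
        try {
            builder.reset(); // return the builder to its initial configuration
            Document doc = builder.parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("XML parse failed", e);
        }
    }
}
```

The point of the split is that factory lookup and parser construction happen once per instance instead of once per message; only the parse itself is paid 40 times a second.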