We need to parse a real-time feed of XML documents.
Using Storm, what is the best approach to process a real-time XML feed in an input parser bolt? JAXB?
I'm looking for best practices, including what to avoid, and the pros and cons of each option.
Kinesis > Input XML Parser Bolt > Other Bolts
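JAXB might be a little heavy in this case, since it maps parsed XML constructs to Java objects and fields that you may not even need (it is hard to tell without a sample, or without knowing what data you are extracting).
You may want to try a lighter-weight parser such as DOM or SAX. Of the two, SAX is the leaner option: it streams parse events to your handler, whereas DOM builds the whole document tree in memory.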
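To give a feel for how little code a SAX-based extraction needs, here is a minimal sketch using only the JDK's built-in parser. The class name `SaxFieldExtractor`, the `extract` helper, and the idea of pulling out the text of a single named element are all assumptions for illustration, since I don't know what your feed looks like.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text of every element with a given name.
class SaxFieldExtractor {

    static final class FieldHandler extends DefaultHandler {
        private final String target;
        private final List<String> values = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a target element

        FieldHandler(String target) {
            this.target = target;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (target.equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so append rather than assign.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current != null && target.equals(qName)) {
                values.add(current.toString().trim());
                current = null;
            }
        }

        List<String> values() {
            return values;
        }
    }

    static List<String> extract(String xml, String elementName) throws Exception {
        // The default factory is not namespace-aware, so qName is the raw tag name.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        FieldHandler handler = new FieldHandler(elementName);
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.values();
    }
}
```

For example, `SaxFieldExtractor.extract("<order><id>42</id><note>x</note></order>", "id")` returns a list containing only `"42"`; everything else in the document is skipped without being materialized as objects.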
The message size and arrival rate you mention should be trivial for Storm to handle. But be aware that those "Other Bolts" in your topology will have an impact on the overall throughput of the topology depending on what they do, and what other systems they interact with.
I would suggest testing your XML parser bolt in isolation (i.e. remove the other bolts from the topology for performance testing). That will tell you how the parser itself performs, separate from whatever the downstream bolts cost.
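An even cheaper first step, before touching the topology, is to time the parse call on its own outside Storm. The sketch below is a rough harness, not a proper benchmark (a tool like JMH would be more rigorous); the class name `ParserTiming`, the `docsPerSecond` method, and the choice of a no-op SAX handler are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical timing harness: parses the same sample message repeatedly.
class ParserTiming {

    static double docsPerSecond(String xml, int iterations) throws Exception {
        byte[] payload = xml.getBytes(StandardCharsets.UTF_8);
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DefaultHandler handler = new DefaultHandler(); // no-op: measures raw parsing only

        // Warm-up pass so JIT compilation does not dominate the measurement.
        for (int i = 0; i < iterations; i++) {
            parser.reset();
            parser.parse(new ByteArrayInputStream(payload), handler);
        }

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            parser.reset();
            parser.parse(new ByteArrayInputStream(payload), handler);
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);

        return iterations * 1_000_000_000.0 / elapsedNanos;
    }
}
```

Run it with a representative message from your feed and swap in your real handler to see how much of the cost is parsing versus your own extraction logic. The number it prints is only a ceiling for the bolt, since it excludes tuple handling and acking.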
Also, if the parser you are using is thread-safe and reusable, set it up in the prepare() method of the bolt. Some parsers, while expensive to set up, are fully thread-safe and can be shared across multiple threads.
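The shape of that "create once, reuse per tuple" pattern looks roughly like this. To keep the sketch runnable without Storm on the classpath, it is a plain class whose `prepare()` and `execute(String)` methods only stand in for the bolt's lifecycle methods; in a real bolt they would take Storm's configuration, context, collector, and tuple arguments. The class name `XmlParserBoltSketch` and returning the root tag name are assumptions for illustration. Note that `DocumentBuilder` is reusable but not thread-safe, so here each bolt instance keeps its own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Stand-in for a Storm bolt: same lifecycle shape, no Storm dependency.
class XmlParserBoltSketch {

    // transient: in a real bolt this would be built on the worker in prepare(),
    // not serialized along with the bolt when the topology is submitted.
    private transient DocumentBuilder builder;

    // Mirrors the bolt's prepare(): do the expensive setup exactly once.
    void prepare() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        builder = factory.newDocumentBuilder();
    }

    // Mirrors the bolt's execute(): reuse the prepared parser for each message.
    String execute(String xml) throws Exception {
        builder.reset(); // restore the builder to its initial configuration
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTagName();
    }
}
```

Doing the setup in prepare() rather than in the constructor also avoids trouble with parsers that are not serializable, since the bolt object is serialized when the topology is deployed.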