Storm XML parser

Contributor

We need to parse a real-time feed of XML documents.

Using Storm, what is the best approach to processing a real-time XML feed in an input parser bolt? Is JAXB a good fit?

We are looking for best practices, good and bad, along with the pros and cons of each approach.

Kinesis > Input XML Parser Bolt > Other Bolts

7 REPLIES

Re: Storm XML parser

@Jon Maestas. How big are the messages?

Re: Storm XML parser

Contributor

Hi @Scott Shaw,

The XML documents are 50-75 KB each (700 elements).

Re: Storm XML parser

Contributor

We receive around 40 messages per sec in this topology.

Re: Storm XML parser

@Jon Maestas and also what is the arrival rate?

Re: Storm XML parser

Contributor

We receive around 40 messages per sec.

Re: Storm XML parser

New Contributor

JAXB might be a little heavy in this case, since it maps parsed XML constructs to Java objects and fields that you may not even need (hard to tell without a sample or knowing what data you are extracting).

You may want to try a more lightweight parsing API like DOM or SAX. SAX is the lighter of the two, since it streams parse events instead of building the whole document tree in memory.
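
To illustrate the SAX suggestion, here is a minimal sketch that pulls out only the fields a downstream bolt might need and ignores the rest of the document. The class name `SaxFieldExtractor`, the element names, and the sample XML are all hypothetical, and the handler assumes the wanted elements are leaf elements that contain only text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class SaxFieldExtractor {

    /** Collects the text of the wanted leaf elements and skips everything else. */
    static final class FieldHandler extends DefaultHandler {
        private final Set<String> wanted;
        private final Map<String, String> fields = new HashMap<>();
        private final StringBuilder text = new StringBuilder();
        private boolean capturing;

        FieldHandler(Set<String> wanted) {
            this.wanted = wanted;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            // The default factory is not namespace-aware, so qName is the element name.
            capturing = wanted.contains(qName);
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append rather than assign.
            if (capturing) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (capturing && wanted.contains(qName)) {
                fields.put(qName, text.toString().trim());
            }
            capturing = false;
        }

        Map<String, String> fields() {
            return fields;
        }
    }

    /** Parses one message and returns only the wanted fields. */
    static Map<String, String> extract(byte[] xml, Set<String> wanted) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            FieldHandler handler = new FieldHandler(wanted);
            parser.parse(new ByteArrayInputStream(xml), handler);
            return handler.fields();
        } catch (Exception e) {
            throw new IllegalArgumentException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample message; the real feed layout is not known here.
        String xml = "<order><id>42</id><customer>acme</customer><note>ignored</note></order>";
        Map<String, String> fields =
                extract(xml.getBytes(StandardCharsets.UTF_8), Set.of("id", "customer"));
        System.out.println(fields);
    }
}
```

Because nothing is mapped to Java objects up front, the cost is roughly proportional to the document size rather than to the size of a generated object model. Whether that matters at this message rate is something only a measurement will tell you.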

The message size and arrival rate you mention should be trivial for Storm to handle. But be aware that those "Other Bolts" in your topology will have an impact on the overall throughput of the topology depending on what they do, and what other systems they interact with.

I would suggest testing your XML parser bolt in isolation to measure performance (i.e. remove the other bolts from the topology for performance testing). That will give you a better idea of the performance of the XML parser.
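
Before wiring anything into a topology, you can get a rough number for the parser alone with a plain timing loop like the sketch below. Everything here is an assumption for illustration: the class name `ParserThroughputCheck`, the synthetic document shape, and the iteration counts. It is a wall-clock estimate, not a rigorous benchmark, so treat the result as an order-of-magnitude figure and use a captured sample of your real feed if you can.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class ParserThroughputCheck {

    /** Counts elements so that each parse does some observable work. */
    static final class CountingHandler extends DefaultHandler {
        int elements;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            elements++;
        }
    }

    /** Builds a synthetic document with `count` child elements (hypothetical shape). */
    static byte[] syntheticDocument(int count) {
        StringBuilder sb = new StringBuilder("<feed>");
        for (int i = 0; i < count; i++) {
            sb.append("<item id=\"").append(i).append("\">value-").append(i).append("</item>");
        }
        return sb.append("</feed>").toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Parses the document `iterations` times and returns documents per second. */
    static double measure(byte[] xml, int iterations) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                parser.parse(new ByteArrayInputStream(xml), new CountingHandler());
            }
            long elapsedNanos = Math.max(1, System.nanoTime() - start);
            return iterations * 1_000_000_000.0 / elapsedNanos;
        } catch (Exception e) {
            throw new IllegalStateException("measurement failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = syntheticDocument(700);
        measure(xml, 200); // warm-up pass so JIT compilation does not skew the result
        System.out.printf("%.0f documents/sec%n", measure(xml, 2000));
    }
}
```

If the figure comes out far above 40 messages per second, the parser is unlikely to be your bottleneck and the other bolts deserve the attention.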

Re: Storm XML parser

New Contributor

Also, if the parser you are using is thread-safe and reusable, set it up in the bolt's prepare() method. Some parsers, while expensive to set up, are fully thread-safe and can be shared across multiple threads.
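
The set-up-once idea can be sketched as follows. To keep the example runnable without Storm on the classpath, `XmlParserBoltSketch` is a hypothetical plain class whose `prepare()` and `execute()` methods only stand in for the corresponding bolt lifecycle methods; in a real topology the class would extend a Storm bolt base class such as `BaseRichBolt` and use Storm's own method signatures. Note that a JAXP `SAXParser` is not guaranteed to be safe for concurrent use, so this sketch keeps one parser per bolt instance rather than sharing it across threads.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class XmlParserBoltSketch {

    // transient: created in prepare() instead of being serialized with the bolt.
    private transient SAXParser parser;

    /** Stand-in for the bolt's prepare(): the expensive set-up happens once here. */
    void prepare() {
        try {
            parser = SAXParserFactory.newInstance().newSAXParser();
        } catch (Exception e) {
            throw new IllegalStateException("could not create SAX parser", e);
        }
    }

    /** Stand-in for execute(): reuses the parser and returns the element count. */
    int execute(byte[] xml) {
        int[] count = {0};
        try {
            parser.parse(new ByteArrayInputStream(xml), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    count[0]++;
                }
            });
        } catch (Exception e) {
            throw new IllegalArgumentException("malformed XML", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        XmlParserBoltSketch bolt = new XmlParserBoltSketch();
        bolt.prepare(); // once per bolt instance
        // Hypothetical messages; each call reuses the same parser.
        System.out.println(bolt.execute("<a><b/><c/></a>".getBytes(StandardCharsets.UTF_8)));
        System.out.println(bolt.execute("<a/>".getBytes(StandardCharsets.UTF_8)));
    }
}
```

If you do go with JAXB, the same pattern applies: build the expensive, shareable context object once in prepare() and create the cheaper per-use objects from it as needed.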