Created 08-03-2016 04:31 PM
Design-wise, I've been having a theoretical debate with another developer over an architecture using Spark and Kafka for a data processor.
In my mind, I prefer a model where data comes into a Kafka queue, gets picked up by a dispatcher service (Spark), and is distributed into other Kafka queues based on the content delivered. He prefers a single Kafka queue, where the Spark application does all of the extraction/disassembly of the data itself.
My argument for multiple queues is that I can divide up multi-type data (emails, for example) and distribute it into separate queues to be more efficient. He feels that a single Spark application can do that more efficiently, given Spark's distributed model.
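For concreteness, here is a rough sketch of the dispatcher model I have in mind, written against the plain Kafka client API. The topic names and the routing rule are made up for illustration, not our actual code:

```scala
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object Dispatcher {
  def main(args: Array[String]): Unit = {
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "localhost:9092")
    consumerProps.put("group.id", "dispatcher")
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "localhost:9092")
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val consumer = new KafkaConsumer[String, String](consumerProps)
    val producer = new KafkaProducer[String, String](producerProps)
    consumer.subscribe(Collections.singletonList("raw-ingest"))

    while (true) {
      // Pull a batch from the single ingest queue
      val records = consumer.poll(1000L)
      for (record <- records.asScala) {
        // Route by content type -- made-up rule, real logic would inspect the payload properly
        val target =
          if (record.value().contains("\"type\":\"email\"")) "emails-raw" else "documents-raw"
        producer.send(new ProducerRecord[String, String](target, record.key(), record.value()))
      }
    }
  }
}
```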
Which one is better? Is one better?
Created 08-03-2016 04:44 PM
Looking purely at performance, he is definitely correct if you have a single application: you essentially get rid of the overhead of the second Kafka hop.
However, there are situations where a second set of queues can be helpful:
You decouple parsing and cleansing from analysis, which has big advantages:
- You can have one parser app and multiple analytical applications that you can start and stop as you please without impacting the parser or the other analytical apps.
- You can write simple analytical applications that consume only the parsed, cleaned subset of data they are interested in, so people get the data they actually want and don't have to worry about the initial parsing/cleansing, etc. (see the sketch below).
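To illustrate that second point, here is a minimal sketch of such a stand-alone analytical app, using the Spark Streaming direct Kafka stream. The topic name is hypothetical and the "analysis" is just a per-batch count; the point is that it can be started and stopped without touching the parser app:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object EmailCounter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EmailCounter")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Direct stream over only the cleaned topic this app cares about
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("emails-parsed"))

    // Trivial "analysis": count cleaned e-mails per batch
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```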
Created 08-03-2016 04:51 PM
We have multiple applications that need to chew on the data, but we also have several processes to normalize the data, which have to run depending on the source data format. For example, a document feed needs to hit Tika, but a news feed (RSS content) does not. We also have to process e-mails, and an e-mail could go down the path of needing Tika if attachments are present. Based on that logic, do you still feel a single queue would suffice, with all of that decision tree/disassembly living in a single Spark app for efficiency?
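To make the decision tree concrete, this is roughly the routing I am describing; the types and the path names are purely illustrative:

```scala
// Illustrative source types and the routing logic described above
sealed trait Feed
case class DocumentFeed(bytes: Array[Byte])                        extends Feed
case class NewsFeed(rssXml: String)                                extends Feed
case class EmailFeed(body: String, attachments: Seq[Array[Byte]])  extends Feed

object Router {
  // Decide which normalization path (or downstream topic) a record takes
  def route(feed: Feed): String = feed match {
    case _: DocumentFeed                        => "needs-tika"      // binary documents go through Tika
    case _: NewsFeed                            => "normalize-rss"   // RSS is already text, skip Tika
    case e: EmailFeed if e.attachments.nonEmpty => "needs-tika"      // attachments go through Tika
    case _: EmailFeed                           => "normalize-email" // plain e-mail body
  }
}
```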
Created 08-04-2016 10:22 AM
Puh, I think you need to make that decision yourself. A single big application will always be more efficient, i.e. faster. You can also modularize a Spark project so that working on a single task doesn't change the code of the others.
However, it becomes complex, and, as said, you need to stop/start the application whenever you make a change to any part of it. Also, if you use something as CPU-heavy as Tika, the overhead of a separate topic in the middle doesn't sound too big anymore. So I think I would also strive for something like:
Input sources -> parsing, Tika, normalization -> Kafka with a normalized document format (JSON?) -> analytical applications.
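As a rough sketch of that middle step: a made-up normalized document, serialized to JSON and published to a Kafka topic. The schema, the topic name, and the hand-rolled JSON are illustrative only; a real app would use a proper JSON library:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical normalized document every source gets mapped to after parsing/Tika
case class NormalizedDoc(id: String, source: String, title: String, body: String)

object Normalizer {
  // String interpolation keeps the sketch dependency-free; use Jackson/json4s in practice
  def toJson(d: NormalizedDoc): String =
    s"""{"id":"${d.id}","source":"${d.source}","title":"${d.title}","body":"${d.body}"}"""

  def publish(producer: KafkaProducer[String, String], d: NormalizedDoc): Unit =
    producer.send(new ProducerRecord[String, String]("normalized-docs", d.id, toJson(d)))

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    publish(producer, NormalizedDoc("42", "rss", "Example headline", "Example body text"))
    producer.close()
  }
}
```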
But it's hard to give an intelligent answer from a mile away 🙂.
Created 08-04-2016 12:18 PM
Thanks, the mile-away view was all I needed, but you also put another model into my head: doing the parsing/normalization before the data even goes into Kafka.