I am weighing two architectures for running analytics against email content (fed in as EML files), and I would appreciate some feedback.
Agent (windows) --> Receiver (Linux) --> Tika --> Kafka --> Spark --> entity extraction
Agent (windows) --> Receiver (Linux) --> Kafka --> Spark --> Tika --> entity extraction
My feeling, and correct me if I am wrong, is that if I do the text extraction with Tika first and split each EML into its individual components, those smaller messages will move through Kafka's queue more efficiently for Spark to consume. If Tika runs afterwards (inside Spark), I suspect the block processing model would get hung up on extraction and slow down the rest of the pipeline. A developer I spoke to thinks it is the other way around because of Spark's processing model in the cluster, but my understanding is that Tika doesn't operate within the distributed processing cluster no matter what.
Please let me know if I am incorrect or if the models above are not good.
NiFi might be a good fit as well, depending on how big you expect the email messages to be. If the messages are basically text with no attachments, Kafka should be fine, but large attachments could cause performance issues and may need special handling.
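One way to picture the "special handling" for large attachments is to keep small parts inline and replace big ones with a reference to external storage before anything is queued. The sketch below is only an illustration under assumptions: it uses Python's standard-library email parser rather than Tika or NiFi, the names `build_records`, `MAX_INLINE_BYTES`, and `blob_dir` are hypothetical, and the 512 KB threshold is an arbitrary value chosen to stay under Kafka's default broker message limit of roughly 1 MB. Nothing is actually sent to Kafka here.

```python
import hashlib
from email import message_from_bytes, policy
from pathlib import Path

# Hypothetical threshold; tune it against your broker's configured message size limit.
MAX_INLINE_BYTES = 512 * 1024


def build_records(raw_eml: bytes, blob_dir: Path, max_inline: int = MAX_INLINE_BYTES):
    """Split one EML into per-part records, spilling large parts to disk."""
    msg = message_from_bytes(raw_eml, policy=policy.default)
    message_id = msg.get("Message-ID")
    records = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        record = {
            "message_id": str(message_id) if message_id is not None else None,
            "content_type": part.get_content_type(),
            "filename": part.get_filename(),
            "size": len(payload),
        }
        if len(payload) <= max_inline:
            # Small enough to travel inside the queued message.
            record["payload"] = payload
        else:
            # Too large: store it elsewhere and queue only a pointer to it.
            path = blob_dir / hashlib.sha256(payload).hexdigest()
            path.write_bytes(payload)
            record["payload_ref"] = str(path)
        records.append(record)
    return records
```

Each returned record could then be serialized and published as its own message. In a real deployment `blob_dir` would more likely be HDFS or an object store than a local directory.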
NiFi is a component of Hortonworks DataFlow.
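Tika is a Java library that can be embedded in Spark Streaming, Storm, MapReduce, or Spark on HDFS, so the extraction can run on the cluster's workers rather than only as a separate step in front of Kafka.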
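To make the "embedded" idea concrete, here is a rough sketch of the shape such code tends to take: a per-message extraction function, plus a generator that applies it to a batch of raw messages. To keep it runnable without a JVM or a cluster, it uses Python's standard-library email parser as a stand-in for Tika, and plain iterables as a stand-in for a Spark partition. The function names `extract_components` and `process_partition` are hypothetical.

```python
from email import message_from_bytes, policy


def extract_components(raw_eml: bytes) -> dict:
    """Parse one raw EML and return its headers, body text, and attachment metadata."""
    msg = message_from_bytes(raw_eml, policy=policy.default)
    # Prefer the plain-text body, falling back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    attachments = [
        {"filename": a.get_filename(), "content_type": a.get_content_type()}
        for a in msg.iter_attachments()
    ]
    return {
        "subject": str(msg.get("Subject", "")),
        "from": str(msg.get("From", "")),
        "to": str(msg.get("To", "")),
        "body": body.get_content() if body is not None else "",
        "attachments": attachments,
    }


def process_partition(raw_messages):
    """Apply the extractor to every raw message in one batch (partition)."""
    for raw in raw_messages:
        yield extract_components(raw)
```

In PySpark a generator like `process_partition` is the kind of function you would hand to `rdd.mapPartitions`, so each worker parses its own share of the messages. With Tika itself you would do the equivalent in Java or Scala, typically creating the parser once per partition rather than once per message.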