
Operating model opinion


I am debating a couple of different architectures for running analytics on e-mail content (fed in as EML files), and I would like some feedback.

Model 1:

Agent (windows) --> Receiver (Linux) --> Tika --> Kafka --> Spark --> entity extraction

Model 2:

Agent (windows) --> Receiver (Linux) --> Kafka --> Spark --> Tika --> entity extraction

My feeling, and correct me if I am wrong, is that if I do the text extraction with Tika first and segment each EML into its individual components, the smaller messages would move through Kafka's queue better for Spark to consume. If Tika runs afterwards (inside Spark), I suspect the block processing model would get hung up on extraction and slow the rest of the pipeline down. A developer I spoke to thinks it is the other way around because of Spark's processing model in the cluster, but my understanding is that Tika doesn't operate within the distributed processing cluster no matter what.
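To make "segment the EML into the individual components" concrete, here is a minimal sketch of the splitting step in Model 1. It uses Python's standard `email` package purely as a stand-in for Tika, and the function name `split_eml` and the record fields are hypothetical, not part of either design. Each returned record is small enough to be published as its own Kafka message.

```python
from email import message_from_bytes, policy


def split_eml(raw_bytes):
    """Split one raw EML message into a list of small per-part records.

    Stand-in for the Tika step: the stdlib parser is used here only to
    illustrate the shape of the output, not Tika's actual behaviour.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    message_id = str(msg.get("Message-ID", ""))

    # One record for the headers that entity extraction usually cares about.
    records = [{
        "message_id": message_id,
        "kind": "headers",
        "from": str(msg.get("From", "")),
        "to": str(msg.get("To", "")),
        "subject": str(msg.get("Subject", "")),
    }]

    # One record per leaf part: text bodies carry their text,
    # attachments carry only metadata (name, type, size).
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.is_attachment():
            payload = part.get_payload(decode=True) or b""
            records.append({
                "message_id": message_id,
                "kind": "attachment",
                "filename": part.get_filename(),
                "content_type": part.get_content_type(),
                "size": len(payload),
            })
        elif part.get_content_maintype() == "text":
            records.append({
                "message_id": message_id,
                "kind": "body",
                "content_type": part.get_content_type(),
                "text": part.get_content(),
            })
    return records
```

Keeping `message_id` on every record is what would let a downstream Spark job regroup the parts of one e-mail after they have travelled through Kafka separately.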

Please let me know if I am incorrect or if the models above are not good.


Re: Operating model opinion

NiFi might be a good fit as well, depending on how big you expect the email messages to be. If the messages are basically text with no attachments, Kafka should be fine, but large attachments could cause performance issues and may need special handling.

http://www.slideshare.net/JiangjieQin/handle-large-messages-in-apache-kafka-58692297
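One common form of "special handling" is to keep small payloads inline and replace large ones with a reference to external storage. The sketch below illustrates that idea under stated assumptions: the helper `to_kafka_value`, the threshold `MAX_INLINE_BYTES`, and the dict used as `blob_store` (standing in for HDFS or an object store) are all hypothetical. Kafka brokers limit message size to roughly 1 MB by default, so the threshold is set well below that to leave room for base64 overhead.

```python
import base64
import hashlib
import json

# Assumption: stay well under the broker's default message size limit
# (about 1 MB); base64 encoding inflates the payload by about a third.
MAX_INLINE_BYTES = 512 * 1024


def to_kafka_value(message_id, payload, blob_store, max_inline=MAX_INLINE_BYTES):
    """Build the bytes to publish for one e-mail part.

    Small payloads travel inline (base64-encoded). Large payloads are
    written to blob_store and only a reference goes through Kafka.
    """
    if len(payload) <= max_inline:
        record = {
            "message_id": message_id,
            "mode": "inline",
            "data": base64.b64encode(payload).decode("ascii"),
        }
    else:
        # A content hash is used as the storage key, so identical
        # attachments are stored only once.
        key = hashlib.sha256(payload).hexdigest()
        blob_store[key] = payload
        record = {
            "message_id": message_id,
            "mode": "reference",
            "ref": key,
            "size": len(payload),
        }
    return json.dumps(record).encode("utf-8")
```

The consumer side would check `mode` and fetch the payload from the store only when it actually needs the attachment bytes.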

NiFi is a component of Hortonworks DataFlow (HDF):

http://hortonworks.com/products/data-center/hdf/

Tika is a Java library that can be embedded in Spark Streaming, Storm, MapReduce, or Spark on HDFS, so extraction can run inside the cluster's own tasks.

http://events.linuxfoundation.org/sites/events/files/slides/ACNA15_Mattmann_Tika_Video2.pdf
