Streaming tweets with Kafka or not?

I'm a newbie in the Hadoop ecosystem.
I want to do a project where I stream some tweets and analyze them in Hive. The whole process has to be done in HDF/NiFi, and the project must be scalable.
I saw here that people adopt two different flow strategies:

1.) Get the tweets ---> Put them into HDFS ---> Analyze with Hive

2.) Get the tweets ---> Stream with Kafka (publish/consume) ---> Put them into HDFS ---> Analyze with Hive

So my question is: what's the difference? Is the first strategy not scalable?
Which strategy would you follow?
Thank you.


@Ivan_M93 Great question!!!


I personally prefer the Kafka method. It lets NiFi scale independently of Kafka (assuming they are not in the same cluster). It also decouples the data from HDFS (which is usually separate from NiFi as well) at the point of ingestion.


With this method you have plenty of options for decoupling the processing after ingestion, and you can take advantage of Kafka's basic pub/sub model to avoid processing duplicate data.
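To make the pub/sub point a bit more concrete, here is a minimal in-memory sketch of the idea. It is not the Kafka client API and not a NiFi flow: `ToyTopic`, `publish`, `poll`, and the group names `hdfs-writer` and `analytics` are all hypothetical names made up for illustration. The only thing it models is that each consumer group keeps its own read position in the log, so one group does not re-read records it has already taken, while a second group can read the same data independently.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a single-partition topic (illustration only, not Kafka)."""

    def __init__(self):
        self._log = []                      # append-only list of records
        self._committed = defaultdict(int)  # consumer group -> next offset to read

    def publish(self, record):
        """Append a record to the log and return its offset."""
        self._log.append(record)
        return len(self._log) - 1

    def poll(self, group, max_records=10):
        """Return up to max_records unread records for `group` and advance its offset."""
        start = self._committed[group]
        batch = self._log[start:start + max_records]
        self._committed[group] = start + len(batch)
        return batch


topic = ToyTopic()
for text in ["tweet a", "tweet b", "tweet c"]:
    topic.publish(text)

# The hypothetical "hdfs-writer" group reads in two polls and never sees a record twice.
hdfs_first = topic.poll("hdfs-writer", max_records=2)
hdfs_second = topic.poll("hdfs-writer")
hdfs_third = topic.poll("hdfs-writer")   # nothing new has been published

# A second, independent group still gets the full stream from the beginning.
analytics_all = topic.poll("analytics")
```

Keep in mind this is a simplification. In a real deployment a consumer can fail after reading but before committing its position, so duplicates are still possible unless the downstream write is idempotent or otherwise deduplicated.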


That said, the other option is arguably just as good, since it can be done without Kafka. The decision then comes down to which components you have available and which ones your team is comfortable with.
