I'm a newbie in the Hadoop ecosystem. I want to build a project where I stream tweets and analyze them in Hive, with the whole pipeline running in HDF/NiFi. The project must be scalable. I've seen that people adopt two different flow strategies:
1.) Get the tweets ---> Put them into the HDFS ---> analyze with Hive
2.) Get the tweets ---> Stream with Kafka(publish/consumer) ---> Put them into the HDFS ---> Analyze with Hive
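In both flows, the Hive step looks the same once the tweets land in HDFS. Here is a minimal sketch of that side, assuming tweets are written one JSON object per line under a hypothetical `/data/tweets` directory and that the HCatalog JSON SerDe is available on the classpath (adjust columns and paths to your data):

```sql
-- External table over the raw tweet files; dropping the table
-- removes only the metadata, not the files in HDFS.
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  `user` STRUCT<screen_name:STRING, followers_count:INT>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/tweets';
```

Because the choice between flow 1 and flow 2 only changes how data reaches `/data/tweets`, you can switch strategies later without touching the Hive layer.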
So, my question is: what's the difference? Isn't the first strategy scalable? Which strategy would you follow? Thank you.
I personally prefer the Kafka method. It lets NiFi scale independently of Kafka (assuming they are not on the same cluster). It also decouples the data from HDFS (which is likewise usually separate from NiFi) at the point of ingestion.
With this method you have many options for decoupling the processing after ingestion, and you can take advantage of Kafka's pub/sub fundamentals to avoid processing duplicate data.
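To make the duplicate-avoidance point concrete, here is a toy Python sketch of the consumer-group offset tracking that Kafka provides. Everything here is a stand-in (a list for the topic, a dict for committed offsets); it is not the Kafka API, just an illustration of why consumers in a group don't re-process records:

```python
# Toy model of Kafka-style offset commits: a consumer group records the
# offset of the last record it processed, so subsequent polls (or a
# restarted consumer) resume after that offset instead of re-reading
# the topic from the beginning.
topic = ["tweet-1", "tweet-2", "tweet-3", "tweet-4"]  # one partition
committed = {"hive-loader": 0}  # committed offset per consumer group

def poll(group, max_records=2):
    """Return the next batch for a group and commit the new offset."""
    start = committed[group]
    batch = topic[start:start + max_records]
    committed[group] = start + len(batch)
    return batch

first = poll("hive-loader")   # ["tweet-1", "tweet-2"]
second = poll("hive-loader")  # ["tweet-3", "tweet-4"] -- no duplicates
```

A separate group (say `"spark-job"`) would get its own offset and read the same records independently, which is the decoupling the answer refers to.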
That said, the other option is arguably fine as well, since it can be done without Kafka. The decision then comes down to which components your team has available and is comfortable with.