I'm a newbie in the Hadoop ecosystem.
I want to do a project where I stream some tweets and analyze them in Hive. The whole process has to be done in HDF/NiFi, and the project must be scalable.
I saw here that people adopt two different flow strategies:
1.) Get the tweets ---> Put them into HDFS ---> Analyze with Hive
2.) Get the tweets ---> Stream with Kafka (publish/consume) ---> Put them into HDFS ---> Analyze with Hive
So, my question is: what's the difference? Is the first strategy not scalable?
Which strategy would you follow?
@Ivan_M93 Great question!!!
I personally prefer the Kafka method. It lets NiFi scale independently of Kafka (assuming they are not in the same cluster). It also decouples the data from HDFS (which is usually separate from NiFi as well) at the point of ingestion.
With this method you have plenty of options for decoupling the processing that happens after ingestion, and you can take advantage of the basics of Kafka (pub/sub) to avoid processing duplicate data.
That said, the other option is arguably just as good, since it can be done without Kafka. The decision then comes down to which components you have available and which ones your team is comfortable with.
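To make the pub/sub point a bit more concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Kafka client API: `ToyTopic`, `poll`, and the group names `hdfs-writer` and `analytics` are all made up for illustration. It only shows the idea that each consumer group tracks its own position in the log, so one group does not see the same record twice while a second group can still read everything independently.

```python
from collections import defaultdict


class ToyTopic:
    """Hypothetical in-memory stand-in for a single-partition topic."""

    def __init__(self):
        self.log = []                      # append-only list of records
        self.committed = defaultdict(int)  # group name -> next offset to read

    def publish(self, record):
        """Append a record to the end of the log (the ingest side)."""
        self.log.append(record)

    def poll(self, group, max_records=10):
        """Return unread records for `group` and advance its offset."""
        start = self.committed[group]
        batch = self.log[start:start + max_records]
        self.committed[group] = start + len(batch)
        return batch


topic = ToyTopic()
for i in range(5):
    topic.publish(f"tweet-{i}")

# The group that writes to HDFS reads in two batches without overlap.
hdfs_batch_1 = topic.poll("hdfs-writer", max_records=3)
hdfs_batch_2 = topic.poll("hdfs-writer", max_records=3)

# A second, independent group still sees every record from the start.
analytics_batch = topic.poll("analytics", max_records=10)

print(hdfs_batch_1)
print(hdfs_batch_2)
print(analytics_batch)
```

In a real flow, NiFi's Kafka publish and consume processors and the brokers handle this bookkeeping for you. Real Kafka also has partitions, rebalancing, and delivery guarantees that this sketch ignores.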
Steven @ DFHZ