I'm a newbie in the Hadoop ecosystem.
I want to do a project where I stream tweets and analyze them in Hive, with the whole pipeline built in HDF/NiFi. The project must be scalable.
I saw here that people adopt two different flow strategies:
1.) Get the tweets ---> Put them into HDFS ---> Analyze with Hive
2.) Get the tweets ---> Stream them through Kafka (publish/consume) ---> Put them into HDFS ---> Analyze with Hive (see my sketch of the Kafka step below)
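To make sure I understand what the extra Kafka step in option 2 buys me, here is a rough sketch of it outside NiFi (in my real flow the PublishKafka/ConsumeKafka processors would do this; the `tweets` topic name and the `localhost:9092` broker are just my assumptions for illustration):

```python
import json

from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer side: the tweet fetcher publishes each raw tweet to a topic,
# independent of how fast the downstream HDFS writer is.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda t: json.dumps(t).encode("utf-8"),
)
producer.send("tweets", {"id": 1, "text": "hello hadoop"})
producer.flush()

# Consumer side: a separate process reads from the topic and would batch
# the tweets into HDFS, decoupled from the fetch rate.
consumer = KafkaConsumer(
    "tweets",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # in the real flow: buffer and write to HDFS
```

As far as I can tell, the point is that the producer and consumer sides can be scaled and restarted independently, whereas in option 1 the fetcher writes straight to HDFS.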
So, my question is: what's the difference between the two? Is the first strategy not scalable?
Which strategy would you follow?
Thank you.