To understand what it takes to work with various streaming tools, I have created this question as an umbrella for an overview of the ways to stream data.
For consistency I picked a simple reference use case: messages arrive from Kafka and need to be written to HDFS.
Source topic name: input
Output folder name on HDFS: output
The core use case is picking up a bit of data from Kafka and putting it on HDFS.
The bonus use case is adding a new field C, defined by dividing fields A and B, which both occur in the data; ideally the schema would be used for this.
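To make the bonus use case concrete, here is a minimal sketch of the per-message transformation, independent of any particular streaming tool. It rests on assumptions that are not specified in the question: messages are JSON objects, A and B are numeric, and "dividing fields A and B" means C = A / B. The function name add_field_c is only illustrative.

```python
import json
from typing import Optional


def add_field_c(raw_message: str) -> Optional[str]:
    """Add C = A / B to one JSON message and return it re-serialized.

    Assumptions (not stated in the question): the message is a JSON
    object, A and B are numeric, and C is A divided by B.
    Returns None when the message cannot be parsed or lacks A or B.
    """
    try:
        record = json.loads(raw_message)
        a = float(record["A"])
        b = float(record["B"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing field, or non-numeric value.
        return None

    # Avoid a ZeroDivisionError; emit a null C instead of dropping the record.
    record["C"] = a / b if b != 0 else None
    return json.dumps(record)


if __name__ == "__main__":
    print(add_field_c('{"A": 6, "B": 3}'))
```

In an actual pipeline this logic would normally be expressed in the tool's own terms (for example a SQL expression or a record processor), and the handling of bad records and zero divisors is a design choice each answer should state.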
Subquestions:
Streaming data from Kafka to HDFS with NiFi
Streaming data from Kafka to HDFS with Flink
Streaming data from Kafka to HDFS with Flink SQL
Streaming data from Kafka to HDFS with Spark Interactive
Streaming data from Kafka to HDFS with a Spark Jar
Streaming data from Kafka to HDFS with Kafka Connect
If a substep is well documented, do not hesitate to refer to it, but please ensure the end-to-end process is documented, including building and deployment.
If you notice this question is not specified well, or if something is blocking one of the subquestions from being answered, please post a comment.
- Dennis Jaheruddin