My current project is on mainframes with DB2 as its database. We have 70 databases with nearly 60 tables in each of them. Our architect proposed a plan of using Kafka with Spark Streaming for processing the data. How well does Kafka handle reading data from RDBMS tables? Do we read the data from the tables directly with Kafka, or is there another way to get the data from the RDBMS into Kafka? If there is a better solution, your suggestions would help a lot.
Kafka is a message queue system. You write producers and consumers for Kafka; there are existing ones from the community and certified ones from partners. Producers read data from a source and publish it to a Kafka queue. Consumers read data from a Kafka queue and write it to a destination.
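To make the producer/consumer split concrete, here is a minimal stdlib sketch of the pattern. A real producer/consumer would use a Kafka client library against a running broker; in this sketch an in-memory deque stands in for the topic so it runs anywhere, and the row shapes and names are made up for illustration.

```python
import json
from collections import deque

# An in-memory deque stands in for a Kafka topic so the sketch runs
# without a broker; a real client would publish to and poll the broker.
topic = deque()  # hypothetical stand-in for a "db2-rows" topic

def produce(source_rows):
    """Producer side: read rows from a source and publish them as messages."""
    for row in source_rows:
        topic.append(json.dumps(row).encode("utf-8"))

def consume(sink):
    """Consumer side: read messages from the queue and write them to a destination."""
    while topic:
        message = topic.popleft()
        sink.append(json.loads(message.decode("utf-8")))

# Hypothetical source rows and a plain list as the destination.
source = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
destination = []
produce(source)
consume(destination)
print(destination)  # rows arrive at the destination in publish order
```

The point of the pattern is the decoupling: the producer knows nothing about the destination, and the consumer knows nothing about the source; the queue in the middle is the only contract between them.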
I get the feeling that your solution architect is thinking of getting the data from the source that is feeding DB2. I don't know what that source is, so I'll use the typical Twitter example. I would have a Kafka Twitter producer get tweets from the Twitter firehose API and publish them to a Kafka queue. I would then have a Spark Streaming consumer take those tweet messages and feed them into my Spark app to filter all hashtags and aggregate the top ten and bottom ten by language.
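The filtering and aggregation step in that example can be sketched in plain Python. In a real deployment this logic would live inside the Spark app; the tweet shape (a dict with `lang` and `text` keys) and the hashtag regex are assumptions for illustration.

```python
import re
from collections import Counter, defaultdict

def hashtag_counts_by_language(tweets):
    """Count hashtag occurrences per language.

    Each tweet is assumed to be a dict with 'lang' and 'text' keys,
    roughly the shape a Twitter producer might publish to the queue.
    """
    counts = defaultdict(Counter)
    for tweet in tweets:
        for tag in re.findall(r"#(\w+)", tweet["text"]):
            counts[tweet["lang"]][tag.lower()] += 1
    return counts

def top_and_bottom(counter, n=10):
    """Return the n most common and n least common hashtags for one language."""
    ranked = counter.most_common()
    return ranked[:n], ranked[-n:]

tweets = [
    {"lang": "en", "text": "Loving #kafka and #spark"},
    {"lang": "en", "text": "#kafka streams are fun"},
    {"lang": "fr", "text": "Bonjour #paris"},
]
counts = hashtag_counts_by_language(tweets)
top, bottom = top_and_bottom(counts["en"], n=2)
print(top)  # → [('kafka', 2), ('spark', 1)]
```

In the streaming version the counts would be maintained over a window rather than over a fixed list, but the per-batch computation is the same.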
There are no tables in Kafka. Structured Streaming (the successor to Spark Streaming) does have the DataFrame API, so you can work on the data in a table format and use SQL queries in the app. But both require building applications in your language of choice, not just SQL/HQL.
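To show what "table format plus SQL" means for queued messages, here is a stdlib analogy using sqlite3, since a runnable Structured Streaming job needs a Spark cluster and a broker. The message shape, table name, and columns are assumptions; in Spark you would express the same query over a streaming DataFrame instead.

```python
import json
import sqlite3

# Messages as they might arrive from a Kafka topic, one JSON row each.
# Field and table names here are made up for illustration.
messages = [
    b'{"lang": "en", "hashtag": "kafka"}',
    b'{"lang": "en", "hashtag": "kafka"}',
    b'{"lang": "en", "hashtag": "spark"}',
    b'{"lang": "fr", "hashtag": "paris"}',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (lang TEXT, hashtag TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (:lang, :hashtag)",
    [json.loads(m) for m in messages],
)

# The same shape of aggregation you could write against a DataFrame
# registered as a temporary view in a Spark app.
rows = conn.execute(
    "SELECT lang, hashtag, COUNT(*) AS n FROM tweets "
    "GROUP BY lang, hashtag ORDER BY n DESC"
).fetchall()
print(rows)
```

The difference in Spark is that the "table" is unbounded and the query re-evaluates as new messages arrive; the SQL itself looks much the same.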
I can't recommend a better solution without going down the path of being your solution architect. If you come up with additional questions, open new topics and ask them, and invite your solution architect to join this community. If you want to and can divulge information like data source(s), data type(s), data format(s), usage patterns, etc., I can point you in a direction.