
Is there any way to keep the data in DB2 while using Apache Storm?

New Contributor

Assuming that I need to use Apache Storm, would it be possible to do so while keeping the data in DB2? Or do I need to move the data from DB2 to HDFS (using, for instance, Sqoop or IBM CDC) in order to be able to use Apache Storm?


5 REPLIES

Master Guru (accepted solution)

I do not see a native way to stream data from a database in Storm. There is a JDBC connector, but it is for inserting results and doing lookups.

It's not impossible, however. Other streaming products I have worked with in the past could stream from a database, essentially by specifying WHERE conditions or re-querying the same table every x seconds.

So you could definitely implement a Storm spout like that. Depending on data volumes, you might have to partition the load similarly to how Sqoop does it (having multiple spouts that each read with a WHERE condition on some ID), as sketched below, or, if the volumes are not too large, just have a single spout.
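
For illustration, here is a minimal sketch of how such a partition predicate could be derived from the spout's task index, so that several parallel spout instances split the table between them (the STAGING_EVENTS table and its ID column are hypothetical; Storm 1.x package names assumed):

```java
import org.apache.storm.task.TopologyContext;

// Sketch: called from a spout's open() method, where Storm passes in the
// TopologyContext. Each parallel spout task gets its own slice of the ID
// space, analogous to Sqoop's --split-by column.
public class PartitionedQuery {
    public static String buildQuery(TopologyContext context) {
        int numTasks = context.getComponentTasks(context.getThisComponentId()).size();
        int taskIndex = context.getThisTaskIndex();
        // Each task reads only rows where MOD(ID, numTasks) equals its index.
        return "SELECT ID, PAYLOAD, TS FROM STAGING_EVENTS"
             + " WHERE MOD(ID, " + numTasks + ") = " + taskIndex;
    }
}
```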

If it is a simple single-connection scenario, I am sure you could implement it in a very short time.

http://storm.apache.org/documentation/storm-jdbc.html

If I were to implement a JDBCSpout, I would take the Twitter example from the links below and replace the Twitter code with a JDBC connection opened against DB2. If you read some kind of staging table that is refreshed every x seconds, you would read it completely and then check whether a specific amount of time has passed (the nextTuple method is called continuously). If you only want to read new tuples, you would have to add a WHERE condition based on a timestamp column in the DB2 table.

It also has some pointers on how to make parallel spouts in case a single connection is not fast enough.

https://github.com/storm-book/examples-ch04-spouts

https://www.safaribooksonline.com/library/view/getting-started-with/9781449324025/ch04.html
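
To make the idea concrete, here is a minimal sketch of such a polling spout. It assumes Storm 1.x package names, the DB2 JCC driver (com.ibm.db2.jcc.DB2Driver), and a hypothetical STAGING_EVENTS table with ID, PAYLOAD, and TS (timestamp) columns; the connection details are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

// A minimal polling spout: re-queries a DB2 staging table for rows newer
// than the last timestamp it has seen. Table and column names are illustrative.
public class Db2PollingSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;
    private Connection conn;
    private Timestamp lastSeen = new Timestamp(0L);
    private long lastPollMs = 0L;
    private static final long POLL_INTERVAL_MS = 5000L; // re-query every 5 seconds

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            // The DB2 JDBC (type 4) driver jar must be on the worker classpath.
            Class.forName("com.ibm.db2.jcc.DB2Driver");
            conn = DriverManager.getConnection(
                    "jdbc:db2://db2host:50000/SAMPLE", "user", "password");
        } catch (Exception e) {
            throw new RuntimeException("Could not connect to DB2", e);
        }
    }

    @Override
    public void nextTuple() {
        // nextTuple() is called continuously, so only hit the database once
        // per polling interval and yield briefly otherwise.
        long now = System.currentTimeMillis();
        if (now - lastPollMs < POLL_INTERVAL_MS) {
            Utils.sleep(50);
            return;
        }
        lastPollMs = now;
        String sql = "SELECT ID, PAYLOAD, TS FROM STAGING_EVENTS WHERE TS > ? ORDER BY TS";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setTimestamp(1, lastSeen);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    lastSeen = rs.getTimestamp("TS");
                    // Passing the ID as message id lets Storm replay the
                    // tuple if a downstream bolt fails it.
                    collector.emit(new Values(rs.getLong("ID"), rs.getString("PAYLOAD")),
                                   rs.getLong("ID"));
                }
            }
        } catch (Exception e) {
            collector.reportError(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "payload"));
    }
}
```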

New Contributor

Thanks a lot for your help!

New Contributor

Thanks for your reply. Does this mean that, at the end of the day, a copy of the DB2 tables will exist in HDFS, and that the Storm process will read the data from HDFS? What I am trying to avoid is precisely having a copy of the DB2 data in HDFS.

Master Guru

No, if you implement a JDBCSpout, there is nothing in HDFS at all. Storm by itself has nothing to do with HDFS. It is, however, often used together with HDFS for storing real-time results, via the HDFSBolt. I have also seen implementations reading from HDFS, but it is not a requirement.

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_storm-user-guide/content/writing-data-wi...

By default, Storm has no dependencies on HDFS. It is not that common to use HDFS as a source anyway, since Storm normally works on real-time data (Kafka, MQ, HTTP calls, TCP input, reading from a spooling directory, and so on).

So if you implement a JDBCSpout using the DB2 JDBC library, it will not store anything in HDFS unless you add an HDFSBolt.
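
For contrast, writing to HDFS only happens when you explicitly wire in such a bolt. Here is a sketch of a typical HdfsBolt configuration from the storm-hdfs module; the NameNode URL, output path, and rotation settings are illustrative:

```java
import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;

public class HdfsBoltExample {
    public static HdfsBolt build() {
        return new HdfsBolt()
                .withFsUrl("hdfs://namenode:8020")        // illustrative NameNode URL
                .withFileNameFormat(new DefaultFileNameFormat().withPath("/storm/results/"))
                .withRecordFormat(new DelimitedRecordFormat().withFieldDelimiter(","))
                .withRotationPolicy(new FileSizeRotationPolicy(5.0f, Units.MB))
                .withSyncPolicy(new CountSyncPolicy(1000)); // sync every 1000 tuples
    }
}
```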

Guru

Just put your JDBC driver on the classpath and then open the connection to the DB just like you would from any Java program. Storm is not dependent on HDFS; in fact, you don't even need a Hadoop cluster to run Storm. You can read from and write to the DB2 database on every event that comes through Storm.
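
A minimal sketch of such a write path, reusing the illustrative DB2 connection details from the spout example above (the EVENTS_OUT table and its columns are hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// Writes every incoming tuple to DB2 over plain JDBC. Connection details
// and table/column names are illustrative placeholders.
public class Db2WriterBolt extends BaseBasicBolt {

    private transient Connection conn;
    private transient PreparedStatement insert;

    @Override
    public void prepare(Map conf, TopologyContext context) {
        try {
            Class.forName("com.ibm.db2.jcc.DB2Driver"); // driver jar on the classpath
            conn = DriverManager.getConnection(
                    "jdbc:db2://db2host:50000/SAMPLE", "user", "password");
            insert = conn.prepareStatement(
                    "INSERT INTO EVENTS_OUT (ID, PAYLOAD) VALUES (?, ?)");
        } catch (Exception e) {
            throw new RuntimeException("Could not connect to DB2", e);
        }
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        try {
            insert.setLong(1, tuple.getLongByField("id"));
            insert.setString(2, tuple.getStringByField("payload"));
            insert.executeUpdate();
        } catch (Exception e) {
            // Rethrowing fails the tuple so Storm can replay it upstream.
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: it emits no output stream.
    }
}
```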