So, our goal is to pull data from a database and push it to HDFS. The data is streaming in, so we want to make sure we pull incremental data every hour or so. NiFi seems like a good choice for this, but given the lack of documentation we are trying to find examples of a data flow from SQL to a log file or to HDFS. To be more precise, we want to know which processors we can use. For example: http://www.slideshare.net/hortonworks/design-a-dataflow-in-7-minutes-58718224
The example here pulls data from Twitter to HDFS. We need it from SQL to HDFS.
You can create a flow with QueryDatabaseTable, using a criterion like a maximum timestamp column, and then route its output to a PutHDFS processor to publish the data to HDFS.
To address your documentation comment, the processor documentation is readily available here, specifically for @milind pandit's suggestion of QueryDatabaseTable. In NiFi 1.0 / HDF 2.0 there will also be a GenerateTableFetch processor, which is like QueryDatabaseTable except that it generates flow files containing SQL statements rather than executing the SQL itself. On a cluster, this allows you to partition the table and fetch the partitions in parallel using ExecuteSQL to retrieve the records.
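To make the distinction concrete, here is a rough sketch of the idea behind GenerateTableFetch: rather than running a query, it emits one SQL statement per partition so that ExecuteSQL instances across a cluster can run them in parallel. This is only an illustration of the pattern, not NiFi's actual implementation; the table name, column name, and LIMIT/OFFSET paging scheme are assumptions.

```python
# Sketch of the "generate SQL instead of executing it" pattern.
# Table/column names and the LIMIT/OFFSET paging are assumptions,
# not NiFi's exact internal behavior.
def generate_table_fetch(table, max_value_col, last_max, row_count, partition_size):
    """Yield one SQL statement per partition of the not-yet-fetched rows."""
    statements = []
    for offset in range(0, row_count, partition_size):
        statements.append(
            f"SELECT * FROM {table} "
            f"WHERE {max_value_col} > {last_max} "
            f"ORDER BY {max_value_col} "
            f"LIMIT {partition_size} OFFSET {offset}"
        )
    return statements

# 2500 new rows split into partitions of 1000 -> three statements,
# which downstream ExecuteSQL processors could run concurrently.
stmts = generate_table_fetch("events", "id", 0, row_count=2500, partition_size=1000)
```

Each generated statement would become the content of one flow file, distributed across the cluster for execution.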
The important aspect in both cases is that you have a column (like the "max timestamp" mentioned in the original answer) whose value increases as new records/rows are added. Sometimes that column is the unique ID / primary key, sometimes it is a timestamp, etc. The criterion is that new rows are fetched based on the last maximum value observed for the column(s) you specify in the processor.
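The maximum-value-column pattern that QueryDatabaseTable automates can be sketched as follows. This is a minimal illustration using an in-memory SQLite table; the table name, column names, and the way state is persisted are all assumptions for the example (NiFi stores the last maximum value in processor state for you).

```python
# Sketch of the incremental "maximum-value column" pattern.
# All table/column names here are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT, updated_at INTEGER)")
conn.executemany("INSERT INTO events (payload, updated_at) VALUES (?, ?)",
                 [("a", 100), ("b", 200), ("c", 300)])

last_max = 0  # persisted state; QueryDatabaseTable tracks this per column

def fetch_incremental(conn, last_max):
    """Fetch only rows whose max-value column exceeds the last observed maximum."""
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_max,),
    ).fetchall()
    new_max = rows[-1][2] if rows else last_max
    return rows, new_max

rows, last_max = fetch_incremental(conn, last_max)  # first run: all 3 rows
rows, last_max = fetch_incremental(conn, last_max)  # nothing new: 0 rows

conn.execute("INSERT INTO events (payload, updated_at) VALUES (?, ?)", ("d", 400))
rows, last_max = fetch_incremental(conn, last_max)  # picks up only the new row
```

Run on a schedule (e.g. hourly), each execution emits only the rows added since the previous run, which is exactly the incremental behavior wanted here.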
Note that if you use GenerateTableFetch, you can sometimes get duplicate rows after the ExecuteSQL processor.
Matt, if you know how to resolve this issue with GenerateTableFetch, please share.
Another option you can look at is the standard ExecuteSQL processor. You'll have to manage the start and end dates of the range you're querying yourself, as attributes, and ensure you increment them properly. We built a loop that continuously incremented these date-range attributes and passed them into ExecuteSQL for execution. (It behaved almost like real time, depending on how fast your DB can return results.)
QueryDatabaseTable is definitely the preferred approach, but because of how our DB was indexed, we weren't able to use it efficiently.
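The date-range loop described above can be sketched like this. The table name, column name, window size, and query shape are assumptions for illustration; in the actual flow these start/end values lived in flow-file attributes that were fed into ExecuteSQL.

```python
# Hypothetical sketch of the self-managed date-window loop:
# advance a [start, end) window each iteration and build the query
# that ExecuteSQL would run for that window.
from datetime import datetime, timedelta

def next_window(start, step=timedelta(hours=1)):
    """Build the query for one window and return the next window's start."""
    end = start + step
    query = (
        "SELECT * FROM events "
        f"WHERE updated_at >= '{start.isoformat()}' "
        f"AND updated_at < '{end.isoformat()}'"
    )
    return query, end  # `end` becomes the next iteration's start

start = datetime(2016, 1, 1, 0, 0)
for _ in range(3):  # in the real flow, this loop ran continuously
    query, start = next_window(start)
```

Using half-open windows (`>= start`, `< end`) avoids fetching the boundary row twice when the next window begins where this one ended.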