Support Questions

aishwarya020793 · ‎02-22-2017

dineshc · ‎02-22-2017

Ideally you could adapt the Nifi Change Data Capture use case to extract data from MySQL, Oracle, MSSQL and other traditional RDMS.

You could refer this article on HCC - https://community.hortonworks.com/articles/55422/change-data-capture-using-nifi.html

Here is an overview:

When supported by the RDBMS that manages the source data table, turn on the table's CDC feature, which automatically creates in the background a dedicated CDC table which contains all of the columns in the source data table, as well as additional metadata columns that can be used to support down-stream ETL logic processing. The RDBMS will automatically detect the new and changed records within the source data table for you, and will duplicate those new and changed records into the dedicated CDC table. Against that dedicated CDC table, execute a QueryDatabaseTable processor which uses an SQL SELECT query to fetch the latest records written to the CDC table (since the last time the QueryDatabaseTable processor executed successfully).

If the source data table has columns which hold the time-stamps of when the record was first created, or if recently updated, and if you do not have access to a Hadoop environment which supports Sqoop, you can still use NiFi to bulk extract the records in the source data table in parallel, using streams. First, you logically fragment the source data table into windows of time, such as a given month of a given year. For each window of time, create a corresponding QueryDatabaseTable processor. In this way, you can easily execute the extract across N threads of a NiFi node (or on N NiFi nodes). Essentially, you create the first QueryDatabaseTable, clone it N-1 time, and simply edit the predicate expression of each SQL SELECT so that it fetches the source data table records that were created within the desired window of time.

If the source data table has the time-stamp for update events as well, then clones of these bulk extract QueryDatabaseTable processes can be slightly modified and used to grab on-going updates to records, as well as new records, created within those windows of time. These on-going CDC type QueryDatabaseTable processes can be scheduled to execute based on the probability of update events for a given window of time. The update time-stamp can then be used by NiFi to hand off individual CDC records to specific NiFi processors for routing, mediation, data transformation, data aggregation, data egress (e.g., PutKafka).

View solution in original post

dineshc · ‎02-22-2017

@aishwarya srivastava

Ideally you could adapt the Nifi Change Data Capture use case to extract data from MySQL, Oracle, MSSQL and other traditional RDMS.

You could refer this article on HCC - https://community.hortonworks.com/articles/55422/change-data-capture-using-nifi.html

Here is an overview:

When supported by the RDBMS that manages the source data table, turn on the table's CDC feature, which automatically creates in the background a dedicated CDC table which contains all of the columns in the source data table, as well as additional metadata columns that can be used to support down-stream ETL logic processing. The RDBMS will automatically detect the new and changed records within the source data table for you, and will duplicate those new and changed records into the dedicated CDC table. Against that dedicated CDC table, execute a QueryDatabaseTable processor which uses an SQL SELECT query to fetch the latest records written to the CDC table (since the last time the QueryDatabaseTable processor executed successfully).

If the source data table has columns which hold the time-stamps of when the record was first created, or if recently updated, and if you do not have access to a Hadoop environment which supports Sqoop, you can still use NiFi to bulk extract the records in the source data table in parallel, using streams. First, you logically fragment the source data table into windows of time, such as a given month of a given year. For each window of time, create a corresponding QueryDatabaseTable processor. In this way, you can easily execute the extract across N threads of a NiFi node (or on N NiFi nodes). Essentially, you create the first QueryDatabaseTable, clone it N-1 time, and simply edit the predicate expression of each SQL SELECT so that it fetches the source data table records that were created within the desired window of time.

If the source data table has the time-stamp for update events as well, then clones of these bulk extract QueryDatabaseTable processes can be slightly modified and used to grab on-going updates to records, as well as new records, created within those windows of time. These on-going CDC type QueryDatabaseTable processes can be scheduled to execute based on the probability of update events for a given window of time. The update time-stamp can then be used by NiFi to hand off individual CDC records to specific NiFi processors for routing, mediation, data transformation, data aggregation, data egress (e.g., PutKafka).

aishwarya020793 · ‎02-22-2017

hi @Dinesh Chitlangia thanks for the solution, but i don't have to use sandbox. I just need to connect to my sample mysql data and write the data of tables into files. How should I proceed?

dineshc · ‎02-22-2017

Yes you do not necessarily have to use the sandbox. The article link I shared was of a tutorial and hence it used a sandbox.

In your case you can do the required tasks in your hadoop cluster as required.

aishwarya020793 · ‎02-22-2017

Okk thanks

aishwarya020793 · ‎02-22-2017

@Matt Clarke @Pierre Villard @Timothy Spann

Cloudera Community

Support Questions

hi, how can i extract MySQL data and put into file through nifi? please help.