Support Questions

Is it a best practice to use the NiFi QueryDatabaseTable processor to export several GB of data from 250 SQL Server tables to S3, with incremental updates?

Expert Contributor

Just wondering whether we can achieve this requirement using a NiFi processor. Is it designed for this kind of historical data or bulk upload from 250 SQL Server tables to S3?

I am aware that the QueryDatabaseTable processor supports incremental updates when new records are inserted. How do we configure this processor to also sync updates and deletes to existing records, i.e., do we give it a unique column so that NiFi can identify whenever an existing record is updated and sync it to S3? How can this be achieved with NiFi? I believe NiFi is not designed for bulk upload.

I have another use case for bulk upload from MongoDB to S3. Please let me know whether NiFi is the right tool for these types of use cases.

If so, I believe I would need to write a shell script to copy each of the 250 SQL Server tables to its own file in S3 in JSON format? Please confirm. What is the best practice for querying 250 tables and exporting them to S3?

Thank you.

1 ACCEPTED SOLUTION

Master Guru

QueryDatabaseTable would require a "last modified" column in the table(s) in order to detect updates, and probably a "logical delete" flag (i.e. boolean column) in that (or a helper) table in order to detect deletes. This is similar to what Apache Sqoop does.
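
For illustration, here is a minimal sketch assuming a hypothetical table dbo.Customers and a column naming convention of your choosing; your application (or a trigger) would have to keep last_modified current on every UPDATE and set is_deleted instead of physically deleting rows:

    -- Hypothetical columns: last_modified drives incremental fetch,
    -- is_deleted marks logical deletes instead of removing rows.
    ALTER TABLE dbo.Customers ADD last_modified DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME();
    ALTER TABLE dbo.Customers ADD is_deleted BIT NOT NULL DEFAULT 0;

    -- With Maximum-value Columns set to "last_modified", QueryDatabaseTable
    -- effectively issues a query of this shape on each run, remembering the
    -- largest value it has seen so far (@previous_max is just a placeholder):
    SELECT * FROM dbo.Customers WHERE last_modified > @previous_max;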

If you have the Enterprise Edition of SQL Server, you may be able to enable their Change Data Capture feature. Then, for incremental changes, you can use QueryDatabaseTable against the "CDC Table" rather than your source tables.
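
If CDC is an option, the setup is roughly the following (schema and table names are just placeholders, and whether the binary __$start_lsn column is usable as QueryDatabaseTable's maximum-value column is something you would want to verify):

    -- Enable CDC on the database, then on one source table (names are placeholders).
    EXEC sys.sp_cdc_enable_db;
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'Customers',
         @role_name     = NULL;

    -- SQL Server then maintains a change table (here cdc.dbo_Customers_CT) whose
    -- rows carry __$operation (1 = delete, 2 = insert, 3/4 = update before/after)
    -- and __$start_lsn, a monotonically increasing high-water mark for incremental reads.
    SELECT __$start_lsn, __$operation, *
    FROM cdc.dbo_Customers_CT;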

For strict migration (no incremental fetching of updates) of multiple tables in SQL Server, if you can generate individual flow files, each containing an attribute such as "table.name", then you could parallelize across a NiFi cluster by sending them to ExecuteSQL with the query set to "SELECT * FROM ${table.name}". In this case each instance of ExecuteSQL will get all the rows from one table into an Avro record and send it along the flow.
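
One way to generate those flow files (just a sketch; ListDatabaseTables can also emit one flow file per table) is to pull the table names from the catalog and split the result so each row becomes a flow file carrying a table.name attribute:

    -- List the user tables to export; each row can be split into its own flow file
    -- (e.g. ExecuteSQL -> SplitRecord) with the value promoted to a "table.name"
    -- attribute for the downstream ExecuteSQL running SELECT * FROM ${table.name}.
    SELECT TABLE_SCHEMA + '.' + TABLE_NAME AS table_name
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_TYPE = 'BASE TABLE';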

Regarding MongoDB, I don't believe the MongoDB processors support incremental fetching. QueryDatabaseTable might work on flat documents, but there is a bug that prevents nested fields from being returned, and aliasing the columns won't work for the incremental fetch part. However ExecuteSQL will work if you explicitly list (and alias) the document fields in the SQL statement, but that won't do incremental fetch.
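
For example, with a SQL-capable JDBC driver in front of MongoDB, the explicit-listing approach would look something like this (collection and field names are purely placeholders):

    -- Explicitly list and alias the top-level document fields instead of SELECT *;
    -- nested fields are the problematic part, so only flat fields are shown here.
    SELECT orderId  AS order_id,
           customer AS customer_name,
           total    AS order_total
    FROM orders;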

You might be able to use Sqoop for such things, but there are additional requirements if using sqoop-import-all-tables, and if doing incremental fetch you'd need 250 calls to sqoop import.

Do your tables all have a "last modified" column or some sort of similar structure? Supporting distributed incremental fetch for arbitrary tables is a difficult problem as you'd need to know the appropriate "last modified" column for each table (if they're not named the same and/or present in every table). When tables all behave the same way from an update perspective, it makes this problem much easier.
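
If the convention is consistent (say the column is always called last_modified), a quick catalog query will show which of the 250 tables are missing it:

    -- Find base tables that do NOT have a last_modified column;
    -- adjust the column name to whatever convention your schema actually uses.
    SELECT t.TABLE_SCHEMA, t.TABLE_NAME
    FROM INFORMATION_SCHEMA.TABLES t
    LEFT JOIN INFORMATION_SCHEMA.COLUMNS c
           ON c.TABLE_SCHEMA = t.TABLE_SCHEMA
          AND c.TABLE_NAME   = t.TABLE_NAME
          AND c.COLUMN_NAME  = 'last_modified'
    WHERE t.TABLE_TYPE = 'BASE TABLE'
      AND c.COLUMN_NAME IS NULL;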


3 REPLIES

Expert Contributor

@Matt Burgess I will look into the data and get back to you, Matt. Thank you very much! :)

Expert Contributor

@Matt Burgess I think you may have missed this bit, but just checking with you again: is this an effective solution for importing bulk batch data from SQL Server or MongoDB to Amazon S3 using NiFi, while also keeping it in sync with updates and deletes? Is NiFi designed for this purpose? We are looking to run these update/delete syncs as overnight jobs in NiFi. Please reply, thank you.