Created 08-13-2020 01:07 PM
We have a requirement where we receive input files of different types, i.e. avro, json and csv (example: client1_product.avro, client2_product.json, client3_product.csv, etc.), and we have to convert these files to csv and load them into db2.
Currently I have designed a job with GetFile to fetch the avro files and the ConvertRecord processor to convert the records (avro to csv) and load them into db2. This works fine, but we need to design a generic job that reads any of these file formats (avro, json and csv) and loads it into a db2 table.
Can we design a NiFi job that reads files of different formats and converts them to csv in the same job?
Could you please advise.
Thanks,
Avi
Created on 08-14-2020 06:15 AM - edited 08-14-2020 06:22 AM
@avi166 This is a common use case for NiFi: a data flow with a single entry point for data files of different expected types, up to and including all types. For example, you can create an API with HandleHttpRequest/HandleHttpResponse to accept a POST of a file. Another example is using GetFile/ListFile/etc. at the top of a flow to read a directory. Another increasingly common example is to get the files from Amazon S3.
After the top of the flow, where files arrive inbound, it is common to start with a single flow that has a single branch for a specific use case. This is what you have built so far for one file type. To improve your flow, add RouteOnAttribute to check whether the file name ends in .csv. This creates a "csv" route, which you then connect to the downstream branch that handles that type. Next, split the flow in the same way for the other types (TXT, AVRO, etc.), and add one route for unmatched types. Once the split is made, you can build a separate branch (data flow) for each route and add whatever processors are needed to prepare that type for insertion. Sometimes you can create a branch that handles multiple types too. Some branches may take 3-5 or more processors to prepare the data for DB2, while others may need just 1 or 2.
When all the different branches are ready, route them all back to a single processor or process group that handles the insert into DB2. So you have a flow with a single entry that splits into many branches and then rejoins at the bottom. While working this way, you may build separate branches and realize later that you could combine them by making a new branch that is a little more dynamic. You should always be looking to improve your flows over time in this manner.
If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post.
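The split-and-rejoin shape described above can be sketched outside NiFi in plain Python. This is only an illustration of the pattern, not NiFi code: the route names, the `BRANCHES` table and the `process` function are hypothetical, and Avro is left without a branch because reading it needs a third-party library. In NiFi itself the routing check would typically be a RouteOnAttribute property written in Expression Language against the `filename` attribute, for example something like `${filename:toLower():endsWith('.csv')}`.

```python
import csv
import io
import json
from pathlib import PurePath

# Hypothetical route names, one per expected file extension.
ROUTES = {".csv": "csv", ".json": "json", ".avro": "avro"}


def route_for(filename: str) -> str:
    """Pick a route from the file extension, ignoring case."""
    return ROUTES.get(PurePath(filename).suffix.lower(), "unmatched")


def read_csv(text: str) -> list:
    """CSV branch: parse rows into dictionaries keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json(text: str) -> list:
    """JSON branch: accept either one object or an array of objects."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


# One branch per route. "avro" and "unmatched" have no branch in this sketch.
BRANCHES = {"csv": read_csv, "json": read_json}


def to_csv(rows: list) -> str:
    """Common tail of the flow: every branch rejoins here."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def process(filename: str, text: str) -> str:
    """Single entry point: route, run the matching branch, then rejoin."""
    route = route_for(filename)
    branch = BRANCHES.get(route)
    if branch is None:
        raise ValueError(f"no branch configured for route {route!r}")
    return to_csv(branch(text))
```

Calling `process("client2_product.json", ...)` and `process("client3_product.csv", ...)` goes through different readers but ends in the same `to_csv` step, which mirrors branches converging on one DB2 insert step.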
Thanks,
Steven @ DFHZ
Created 08-14-2020 11:15 AM
Hi Steve,
Perfect, thanks a lot. I have a question on the input file structure. Once RouteOnAttribute checks that the file name ends in .CSV and creates a "csv" route that directs the file downstream to be read, can the downstream flow read different file structures (example: file1 has id, name, loc and file2 has id and name)? Can we configure the ConvertRecord processor to convert files of the same format but with different structures to a generic file structure?
Thanks,
Avinash
Created 08-15-2020 04:35 AM
@avi166 I think by the time you get to RouteOnAttribute you should have already read the file, but there isn't really a right or wrong answer. One of the things I like the most about NiFi is that there are many different ways to achieve the same end result.
To answer your next question: you may be able to use the same flow for different CSV file structures, and you should if you can, but don't be afraid to split the flow again as I have outlined above. The branches may need different schemas and different record readers, but the same record writer. Rejoin again when the files are ready to converge into the same processor or branch of functionality.
I also tried to point out that at first you may, for example, have to route some CSVs to a different file structure branch. Then, after finishing all the CSV branches and knowing the differences between them, you should be able to make a final, more dynamic branch that replaces one or more of the previous branches. These are tuning and optimization steps that you really won't know until you evaluate that final flow branch against the previous versions.
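As a rough illustration of what one "more dynamic" CSV branch could do, here is a minimal Python sketch that maps CSVs with different column sets onto one target layout. It is not NiFi code and does not show how ConvertRecord is configured. The `TARGET_FIELDS` layout is an assumption taken from the example columns in the question (id, name, loc), and `normalize_csv` is a hypothetical helper name.

```python
import csv
import io

# Assumed target layout, based on the example columns mentioned in the question.
TARGET_FIELDS = ("id", "name", "loc")


def normalize_csv(text: str, fields=TARGET_FIELDS) -> str:
    """Rewrite a CSV so it always has the target columns, in the target order.

    Columns missing from the input are written as empty values, and
    columns not in the target layout are dropped.
    """
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(fields),
        restval="",             # fill columns the input does not have
        extrasaction="ignore",  # drop columns the target does not want
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

With this approach a file that has only id and name and a file that has id, name and loc both come out with the same three-column header, so a single downstream step can load either one.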
Created 08-15-2020 02:46 PM
Thanks Steven