Created 12-18-2024 01:53 AM
I am using CSVReader and CSVRecordSetWriter using the Infer schema setting. But when I have values such as "030", even though the all the values has double quotes enclosed. When I write the schema into avro.schema, I see it is considered as type 'int'. But I want to treat it as string. Because of this our output after the processor looks like 30 and the first 0 is omitted. I want to use Infer schema property only because, I want to read the csv files dynamically without hardcoding the schema and use that.
Created on 12-19-2024 06:48 AM - edited 12-19-2024 06:50 AM
Hi @phadkev ,
Since you dont care if all the values come as string , then I would suggest doing the following:
1- Use the ExtractRecordSchema processor to generate the avro.schema attribute with the record schema as inferred by Nifi (Available 1.26+ version ) as follows:
The CSVReader for this processor will use infer schema strategy.
once you pass your CSV input through this processor you will have new flowfile attribute avro.schema with the following value
As Expected some of the values like name are assigned int type which we will take care of in the next step.
2- Use UpdateAttribute to replace any int type with string type inside the avro.schema attribute as follows:
The Expression Language used to re set the avro.schme attribute:
${avro.schema:replace("int","string")}
3- Use the QueryRecord with a different CSVReader from step 1 where this one uses the "use Schema Text Property" . Notice how by default the Shema Text property is set to the avro.schema attribute, which we generated from step 1&2:
Also make sure you set the same strategy for the CSVRecordWriter to ensure that the read and written CSV will be in the desired format.
Hope that helps. If it does, please accept the solution.
Thanks
Created 12-20-2024 05:49 AM
Hi @phadkev ,
Im glad its working for you. Something I forgot to mention is that this will perform OK if your CSV dataset is small in size. Im not sure how inferring schema works but I imagine it does full scan on data to determine the appropriate type and if this the case this operation will be costly with large data. That is why its always recommend to know your schema and pass it when working with records.
Having said this , I was concerned about the performance of my suggestion above so I did more research and found that actually there is an easier way to convert everything to string in much faster way where you don't need the extra processors. The suggestion is based on this post.
Basically, all you have to do in the CSVReader is to set the Access Schema Strategy to "Use String Fields From Header".
The CSVRecordWriter Access Schema Strategy should be set to "Inherit Record Schema"
That should do it. give it try and see how it goes.
Thanks
Created 12-18-2024 07:44 AM
Hi @phadkev ,
I'm not sure if there is a way around this without using Schema. You mentioned that you want to read the CSV files dynamically but does that mean you dont know what kind of CSV data you will get ? if so how are you parsing the data after conversion? what processor are you using the CSVReader & Write in ? maybe you can elaborate more on what you are trying to do with the data beginning to end to see if we can help in another way.
Created 12-18-2024 09:16 PM
Hi @SAMSAL ,
I'm using the CSV readers and writers in Query Record processor and CSVtoJSON Record Converter processor.
I am expecting around 10 types of CSV files(I mean to say, files with different columns). But this number might increase in the future upto 50. Even though I know what will these 50 files exactly contain, I'll still have to write all the 50 schemas in a schema registry. Which is complex to maintain and work with.
So I thought of this method in which I'll append all the values in a csv file with double quotes such that the reader and writer will consider it as a string. But it still continues to read it as an integer
Input File:
col1,col2,col3,col4,col5
999,C10,100,010,0
999,C06,10,010,0
This is what the CSV writer is giving:
Created on 12-19-2024 06:48 AM - edited 12-19-2024 06:50 AM
Hi @phadkev ,
Since you dont care if all the values come as string , then I would suggest doing the following:
1- Use the ExtractRecordSchema processor to generate the avro.schema attribute with the record schema as inferred by Nifi (Available 1.26+ version ) as follows:
The CSVReader for this processor will use infer schema strategy.
once you pass your CSV input through this processor you will have new flowfile attribute avro.schema with the following value
As Expected some of the values like name are assigned int type which we will take care of in the next step.
2- Use UpdateAttribute to replace any int type with string type inside the avro.schema attribute as follows:
The Expression Language used to re set the avro.schme attribute:
${avro.schema:replace("int","string")}
3- Use the QueryRecord with a different CSVReader from step 1 where this one uses the "use Schema Text Property" . Notice how by default the Shema Text property is set to the avro.schema attribute, which we generated from step 1&2:
Also make sure you set the same strategy for the CSVRecordWriter to ensure that the read and written CSV will be in the desired format.
Hope that helps. If it does, please accept the solution.
Thanks
Created 12-19-2024 10:14 PM
Hi @SAMSAL ,
Thanks this did work.
Created 12-20-2024 05:49 AM
Hi @phadkev ,
Im glad its working for you. Something I forgot to mention is that this will perform OK if your CSV dataset is small in size. Im not sure how inferring schema works but I imagine it does full scan on data to determine the appropriate type and if this the case this operation will be costly with large data. That is why its always recommend to know your schema and pass it when working with records.
Having said this , I was concerned about the performance of my suggestion above so I did more research and found that actually there is an easier way to convert everything to string in much faster way where you don't need the extra processors. The suggestion is based on this post.
Basically, all you have to do in the CSVReader is to set the Access Schema Strategy to "Use String Fields From Header".
The CSVRecordWriter Access Schema Strategy should be set to "Inherit Record Schema"
That should do it. give it try and see how it goes.
Thanks
Created 12-20-2024 07:26 AM
@SAMSAL , This is indeed more efficient. Thanks for letting me know.