Support Questions

phadkev · ‎12-18-2024

I am using CSVReader and CSVRecordSetWriter using the Infer schema setting. But when I have values such as "030", even though the all the values has double quotes enclosed. When I write the schema into avro.schema, I see it is considered as type 'int'. But I want to treat it as string. Because of this our output after the processor looks like 30 and the first 0 is omitted. I want to use Infer schema property only because, I want to read the csv files dynamically without hardcoding the schema and use that.

SAMSAL · ‎12-19-2024

Hi @phadkev ,

Since you dont care if all the values come as string , then I would suggest doing the following:

1- Use the ExtractRecordSchema processor to generate the avro.schema attribute with the record schema as inferred by Nifi (Available 1.26+ version ) as follows:

The CSVReader for this processor will use infer schema strategy.

once you pass your CSV input through this processor you will have new flowfile attribute avro.schema with the following value

As Expected some of the values like name are assigned int type which we will take care of in the next step.

2- Use UpdateAttribute to replace any int type with string type inside the avro.schema attribute as follows:

The Expression Language used to re set the avro.schme attribute:

${avro.schema:replace("int","string")}

3- Use the QueryRecord with a different CSVReader from step 1 where this one uses the "use Schema Text Property" . Notice how by default the Shema Text property is set to the avro.schema attribute, which we generated from step 1&2:

Also make sure you set the same strategy for the CSVRecordWriter to ensure that the read and written CSV will be in the desired format.

Hope that helps. If it does, please accept the solution.

Thanks

View solution in original post

SAMSAL · ‎12-20-2024

Hi @phadkev ,

Im glad its working for you. Something I forgot to mention is that this will perform OK if your CSV dataset is small in size. Im not sure how inferring schema works but I imagine it does full scan on data to determine the appropriate type and if this the case this operation will be costly with large data. That is why its always recommend to know your schema and pass it when working with records.

Having said this , I was concerned about the performance of my suggestion above so I did more research and found that actually there is an easier way to convert everything to string in much faster way where you don't need the extra processors. The suggestion is based on this post.

Basically, all you have to do in the CSVReader is to set the Access Schema Strategy to "Use String Fields From Header".

The CSVRecordWriter Access Schema Strategy should be set to "Inherit Record Schema"

That should do it. give it try and see how it goes.

Thanks

View solution in original post

SAMSAL · ‎12-18-2024

Hi @phadkev ,

I'm not sure if there is a way around this without using Schema. You mentioned that you want to read the CSV files dynamically but does that mean you dont know what kind of CSV data you will get ? if so how are you parsing the data after conversion? what processor are you using the CSVReader & Write in ? maybe you can elaborate more on what you are trying to do with the data beginning to end to see if we can help in another way.

phadkev · ‎12-18-2024

Hi @SAMSAL ,

I'm using the CSV readers and writers in Query Record processor and CSVtoJSON Record Converter processor.

I am expecting around 10 types of CSV files(I mean to say, files with different columns). But this number might increase in the future upto 50. Even though I know what will these 50 files exactly contain, I'll still have to write all the 50 schemas in a schema registry. Which is complex to maintain and work with.

So I thought of this method in which I'll append all the values in a csv file with double quotes such that the reader and writer will consider it as a string. But it still continues to read it as an integer

Input File:

col1,col2,col3,col4,col5
999,C10,100,010,0
999,C06,10,010,0

This is what the CSV writer is giving:

avro.schema

{"type":"record","name":"nifiRecord","namespace":"org.apache.nifi","fields":[{"name":"col1","type":["int","null"]},{"name":"col2","type":["int","null"]},{"name":"col3","type":["string","null"]},{"name":"col4","type":["int","null"]},{"name":"col5","type":["int","null"]}]}

SAMSAL · ‎12-19-2024

Hi @phadkev ,

Since you dont care if all the values come as string , then I would suggest doing the following:

1- Use the ExtractRecordSchema processor to generate the avro.schema attribute with the record schema as inferred by Nifi (Available 1.26+ version ) as follows:

The CSVReader for this processor will use infer schema strategy.

once you pass your CSV input through this processor you will have new flowfile attribute avro.schema with the following value

As Expected some of the values like name are assigned int type which we will take care of in the next step.

2- Use UpdateAttribute to replace any int type with string type inside the avro.schema attribute as follows:

The Expression Language used to re set the avro.schme attribute:

${avro.schema:replace("int","string")}

3- Use the QueryRecord with a different CSVReader from step 1 where this one uses the "use Schema Text Property" . Notice how by default the Shema Text property is set to the avro.schema attribute, which we generated from step 1&2:

Also make sure you set the same strategy for the CSVRecordWriter to ensure that the read and written CSV will be in the desired format.

Hope that helps. If it does, please accept the solution.

Thanks

phadkev · ‎12-19-2024

Hi @SAMSAL ,

Thanks this did work.

SAMSAL · ‎12-20-2024

Hi @phadkev ,

Im glad its working for you. Something I forgot to mention is that this will perform OK if your CSV dataset is small in size. Im not sure how inferring schema works but I imagine it does full scan on data to determine the appropriate type and if this the case this operation will be costly with large data. That is why its always recommend to know your schema and pass it when working with records.

Having said this , I was concerned about the performance of my suggestion above so I did more research and found that actually there is an easier way to convert everything to string in much faster way where you don't need the extra processors. The suggestion is based on this post.

Basically, all you have to do in the CSVReader is to set the Access Schema Strategy to "Use String Fields From Header".

The CSVRecordWriter Access Schema Strategy should be set to "Inherit Record Schema"

That should do it. give it try and see how it goes.

Thanks

phadkev · ‎12-20-2024

@SAMSAL , This is indeed more efficient. Thanks for letting me know.

Cloudera Community

Support Questions

CSVReader and CSVRecordSetWriter doesn't consider interger values present in double quotes as string