Created 03-07-2022 07:17 AM
Hello all,
I would like to know the best approach to changing the schema of a flowfile.
I have the following scenario:
I ingest a CSV file, and I would then like to rename the columns inferred from the header and write the data to a Parquet file. Additionally, I want to cast some columns from String to Integer or other types.
I think it may be possible to do this with the ConvertRecord processor, or maybe with the QueryRecord processor, but I am not sure whether that is the best approach.
Do you have any ideas?
If you have a few examples, that would be nice too.
Thanks
Created on 03-07-2022 12:47 PM - edited 03-07-2022 12:48 PM
I believe the best alternative for you would be to use a fixed schema rather than "Infer Schema".
Create a parameter containing a schema that specifies the exact structure and data types you want to use. Then configure your RecordReader: reference that parameter in the "Schema Text" property and set the Schema Strategy to "Use Schema Text".
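As an illustration of what such a schema might look like, here is a minimal Python sketch that builds an Avro record schema and prints it as JSON. The record name and the column names and types (customer_id, customer_name, signup_year) are hypothetical and not taken from this thread; replace them with your own.

```python
import json

# Hypothetical target columns: these names and types are illustrative only.
TARGET_FIELDS = [
    ("customer_id", "int"),
    ("customer_name", "string"),
    ("signup_year", "int"),
]


def build_avro_schema(record_name, fields):
    """Return an Avro record schema as a dict; every field is made nullable."""
    return {
        "type": "record",
        "name": record_name,
        "fields": [
            {"name": name, "type": ["null", avro_type]}
            for name, avro_type in fields
        ],
    }


schema = build_avro_schema("customer", TARGET_FIELDS)
schema_text = json.dumps(schema, indent=2)
print(schema_text)
```

The printed JSON is the kind of text that would go into the parameter. Whether the reader takes field names from the schema or from the CSV header depends on the reader configuration, so it is worth checking the CSVReader documentation for your NiFi version if you also want to rename columns this way.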
Cheers,
André
--
Did this response answer your question? If so, please take a moment to click on "Accept as Solution" below this post.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Created 03-08-2022 02:25 AM
Thanks for the answer.
Following this thread: https://community.cloudera.com/t5/Support-Questions/I-want-to-generate-Avro-Schema-from-CSV-file-usi...
I took a ConvertRecord processor, fed my CSV into it, inferred the schema with the CSV reader, and wrote the output with the CSV writer using the option "Set 'avro.schema' Attribute". This gave me an Avro schema without having to type everything.
I put this schema in the Schema Text property of the CSVReader and set the Schema Strategy to "Use Schema Text" as you mentioned, but it seems that this strategy does not allow me to convert a String to an Int. I get the error "Could not parse incoming Data", which goes away if I change the type in my schema from int to string.
I am not sure whether the ConvertRecord processor is able to convert a String from the CSV to an int based on the schema I defined. Maybe it is my understanding of Avro that is lacking.
Created 03-08-2022 02:37 AM
For ConvertRecord to be able to convert the values, they need to be valid. If you have a CSV field containing the string "123", it will be converted to the integer 123. If the same field contains the string "ABC", the conversion will fail.
If your data may contain invalid values, you can validate the records using the ValidateRecord processor and your schema, and route the invalid records to a different queue or flow branch.
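The idea of converting valid values and routing invalid ones elsewhere can be sketched in plain Python. This is only an illustration of the logic, not NiFi code: the function names and sample rows are hypothetical, and Python's int() is not guaranteed to accept exactly the same strings as NiFi's conversion.

```python
def to_int_or_none(value):
    """Return the integer for a numeric string, or None if it cannot be parsed."""
    try:
        return int(value.strip())
    except ValueError:
        return None


def split_valid_invalid(rows, int_column):
    """Split rows into those whose int_column converts cleanly and those that do not."""
    valid, invalid = [], []
    for row in rows:
        converted = to_int_or_none(row[int_column])
        if converted is None:
            invalid.append(row)  # left unchanged, like a separate "invalid" branch
        else:
            valid.append({**row, int_column: converted})
    return valid, invalid


# Hypothetical sample records mirroring the "123" and "ABC" cases above.
rows = [{"id": "123", "name": "a"}, {"id": "ABC", "name": "b"}]
valid, invalid = split_valid_invalid(rows, "id")
print(valid)
print(invalid)
```

Here the row with "123" ends up in the valid list with an integer id, and the row with "ABC" ends up unchanged in the invalid list.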
Regards,
André
Created 03-08-2022 02:51 AM
Well, I found the issue: I was the failing point 😁 I had not set the option "Treat First Line as Header" to true, so my header was treated as a data row and the data was not valid against the schema.
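The effect of that setting can be reproduced in a small Python sketch, assuming a hypothetical two-column CSV. When the first line is not skipped, the header text itself is passed to the integer conversion and the parse fails.

```python
import csv
import io

# Hypothetical CSV content; the column names and values are illustrative only.
CSV_TEXT = "id,name\n1,alice\n2,bob\n"


def parse_ids(text, first_line_is_header):
    """Parse the first column as integers, optionally skipping a header line."""
    reader = csv.reader(io.StringIO(text))
    if first_line_is_header:
        next(reader)  # discard the header row
    return [int(row[0]) for row in reader]


print(parse_ids(CSV_TEXT, first_line_is_header=True))

try:
    parse_ids(CSV_TEXT, first_line_is_header=False)
except ValueError as err:
    # The header value "id" is not a valid integer, so the conversion fails.
    print("parse failed:", err)
```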
Thanks for everything. This seems to be a good way to change a schema, so I'll close this topic.