Thanks for your answer, SAMSAL.
I was hoping to be able to use a processor directly to add my schema, but if that's not possible, I'll use a script.
As well as renaming several columns, I also need to change the type of some of them: some are of type "large_string" and one is of type "bool". For example, I got this error when I tried to add the schema (retrieved from my Parquet file with Python code) to the ConvertRecord processor:
'schema-text' validated against '{
"type": "record",
"name": "de_train",
"fields": [
{
"name": "cell_type",
"type": "string"
},
{
"name": "sm_name",
"type": "string"
},
{
"name": "sm_lincs_id",
"type": "string"
},
{
"name": "SMILES",
"type": "string"
},
{
"name": "control",
"type": "bool"
},
{
"name": "A1BG",
"type": "double"
},
{
"name": "A1BG_AS1",
"type": "double"
},
{
"name": "A2M",
"type": "double"
},
{
"name": "A2M_AS1",
"type": "double"
},
{
"name": "A2MP1",
"type": "double"
}
]
}' is invalid because Not a valid Avro Schema: "bool" is not a defined name. The type of the "control" field must be a defined name or a {"type": ...} expression.
I had to change "large_string" to "string" and "bool" to "boolean" to stop getting errors in the AvroSchemaRegistry.
So how do I change the types in a Parquet file? Is it possible to do this from the dataframe, as it is for the names?