Created on 11-26-2024 05:17 AM - edited 11-26-2024 05:20 AM
Hello,
I'm trying to read a parquet file using the ConvertRecord processor and I'm getting this error:
ConvertRecord[id=e599cd8f-9a1d-3134-4daf-af7bc91cdd57] Failed to process FlowFile[filename=objectTable_tract_5074_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_31_20220314T212509Z-part0_output.parquet]; will route to failure: org.apache.avro.SchemaParseException: Illegal initial character: 0
In my file the columns are numeric and the first one starts with 0 (zero).
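That column name is exactly what triggers the exception: Avro requires a field name to match `[A-Za-z_][A-Za-z0-9_]*`, so a column literally named `0` fails at schema-parse time. A minimal illustration of that rule (the regex mirrors the Avro specification; this is not NiFi's actual validation code):

```python
import re

# Avro field names must start with a letter or underscore,
# followed only by letters, digits, or underscores.
AVRO_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_valid_avro_name(name: str) -> bool:
    """Return True if `name` is a legal Avro record field name."""
    return bool(AVRO_NAME.match(name))

print(is_valid_avro_name("0"))      # False: purely numeric names are rejected
print(is_valid_avro_name("col_0"))  # True: a letter/underscore prefix fixes it
```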
Created 11-26-2024 07:11 AM
@alecssander Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our NiFi experts @MattWho @SAMSAL who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres
Created 11-26-2024 10:52 AM
Hi,
Can you provide more explanation/screenshots of your dataflow and the configuration of each processor/controller service? Also, if you can share sample data that can be converted to parquet and reproduces the error, that would be helpful as well.
Thanks
Created on 11-26-2024 12:36 PM - edited 11-26-2024 12:37 PM
The process is simple: I take a parquet file from a bucket and try to insert it into a PostgreSQL database:
My file has 301 columns, named 0 through 300, with more than 280 rows:
Created on 11-26-2024 02:03 PM - edited 11-26-2024 02:04 PM
It seems that the parquet reader/writer services rely on an Avro schema, presumably to make sense of the data when passing it along to downstream processors (like PutDatabaseRecord), since parquet is a binary format. The problem is that Avro restricts how fields can be named: a field name must begin with a letter or underscore ([A-Za-z_]). This is actually reported as a bug in Jira, but it doesn't seem to have been resolved. Given that, you'll need a workaround, since NiFi doesn't provide a solution out of the box. You can check my answer to this post as an option: basically, you use Python to read the parquet content and convert it to another format (CSV, for example), then pass the CSV to PutDatabaseRecord. This should work, as I have tested it. Since you appear to be using NiFi 2.0, you can develop a Python extension processor for this instead of the ExecuteStreamCommand approach mentioned in that post.
Hope that helps. If it does, please accept the solution.
Thanks
Created 11-27-2024 09:20 AM
Thanks for the support
Created 11-27-2024 11:12 AM
Sure. If you come up with a solution different from what I suggested, please do post about it so it can help others who might run into a similar situation. Good luck!