Support Questions

Find answers, ask questions, and share your expertise

Problem when trying to convert a Parquet file

New Contributor

Hello,
I'm trying to read a Parquet file using the ConvertRecord processor, and I'm getting this error:

ConvertRecord[id=e599cd8f-9a1d-3134-4daf-af7bc91cdd57] Failed to process FlowFile[filename=objectTable_tract_5074_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_31_20220314T212509Z-part0_output.parquet]; will route to failure: org.apache.avro.SchemaParseException: Illegal initial character: 0

In my file the column names are numeric, and the first one starts with 0 (zero).
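
For reference, the same restriction can be reproduced outside NiFi with the Avro Python library (a minimal sketch; assumes the avro pip package is installed, the record name is made up, and the exact exception message differs between the Java and Python implementations):

```python
# Minimal reproduction of the Avro field-naming restriction
# (hypothetical record name; the "avro" pip package is assumed).
import json
import avro.schema

schema = {
    "type": "record",
    "name": "objectTable",
    "fields": [{"name": "0", "type": "double"}],  # column named "0", as in the Parquet file
}

try:
    avro.schema.parse(json.dumps(schema))
except Exception as exc:
    # Raises a SchemaParseException because field names must start with [A-Za-z_]
    print(type(exc).__name__, exc)
```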


7 REPLIES

Community Manager

@alecssander Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our NiFi experts @MattWho and @SAMSAL, who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator



Super Guru

Hi,

Can you provide more explanation/screenshots of your dataflow and of the configuration set on each processor/controller service? Also, if you can provide sample data that can be converted to Parquet and used to reproduce the error, that would be helpful as well.

Thanks

New Contributor

The process is simple: I take a Parquet file from a bucket and try to insert it into a PostgreSQL database:
[screenshots: dataflow and processor configuration]

My file has 301 columns, named 0 through 300, and more than 280 rows: [screenshot: table preview]
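
For anyone who wants to reproduce this, a file with the same shape can be generated with pandas and pyarrow (a sketch with random values; the row count and file name are just examples):

```python
# Generate a sample Parquet file with numeric column names "0".."300"
# (random values, 280 rows as an example; pandas and pyarrow assumed).
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.rand(280, 301),
    columns=[str(i) for i in range(301)],
)
df.to_parquet("sample_numeric_columns.parquet", engine="pyarrow")
```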


Super Guru

It seems that the Parquet reader/writer services rely on an Avro schema internally, presumably to make sense of the data when passing it along to downstream processors (like PutDatabaseRecord), since Parquet is a binary format. The problem is that Avro restricts how fields can be named: field names must start with one of the characters [A-Za-z_]. This has actually been reported as a bug in Jira, but it doesn't appear to have been resolved. Given that, you'll need a workaround, since NiFi doesn't provide a solution out of the box. You can check my answer to this post as an option: basically, you use Python to read the Parquet content and convert it to another format (CSV, for example), then pass the CSV to PutDatabaseRecord. This should work, as I have tested it. Since you seem to be using NiFi 2.0, you can develop a Python extension processor for this instead of the ExecuteStreamCommand approach mentioned in that post.
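
For illustration, a minimal sketch of such a conversion script, written to read the FlowFile content from stdin and write CSV to stdout the way an ExecuteStreamCommand setup would expect (pandas and pyarrow assumed; the col_ rename is an extra safeguard I'm adding here, not part of the original suggestion, and the target table's column names would need to match):

```python
# Read Parquet from stdin, make column names Avro-safe, write CSV to stdout.
# The "col_" prefix is an illustrative rename scheme, not part of the
# original suggestion; pandas and pyarrow are assumed to be installed.
import io
import re
import sys

import pandas as pd

def avro_safe(name: str) -> str:
    # Avro field names must start with a letter or underscore.
    return name if re.match(r"[A-Za-z_]", name) else f"col_{name}"

# Parquet needs a seekable source, so buffer stdin fully first.
data = io.BytesIO(sys.stdin.buffer.read())
df = pd.read_parquet(data)
df.columns = [avro_safe(str(c)) for c in df.columns]
df.to_csv(sys.stdout, index=False)
```

In NiFi 2.x, the same logic could instead live inside a Python extension processor, so it runs natively in the flow rather than through ExecuteStreamCommand.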

Hope that helps. If it does, please accept the solution.

Thanks


New Contributor

Thanks for the support

Super Guru

Sure. If you come up with a solution different from what I suggested, please do post about it so it can help others who run into a similar situation. Good luck!

Community Manager

@alecssander Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution; it will make it easier for others to find the answer in the future. Thanks.


Regards,

Diana Torres,
Community Moderator

