ParquetReader incorrectly reading arrays

birdy — Mon, 20 Jan 2025 20:22:18 GMT

I'm working with NiFi to grab Parquet files from an S3 bucket. But when I read in the Parquet files, the arrays in the data end up in the following format:

[ { "id": 1, "name": "John", "address": { "street": "Main St", "city": "New York" }, "hobbies": [ { "element": "coding" }, { "element": "music" } ], "greetings": [ { "element": { "intro": "hello", "end": "bye" } }, { "element": { "intro": "hola", "end": "adios" } } ], "gender": [ { "element": "M" } ], "record_id": [ { "element": "2a2c6c86947719eacc1742adf1d6f2c7" } ] } ]

Instead of the desired format:

[ { "id": 1, "name": "John", "address": { "street": "Main St", "city": "New York" }, "hobbies": [ "coding", "music" ], "greetings": [ { "intro": "hello", "end": "bye" }, { "intro": "hola", "end": "adios" } ], "gender": [ "M" ], "record_id": [ "2a2c6c86947719eacc1742adf1d6f2c7" ] } ]

The downstream processes cannot be changed and cannot handle the arrays with the repeated 1D maps.
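To make the reshaping concrete, here is a minimal Python sketch of the transformation I'm after. It is only an illustration of the intent, not something that runs inside NiFi. The function name `unwrap_elements` is hypothetical, and it assumes that any dict inside a list whose only key is "element" is one of these wrappers rather than real data.

```python
import json


def unwrap_elements(value):
    """Recursively replace lists of {"element": x} wrappers with lists of x."""
    if isinstance(value, dict):
        # Walk into every field of a record or map.
        return {key: unwrap_elements(item) for key, item in value.items()}
    if isinstance(value, list):
        unwrapped = []
        for item in value:
            # Assumption: a dict whose only key is "element" is a list wrapper,
            # not a genuine record that happens to have that single field.
            if isinstance(item, dict) and set(item) == {"element"}:
                item = item["element"]
            unwrapped.append(unwrap_elements(item))
        return unwrapped
    # Scalars (strings, numbers, booleans, None) pass through unchanged.
    return value


if __name__ == "__main__":
    record = {
        "id": 1,
        "hobbies": [{"element": "coding"}, {"element": "music"}],
        "greetings": [{"element": {"intro": "hello", "end": "bye"}}],
    }
    print(json.dumps(unwrap_elements(record)))
```

Because it walks the whole structure, it would not need a per-field configuration, which is the property I'm hoping to get from the NiFi side as well.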

When I try to use a ConvertRecord processor with a ParquetRecordSetWriter to write the records out with the arrays formatted correctly, I get the following error:

Many fields in the data are arrays, so it's not feasible to specify handling for each array field individually. Is there some schema handling I can do with ConvertRecord to avoid this error? It seems to be writing the data out in the correct format and then running into the schema conflict. Alternatively, is there a better way to handle nested data coming from Parquet files?

Re: ParquetReader incorrectly reading arrays

DianaTorres — Mon, 20 Jan 2025 22:21:17 GMT

@birdy Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our NiFi experts @SAMSAL @MattWho who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
