ParquetReader incorrectly reading arrays

New Contributor

I'm working with NiFi to grab Parquet files from an S3 bucket, but when I read in the Parquet files, the arrays in the data end up in the following format:

[
    {
        "id": 1,
        "name": "John",
        "address": {
            "street": "Main St",
            "city": "New York"
        },
        "hobbies": [
            {
                "element": "coding"
            },
            {
                "element": "music"
            }
        ],
        "greetings": [
            {
                "element": {
                    "intro": "hello",
                    "end": "bye"
                }
            },
            {
                "element": {
                    "intro": "hola",
                    "end": "adios"
                }
            }
        ],
        "gender": [
            {
                "element": "M"
            }
        ],
        "record_id": [
            {
                "element": "2a2c6c86947719eacc1742adf1d6f2c7"
            }
        ]
    }
]
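
From what I can tell, this matches Parquet's standard three-level list encoding, where every list value is physically stored inside a repeated group with a child field named "element", and that physical name is leaking into the records. A small pyarrow sketch (pyarrow is not part of my flow, just the quickest way I know to show the physical schema) illustrates where the name comes from:

import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny file with a list column shaped like my data.
table = pa.table({
    "id": [1],
    "hobbies": [["coding", "music"]],
})
pq.write_table(table, "/tmp/example.parquet")

# The physical Parquet schema stores each list value inside a repeated
# group whose child field is named "element" -- the same name showing
# up in the records above.
print(pq.ParquetFile("/tmp/example.parquet").schema)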

Instead of the desired format:

[
    {
        "id": 1,
        "name": "John",
        "address": {
            "street": "Main St",
            "city": "New York"
        },
        "hobbies": [
            "coding",
            "music"
        ],
        "greetings": [
            {
                "intro": "hello",
                "end": "bye"
            },
            {
                "intro": "hola",
                "end": "adios"
            }
        ],
        "gender": [
            "M"
        ],
        "record_id": [
            "2a2c6c86947719eacc1742adf1d6f2c7"
        ]
    }
]


The downstream processes cannot be changed and cannot handle arrays whose entries are wrapped in these single-key "element" maps.
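
To make the needed change concrete, it is essentially this recursive unwrap (a minimal plain-Python sketch of the transformation, not something running in my flow):

def unwrap(value):
    # Strip the single-key {"element": ...} wrappers that the list
    # encoding introduces around every array entry, at any depth.
    if isinstance(value, list):
        return [unwrap(v) for v in value]
    if isinstance(value, dict):
        if set(value) == {"element"}:
            return unwrap(value["element"])
        return {k: unwrap(v) for k, v in value.items()}
    return value

record = {"hobbies": [{"element": "coding"}, {"element": "music"}]}
assert unwrap(record) == {"hobbies": ["coding", "music"]}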

When I try to use a ConvertRecord processor to write the records out with a ParquetRecordSetWriter to get the arrays formatted correctly, I get the following error:

(screenshot: schema_error.png)

There are a variety of fields in the data that are arrays, so it's not feasible to specify handling for each array field individually. Is there some schema handling I can do with ConvertRecord to avoid this error? It seems like it's writing the data out in the correct format but then running into the schema conflict because of it. Alternatively, is there a better way to handle nested data coming from Parquet files?
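
For what it's worth, reading the same file directly with pyarrow returns plain lists, which makes me think the wrappers exist only in the physical layout and the reader is surfacing them (the file path below is hypothetical):

import pyarrow.parquet as pq

# pyarrow resolves the three-level list encoding on read, so values
# come back as ordinary lists rather than {"element": ...} maps.
table = pq.read_table("downloaded/data.parquet")  # hypothetical path
print(table.to_pylist()[0]["hobbies"])  # expect: ['coding', 'music']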


Community Manager

@birdy Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our NiFi experts @SAMSAL and @MattWho, who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.