Support Questions

How to read JSON from S3 and then edit it using NiFi

Explorer

I am using NiFi 1.6.0. I am trying to copy an S3 file into Redshift. The JSON file on S3 looks like this:
[
{"a":1,"b":2},
{"a":3,"b":4}
]
However, this gives an error because of the '[' and ']' in the file (https://stackoverflow.com/a/45348425).

I need to convert the json from above format to a format which looks like this:

{"a":1,"b":2}

{"a":3,"b":4}

So basically what I am trying to do is to remove '[' and ']' and replace '},' with '}'.

The file has over 14 million rows (30 MB in size).

How can I achieve this using Nifi?
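For reference, the desired transformation (a JSON array rewritten as newline-delimited JSON) can be sketched in plain Python; the file paths are hypothetical, and this assumes the file fits in memory:

```python
import json

def array_to_ndjson(src_path, dst_path):
    """Rewrite a JSON array file as newline-delimited JSON (NDJSON),
    one object per line, which is the shape Redshift's COPY expects."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for record in json.load(src):
            dst.write(json.dumps(record) + "\n")
```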

1 ACCEPTED SOLUTION

Master Collaborator

@baymax277 ,

 

Try this:

araujo_0-1655951401398.png

This is the regex:

(?s)(^\[|\]$)

You may have to increase the buffer size to accommodate your file.
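For anyone who wants to check the substitutions outside NiFi, here is a small Python sketch of the same two steps (NiFi's ReplaceText uses Java regex, but this pattern behaves the same in both engines; the sample content is made up):

```python
import re

content = '[\n{"a":1,"b":2},\n{"a":3,"b":4}\n]'

# Strip the leading '[' and trailing ']' of the whole file.
# (?s) makes the engine treat the content as one string, so
# ^ and $ anchor to the start and end of the entire file.
no_brackets = re.sub(r'(?s)(^\[|\]$)', '', content)

# Then replace the '},' record separators with '}' so each
# object stands on its own line (newline-delimited JSON).
ndjson = no_brackets.replace('},', '}').strip()
print(ndjson)
```

Note the `'},'` replacement is only safe when records do not contain nested objects; for this flat two-field structure it works.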

 

Cheers,

André

--
Was your question answered? Please take some time to click on "Accept as Solution" below this post.
If you find a reply useful, say thanks by clicking on the thumbs up button.


11 REPLIES

Master Collaborator

@baymax277 ,

 

The problem with your file is not the square brackets ([ ]). The issue is that single quotes are not valid JSON string delimiters. The correct JSON would be:

[
{"a":1,"b":2},
{"a":3,"b":4}
]

 

The content above is processed correctly by NiFi if you use record-based processors (for example, QueryRecord).
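To illustrate the point: with double quotes, the bracketed array is perfectly valid JSON and any standard parser accepts it, brackets and all. A quick check in Python (sample data made up to match the thread):

```python
import json

valid = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'      # double quotes: valid JSON
invalid = "[{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]"    # single quotes: not JSON

print(json.loads(valid))         # parses fine, including the [ ]

try:
    json.loads(invalid)
except json.JSONDecodeError as err:
    print("rejected:", err)      # the quoting is the real problem
```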

 

Ideally your source data should be fixed to provide a valid JSON. For example, if those files are being generated by a Python script like this:

my_object = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
print(my_object)  # prints the Python repr with single quotes, which is not valid JSON

The Python script should be updated to do this instead:

import json
my_object = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
print(json.dumps(my_object))  # prints valid JSON with double quotes

 

If you cannot have the source data fixed, you can try to replace the single quotes with double quotes using a ReplaceText processor with the following configuration.

 

araujo_0-1655941522722.png

 

This is a simplistic approach, though, and can cause problems with more complex JSON objects, for example when string values themselves contain quote characters.
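The failure mode is easy to demonstrate: a blanket quote substitution also rewrites apostrophes inside string values. A minimal sketch with made-up data:

```python
import json

# A naive single-to-double-quote replacement, and the kind of
# input that breaks it: an apostrophe inside a string value.
safe = "{'a': 1, 'b': 2}"
tricky = "{'note': 'don't'}"

print(safe.replace("'", '"'))    # prints {"a": 1, "b": 2} - fine
print(tricky.replace("'", '"'))  # prints {"note": "don"t"} - now broken

json.loads(safe.replace("'", '"'))    # parses
# json.loads(tricky.replace("'", '"'))  # would raise JSONDecodeError
```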

 

Cheers,

André

 


Explorer

@araujo 
Sorry, my bad. I made a mistake in the JSON representation in the question. The records actually use double quotes ("). I have updated the question.

Master Collaborator

@baymax277 ,

 

Could you share a screenshot of the part of your flow that reads and processes the JSON content?

 

André


Explorer

@araujo 
The JSON files are compressed and placed in an S3 bucket. Here's the flow I created to fetch the S3 file and decompress it.

baymax277_0-1655945817170.png

 

Master Collaborator

@baymax277 ,

 

Could you also share the screenshot of the part of the flow where the error is happening? I think that's after the Decompress File step, right?

 

André

 


Explorer

@araujo , Right now there is no error in the NiFi flow. But when I tried to execute a COPY command (from a SQL query editor) on the file, it gave me the error "Invalid JSONPath format: Member is not an object." When I searched for this error, I found this: https://stackoverflow.com/a/45348425

 

This is why I wanted to convert the file.

Master Collaborator

@baymax277 

 

OK. Can you share a screenshot of the part of the flow that inserts the data into your database?

 

Cheers,

André

 


Explorer

I have not created that part of the flow yet, but it will be an ExecuteStreamCommand processor that runs a psql command with the COPY SQL query. It will look something like this:

baymax277_0-1655948726669.png

 

Master Collaborator

@baymax277 ,

 

You should look into using PutDatabaseRecord instead of ExecuteStreamCommand. It would be more efficient, and it would eliminate the need to manually remove those brackets.

 

Cheers,

André

 


Explorer

@araujo 

I will try this solution, but I think the queue to the PutDatabaseRecord processor will get clogged very quickly, because the file has over 14 million rows.

I tried ReplaceText and was able to replace '},' with '}' within seconds. Can you tell me how to remove the first line ('[') and the last line (']')?

Thanks

Master Collaborator

@baymax277 ,

 

Try this:

araujo_0-1655951401398.png

This is the regex:

(?s)(^\[|\]$)

You may have to increase the buffer size to accommodate your file.

 

Cheers,

André
