
Using NiFi for iterative JSON parsing from a given HTTP stream

New Contributor

I have a URL endpoint that provides a JSON file with a size of a couple of GB. Unfortunately, the API does not support pagination, which would be the normal approach to my problem.

What I can do in Python is use the ijson library to split the JSON coming from the endpoint while it is being received and store the results to my hard drive. This is very memory efficient and lets me run it asynchronously and start transforming the results while data is still being loaded.

import ijson
import json
from urllib.request import urlopen

# url points at the endpoint that returns the multi-GB JSON document
response = urlopen(url)

# ijson yields one top-level array element at a time instead of
# loading the whole response into memory
objects = ijson.items(response, 'item', use_float=True)

for i, obj in enumerate(objects):
    print(i)
    # write each element to its own file so it can be processed independently
    with open(f'/tmp/streamwriter/{i}.json', 'w') as out:
        out.write(json.dumps(obj))


Now I want to do this in NiFi. Is there a processor that can do the same? The way I understand the InvokeHTTP processor so far, it has to receive the full payload before it sends the FlowFile downstream.
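For reference, if there is no built-in processor for this, one workaround I could imagine is keeping the incremental split in an external script and calling it from something like ExecuteStreamCommand, so that the output arrives as one JSON object per line and NiFi only ever deals with already-split records. This is just a sketch of that idea, not anything NiFi provides out of the box; passing the URL as a command-line argument and writing NDJSON to stdout are my own assumptions:

import sys
import json
import ijson
from urllib.request import urlopen

def stream_records(url):
    # open the endpoint and yield one top-level array element at a time
    response = urlopen(url)
    for obj in ijson.items(response, 'item', use_float=True):
        yield obj

if __name__ == '__main__':
    # the endpoint URL is passed as the first argument (hypothetical wiring)
    endpoint = sys.argv[1]
    for obj in stream_records(endpoint):
        # one JSON object per line (NDJSON) on stdout, so a downstream
        # processor can split on newlines without parsing the whole payload
        sys.stdout.write(json.dumps(obj) + '\n')

With something like that, I assume the newline-delimited output could then be split by SplitText or handled by a record-oriented processor downstream, but I do not know whether that is the idiomatic NiFi way, hence my question.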

----

Reference:

I asked the same question on Stack Overflow, but since I did not receive an answer there, I am trying this forum.

1 REPLY

Community Manager

@jnk32 Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our NiFi experts @joseomjr, @SAMSAL, and @mburgess, who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator

