Member since: 03-12-2024
Posts: 2
Kudos Received: 2
Solutions: 0
04-10-2024
11:58 PM
1 Kudo
I have a URL endpoint that returns a JSON file a couple of `GB` in size. Unfortunately, the API does not support pagination, which would be the normal approach to my problem. In Python, I can use the ijson library to split the JSON while it is still being received and write each element to my hard drive. This is very memory efficient, and it lets me run the download asynchronously and start transforming results while the rest of the data is still loading.

    import ijson
    import json
    from urllib.request import urlopen

    # url is the endpoint that returns one large top-level JSON array
    response = urlopen(url)
    # 'item' selects each element of the top-level array;
    # use_float=True yields floats instead of Decimal so json.dumps can serialize them
    objects = ijson.items(response, 'item', use_float=True)
    for i, r in enumerate(objects):
        print(i)
        # use a separate name for the output file so the response handle is not shadowed
        with open(f'/tmp/streamwriter/{i}.json', 'w') as out:
            out.write(json.dumps(r))

Now I want to do the same in NiFi. Is there a processor that can do this? As I understand the InvokeHTTP processor so far, it has to receive the full payload before it sends the FlowFile downstream.

----

Reference: I asked the same question on Stack Overflow. Since I did not receive an answer there, I am trying this forum.
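For readers without ijson, the same streaming split can be sketched with the standard library alone. This is a rough illustration, not the ijson implementation: the function names `iter_array_items` and `split_to_files` are made up here, and it assumes the payload is a single UTF-8 encoded top-level JSON array. It reads the stream in chunks and uses `json.JSONDecoder.raw_decode` to pull one element at a time out of a buffer, so only the current chunk and element are held in memory.

```python
import codecs
import io
import json
from pathlib import Path
from typing import Any, BinaryIO, Iterator

_WHITESPACE = " \t\r\n"


def iter_array_items(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[Any]:
    """Yield the elements of a top-level JSON array without loading it whole."""
    decoder = json.JSONDecoder()
    to_text = codecs.getincrementaldecoder("utf-8")()  # copes with split multibyte chars
    buf, pos = "", 0
    started = eof = False

    def skip_ws(i: int) -> int:
        while i < len(buf) and buf[i] in _WHITESPACE:
            i += 1
        return i

    while True:
        pos = skip_ws(pos)
        if pos < len(buf):
            ch = buf[pos]
            if not started:
                if ch != "[":
                    raise ValueError("expected a top-level JSON array")
                started, pos = True, pos + 1
                continue
            if ch == "]":
                return
            if ch == ",":  # lenient: separators are skipped, not validated
                pos += 1
                continue
            try:
                item, end = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                if eof:
                    raise
            else:
                # Accept the value only once its terminator is visible; otherwise a
                # number split across two chunks would be cut short.
                nxt = skip_ws(end)
                if nxt < len(buf) and buf[nxt] in ",]":
                    yield item
                    pos = nxt
                    continue
                if eof:
                    raise ValueError("truncated or malformed JSON array")
        elif eof:
            raise ValueError("stream ended before the closing ']'")
        # The current element is incomplete: drop consumed text and read more bytes.
        chunk = stream.read(chunk_size)
        eof = not chunk
        buf = buf[pos:] + to_text.decode(chunk, final=eof)
        pos = 0


def split_to_files(stream: BinaryIO, out_dir: str) -> int:
    """Write each array element to <out_dir>/<index>.json; return the element count."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    count = 0
    for count, item in enumerate(iter_array_items(stream), start=1):
        (target / f"{count - 1}.json").write_text(json.dumps(item), encoding="utf-8")
    return count


if __name__ == "__main__":
    # Tiny in-memory demo; a 4-byte chunk size forces elements to span several reads.
    sample = io.BytesIO(b'[{"id": 1}, {"id": 2, "tags": ["a", "b"]}, 3.5]')
    for element in iter_array_items(sample, chunk_size=4):
        print(element)
```

The object returned by `urlopen(url)` exposes `read(n)`, so it can be passed as `stream` in place of the in-memory sample. Because an incomplete element is parsed again after every read, this sketch gets slow when a single element is very large; ijson's event-based parser does not have that limitation.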
Labels: Apache NiFi
04-10-2024
11:49 PM
1 Kudo
I use the PutIceberg processor to write data to my data lake. For that I need to specify a HiveCatalogService, which in turn needs Hadoop Configuration Resources. This property is a path to an XML file containing the credentials for the S3 storage where the Iceberg files are stored. My problem is that some of the content of this file is supposed to be kept secret from the users interacting with the NiFi UI. However, as soon as a UI user knows this path, they can simply use the ExecuteProcess processor to read that information. Is there any way to keep this information safe? Reference: HiveCatalogService, PutIceberg
Labels: Apache Hive, Apache Iceberg, Apache NiFi