Created on 10-18-2024 05:20 AM - edited 10-18-2024 05:37 AM
Hello!
The configuration of my SplitText is:
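Roughly these properties (assuming defaults elsewhere; I want one record per split file, with the header line repeated in each):

Line Split Count: 1
Header Line Count: 1
Remove Trailing Newlines: true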
The task is to split one CSV file:

id;description
"1234";"The latitude is 12324.24"
"2345";"12324.24 this value"

into 2 files:

id;description
"1234";"The latitude is 12324.24"

and

id;description
"2345";"12324.24 this value"
But it produces more than 10,000 duplicated files!
What am I doing wrong?
Created 10-18-2024 10:43 AM
Hi @AndreyDE ,
What's your input into the SplitText processor?
I used your example and I'm getting valid output.
Make sure the flow going into SplitText is not re-reading the same file over and over. Also, if you are using GenerateFlowFile, make sure its Run Schedule isn't set to 0 sec, because that will keep emitting flowfiles continuously.
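For example, under GenerateFlowFile's Scheduling tab (illustrative values, not from your flow), compare:

Run Schedule: 0 sec (emits a new flowfile on every scheduler tick, flooding downstream)
Run Schedule: 60 sec (emits at most one flowfile per minute)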
Please accept this solution if it's correct, thanks!
Created 10-19-2024 12:42 PM
My SplitText processor is followed by a ValidateRecord processor.
ValidateRecord uses a CSVReader with the following configuration:
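The relevant reader properties for a semicolon-delimited file with a header are along these lines (exact values assumed):

Schema Access Strategy: Use 'Schema Text' Property
Schema Text: (the Avro schema below)
Value Separator: ;
Treat First Line as Header: true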
The input schema is Avro:

{
  "type": "record",
  "name": "geo_data",
  "fields": [
    { "name": "id", "type": [ "int", "null" ] },
    { "name": "description", "type": [ "string", "null" ] }
  ]
}
And the source of this pipeline is the ListS3 and FetchS3 processors.
Created 10-19-2024 06:46 PM
@AndreyDE
Is a single flowfile going into the SplitText processor and producing 10000 flowfiles?
How big is the flowfile going into the SplitText processor?
Or is the source of the pipeline recursively getting all objects in your S3 bucket?
I need to know a little bit more about the input going into SplitText.
Created 10-20-2024 01:45 AM
@drewski7 wrote:
Is a single flowfile going into the SplitText processor and producing 10000 flowfiles?
Yes, one flowfile.
How big is the flowfile going into the SplitText processor?
About 30 KB
Or is the source of the pipeline recursively getting all objects in your S3 bucket?
Yes, it searches all objects recursively
Created 10-19-2024 07:49 PM
Hi @AndreyDE ,
The reason you are getting that many flowfiles is that you are continuously running the upstream processor that gets the CSV input on a 0 sec schedule. You seem to be new to NiFi, and it's a typical beginner mistake; we all have been there :). In earlier versions, the default Run Schedule on every processor was 0 secs, but in later releases the default was changed to 1 min to help avoid exactly this issue.

To fix it: if you are testing, I would stop the processor that generates/fetches the CSV input, and whenever you want to run a test, right-click it and select "Run Once". If you are planning to run the flow as a batch process where you expect a different file each time, then go to the processor configuration and, under the Scheduling tab, adjust the schedule by selecting either "Timer driven" or "CRON driven". For more info on scheduling, please refer to the following:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#scheduling-tab
https://www.youtube.com/watch?v=pZq0EbfDBy4
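For example, to run the fetch once an hour as a batch (illustrative values, adjust to your cadence), under the Scheduling tab set:

Scheduling Strategy: CRON driven
Run Schedule: 0 0 * * * ?

or, for a simple fixed interval:

Scheduling Strategy: Timer driven
Run Schedule: 1 hour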
Hope that helps.