Created on 10-18-2024 05:20 AM - edited 10-18-2024 05:37 AM
Hello!
The configuration of my SplitText is:
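Roughly these properties (assuming defaults elsewhere; I want one record per split file, with the header line repeated in each):

Line Split Count: 1
Header Line Count: 1
Remove Trailing Newlines: true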
The task is to split one CSV file:

id;description
"1234";"The latitude is 12324.24"
"2345";"12324.24 this value"

into 2 files:

id;description
"1234";"The latitude is 12324.24"

and

id;description
"2345";"12324.24 this value"
But it produces more than 10,000 duplicated files!
What am I doing wrong?
Created 10-18-2024 10:43 AM
Hi @AndreyDE ,
What's your input into the SplitText processor?
I used your example and I'm getting valid output.
Make sure the flow going into SplitText is not re-reading the same file over and over. Also, if you are using GenerateFlowFile, make sure its Run Schedule isn't set to 0 sec, because that will keep emitting flowfiles continuously.
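For example, under GenerateFlowFile's Scheduling tab (illustrative values, not from your flow), compare:

Run Schedule: 0 sec (emits a new flowfile on every scheduler tick, flooding downstream)
Run Schedule: 60 sec (emits at most one flowfile per minute)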
Please accept this solution if it's correct, thanks!
Created 10-19-2024 12:42 PM
My SplitText processor is followed by a ValidateRecord processor.
ValidateRecord uses a CSVReader with the following configuration:
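The relevant reader properties for a semicolon-delimited file with a header are along these lines (exact values assumed):

Schema Access Strategy: Use 'Schema Text' Property
Schema Text: (the Avro schema below)
Value Separator: ;
Treat First Line as Header: true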
The input schema is Avro:

{
  "type": "record",
  "name": "geo_data",
  "fields": [
    { "name": "id", "type": [ "int", "null" ] },
    { "name": "description", "type": [ "string", "null" ] }
  ]
}
And the source of this pipeline is the ListS3 and FetchS3 processors.
Created 10-19-2024 06:46 PM
@AndreyDE
Is a single flowfile going into the SplitText processor and producing 10000 flowfiles?
How big is the flowfile going into the SplitText processor?
Or is the source of the pipeline recursively getting all objects in your S3 bucket?
I need to know a little bit more about the input going into SplitText.
Created 10-20-2024 01:45 AM
@drewski7 wrote:
Is a single flowfile going into the SplitText processor and producing 10000 flowfiles?
Yes, one flowfile.
How big is the flowfile going into the SplitText processor?
About 30 KB
Or is the source of the pipeline recursively getting all objects in your S3 bucket?
Yes, it searches all objects recursively
Created 10-19-2024 07:49 PM
Hi @AndreyDE ,
The reason you are getting that many flowfiles is that you are continuously running the upstream processor that gets the CSV input on a 0 sec schedule. You seem to be new to NiFi, and it's a typical beginner mistake; we all have been there :). In earlier versions, the default Run Schedule on every processor was 0 secs, but in later releases the default was changed to 1 min to help avoid exactly this issue.

To fix it: if you are testing, I would stop the processor that generates/fetches the CSV input, and whenever you want to run a test, right-click it and select "Run Once". If you are planning to run the flow as a batch process where you expect a different file each time, then go to the processor configuration and, under the Scheduling tab, adjust the schedule by selecting either "Timer driven" or "CRON driven". For more info on scheduling, please refer to the following:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#scheduling-tab
https://www.youtube.com/watch?v=pZq0EbfDBy4
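For example, to run the fetch once an hour as a batch (illustrative values, adjust to your cadence), under the Scheduling tab set:

Scheduling Strategy: CRON driven
Run Schedule: 0 0 * * * ?

or, for a simple fixed interval:

Scheduling Strategy: Timer driven
Run Schedule: 1 hour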
Hope that helps.