Support Questions

awhite4844 · ‎07-07-2017

Hi. I have a CSV file and a libsvm file that I would like to read line by line and add into a dataflow. I have connected the TailFile processor to a ControlRate processor and then to a SplitText processor. I am not receiving errors, but nothing is moving. Does TailFile work on these file types? Thank you

MattWho · ‎07-07-2017

@adrian white

The TailFile processor is designed to tail a file and ingest new lines as they are written to that file. In your case you have static files that are not being written to. You will want to use the GetFile processor to ingest these complete files before using the SplitText processor to break them apart into one line per new FlowFile.

Thanks, Matt

awhite4844 · ‎07-11-2017

Hi @Matt Clarke

Thanks for the reply. Even when I use the GetFile processor, it seems to hang on the SplitText processor.

The data seems to queue before the SplitText processor. There is a header in the CSV file, so Header Line Count is set to 1 in the processor. Is there something else I am missing? The CSV file is 90 MB with 32 columns. Thanks

MattWho · ‎07-11-2017

@adrian white

At 90 MB, I suspect that CSV file has a lot of lines to split. Are you seeing any Out Of Memory errors in your nifi-app.log? To help reduce the heap usage here, you may want to try using two SplitText processors in series: the first splitting every 1,000 - 10,000 lines, and the second then splitting those into single lines.

NiFi FlowFile attributes are kept in heap memory space. NiFi has a mechanism for swapping FlowFile attributes to disk for queues, but this mechanism does not apply to processors. The SplitText processor holds the FlowFile attributes for every new FlowFile it is creating in heap until all resulting split FlowFiles have been created. When splitting creates a huge number of resulting FlowFiles in a single transaction, you can run out of heap space. So by splitting the job between multiple SplitText processors in series, you reduce the number of FlowFiles that are being generated per transaction, thus decreasing heap usage.
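To make the two-stage idea concrete, here is a rough Python sketch of the batching logic only. It is not NiFi code and does not model FlowFile attributes or transactions; the function names `split_lines` and `two_stage_split` and the parameter names are made up for illustration. It assumes each split carries a copy of the header lines, which is how the header count setting is being used in this thread.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_lines(lines: Iterable[str], header_count: int,
                lines_per_split: int) -> Iterator[List[str]]:
    """Yield splits of at most `lines_per_split` data lines.

    Each split is prefixed with a copy of the first `header_count` lines
    (assumption: the header is repeated in every split).
    """
    it = iter(lines)
    header = list(islice(it, header_count))
    while True:
        chunk = list(islice(it, lines_per_split))
        if not chunk:
            return
        yield header + chunk


def two_stage_split(lines: Iterable[str], header_count: int = 1,
                    coarse: int = 10_000) -> Iterator[List[str]]:
    """Split coarsely first, then split each coarse chunk into single lines.

    At any moment only one coarse chunk (at most `coarse` data lines) is
    being broken into single-line splits, instead of the whole file at once.
    """
    for chunk in split_lines(lines, header_count, coarse):
        # Second stage: one data line (plus header) per resulting split.
        yield from split_lines(chunk, header_count, 1)


if __name__ == "__main__":
    sample = ["id,value", "1,a", "2,b", "3,c"]
    for part in two_stage_split(sample, header_count=1, coarse=2):
        print(part)
```

The point of the sketch is the bound on work per step: the second stage never sees more than `coarse` data lines at a time, which mirrors why the second SplitText creates far fewer FlowFiles per transaction than a single processor splitting the entire 90 MB file.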

Thanks,

Matt

awhite4844 · ‎07-11-2017

Hi Matt.

Yes, that was the issue. Thank you for your help.

Cloudera Community

Problem Using TailFile on a csv file and libsvm file