Help with large text file split -- records vs flow files

Hello all,

I have a use case where we have many large .gz files, each containing a very large amount of unformatted log data, one entry per line.

As the picture below shows, the goal is to read all the files, decompress them, and send each line of the log text to a Kafka topic.
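To make the goal concrete outside of NiFi, here is a minimal sketch in plain Python of the behaviour I am after: stream each .gz file line by line without holding the whole file in memory. The names `iter_gzip_lines`, `forward_lines`, and the `send` callback are hypothetical; `send` stands in for whatever call actually publishes a message to Kafka, and skipping empty lines is an assumption on my part.

```python
import gzip
from typing import Callable, Iterable, Iterator


def iter_gzip_lines(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield the lines of a gzip file one at a time (streaming, not loaded whole)."""
    with gzip.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line:  # assumption: blank lines are not worth sending
                yield line


def forward_lines(paths: Iterable[str], send: Callable[[str], None]) -> int:
    """Pass every non-empty line to `send`, a hypothetical stand-in for a Kafka publish call."""
    count = 0
    for path in paths:
        for line in iter_gzip_lines(path):
            send(line)
            count += 1
    return count
```

For example, `forward_lines(["a.log.gz"], print)` would print each line of that file; in the real flow, `send` would be replaced by the producer call.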

The main problem is the sheer number of lines: using SplitText creates far too many flow files for the system to handle.

I have looked around for suggestions. One is to split in stages, breaking the text into progressively smaller chunks, but that adds too much processing overhead for our use case.

I also read a bit about record processors, but I have not been able to configure one properly, since technically the data has no schema. I tried to be a little clever and treat the text as CSV with "newline" as the delimiter, but I got an error saying that the '\n' character is not permitted as a delimiter in the CSV processor.

We currently use StreamSets Data Collector for this use case, but being able to use NiFi would help us down the road.

Does anyone know a way to use the record processors for this use case, and could you show how to set it up?

Thank you!

[Attached image: 54399-capture.png]
