
What is a good approach for splitting a 100 GB file into multiple files?


What is a good approach for splitting a 100 GB file into multiple files?

Super Collaborator

Hi ,

I have a 100 GB file that I will have to split into multiple (maybe 1,000) files, depending on the value of "Tagname" in the sample data below, and write them to HDFS.

Tagname,Timestamp,Value,Quality,QualityDetail,PercentGood

ABC04.PI_B04_EX01_A_STPDATTRG.F_CV,2/18/2015 1:03:32 AM,627,Good,NonSpecific,100
ABC04.PI_B04_EX01_A_STPDATTRG.F_CV,2/18/2015 1:03:33 AM,628,Good,NonSpecific,100
ABC05.X4_WET_MX_DDR.F_CV,2/18/2015 12:18:00 AM,12,Good,NonSpecific,100
ABC05.X4_WET_MX_DDR.F_CV,2/18/2015 12:18:01 AM,4,Good,NonSpecific,100
ABC04.PI_B04_FDR_A_STPDATTRG.F_CV,2/18/2015 1:04:19 AM,3979,Good,NonSpecific,100
ABC04.PI_B04_FDR_A_STPDATTRG.F_CV,2/18/2015 9:35:23 PM,4018,Good,NonSpecific,100
ABC04.PI_B04_FDR_A_STPDATTRG.F_CV,2/18/2015 9:35:24 PM,4019,Good,NonSpecific,100

In reality the "Tagname" will continue to be the same (maybe for 10K+ rows) until its value changes. I need to create one file for each tag.
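Outside NiFi, the end goal can be sketched as a plain-Python grouping pass (a minimal illustration only, using rows from the sample above; each dictionary key would become one output file per tag):

```python
from collections import defaultdict

# Group the sample rows by their first column (Tagname) so that
# each distinct tag ends up in its own bucket / output file.
rows = [
    "ABC04.PI_B04_EX01_A_STPDATTRG.F_CV,2/18/2015 1:03:32 AM,627,Good,NonSpecific,100",
    "ABC04.PI_B04_EX01_A_STPDATTRG.F_CV,2/18/2015 1:03:33 AM,628,Good,NonSpecific,100",
    "ABC05.X4_WET_MX_DDR.F_CV,2/18/2015 12:18:00 AM,12,Good,NonSpecific,100",
]

files = defaultdict(list)
for line in rows:
    tagname = line.split(",", 1)[0]  # first column is the Tagname
    files[tagname].append(line)

# Each key would become one file written to HDFS.
for tagname, lines in files.items():
    print(tagname, len(lines))
```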

Do I have to split the file into smaller files (maybe twenty 5 GB files) using SplitText? If I do that, will it split exactly at the end of lines? Do I have to read line by line using ExtractText, or is there a better approach?

Can I use ConvertCSVToAvro, then ConvertAvroToJSON, and then split the JSON by tag using SplitJson?

Do I have to change any default NiFi settings for this?

Regards,

Sai

Accepted Solution

Re: What is a good approach for splitting a 100 GB file into multiple files?

Master Guru
@Saikrishna Tarapareddy

I agree that you may still need to split your very large incoming FlowFile into smaller FlowFiles to better manage heap memory usage, but you should be able to use RouteText and ExtractText as follows to accomplish what you want:

8896-screen-shot-2016-10-26-at-30538-pm.png

RouteText configured as follows:

8897-screen-shot-2016-10-26-at-30630-pm.png

All Grouped lines will be routed to relationship "TagName" as a new FlowFile.

They feed into an ExtractText configured as follows:

8898-screen-shot-2016-10-26-at-30704-pm.png

This will extract the TagName as an attribute on the FlowFile, which you can then use as the Correlation Attribute Name in the MergeContent processor that follows.
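The grouping behavior can be roughly illustrated in plain Python (a sketch only, not NiFi itself; the regex is an assumed example that captures the first CSV column, which is approximately what a RouteText Grouping Regular Expression's capture group does):

```python
import re

# Assumed grouping regex: capture everything before the first comma.
grouping = re.compile(r"^([^,]+),")

lines = [
    "ABC04.PI_B04_EX01_A_STPDATTRG.F_CV,2/18/2015 1:03:32 AM,627,Good,NonSpecific,100",
    "ABC05.X4_WET_MX_DDR.F_CV,2/18/2015 12:18:00 AM,12,Good,NonSpecific,100",
    "ABC04.PI_B04_EX01_A_STPDATTRG.F_CV,2/18/2015 1:03:33 AM,628,Good,NonSpecific,100",
]

# Lines whose capture group matches are grouped together; each group
# would leave RouteText as one FlowFile.
groups = {}
for line in lines:
    match = grouping.match(line)
    key = match.group(1) if match else "unmatched"
    groups.setdefault(key, []).append(line)

# "key" is what ExtractText would later expose as a FlowFile attribute
# for MergeContent's Correlation Attribute Name.
```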

Thanks,

Matt

13 Replies

Re: What is a good approach for splitting a 100 GB file into multiple files?

Master Guru

@Saikrishna Tarapareddy

You may consider using the RouteText processor to route the individual lines from your source FlowFile to relationships based upon your various Tagnames, and then use MergeContent processors to merge those lines back into a single FlowFile.

Re: What is a good approach for splitting a 100 GB file into multiple files?

Also, you should use a series of SplitText processors in a row rather than one: the first could split into 100,000 lines or so, then the next to 1,000, then the next to 1. Those numbers (and the number of SplitText processors) can be tuned for your dataset, but this should prevent any single processor from hanging or running out of memory.
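The effect of cascading the splits can be sketched in plain Python (a toy stand-in, not NiFi; the chunk sizes mirror the numbers above, and the point is that each stage only ever produces a bounded number of pieces):

```python
# Split a list into fixed-size chunks, as one SplitText stage would.
def split_chunks(lines, size):
    return [lines[i:i + size] for i in range(0, len(lines), size)]

big = list(range(250_000))            # stand-in for the lines of a huge file
stage1 = split_chunks(big, 100_000)   # first SplitText: a handful of chunks
stage2 = [piece
          for chunk in stage1
          for piece in split_chunks(chunk, 1_000)]  # second SplitText

# Two stages yield many small pieces without any single step
# having to emit them all at once.
print(len(stage1), len(stage2))
```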

Re: What is a good approach for splitting a 100 GB file into multiple files?

Master Guru

The same holds true for the MergeContent side of this flow. Have one MergeContent merge the first 10,000 FlowFiles and a second merge multiple 10,000-line FlowFiles into even larger merged FlowFiles. This again will help prevent running into OOM errors.

Re: What is a good approach for splitting a 100 GB file into multiple files?

Super Collaborator

Hi @mclark,

There will be thousands of tagnames; do I have to specify those as relationships? I hope you are talking about the following approach, which I tried on a smaller file. It split into 5 files based on the grouping regex, but it routed all the split files to the "unmatched" relationship. I think that should be fine; I can drive my remaining process from "unmatched".

Now, will it work if the file is huge (100 GB)? Does it need to read the whole file before it splits based on groups?

8909-routetext.png

RouteText properties:

8910-routetext2.png

Re: What is a good approach for splitting a 100 GB file into multiple files?

Super Collaborator

@mclark @Matt Burgess,

Sorry, I didn't see your reply.

Maybe I should use the above approach with SplitText, so it would be a combination of multiple SplitText and RouteText processors? How do I extract the TagName (the first column) from the text so that I can use it for the merge process?

Regards,

Sai

Re: What is a good approach for splitting a 100 GB file into multiple files?

Multiple SplitTexts just to get the size of each flow file down to a manageable number of lines (not 1 as I suggested above, but not the whole file either), then RouteText with the Grouping Regular Expression the way you have it, then multiple dynamic properties (similar to your TagName above), each with a value of what you want to match:

Tag1 with value ABC04.PI_B04_EX01_A_STPDATTRG.F_CV

Tag2 with value ABC05.X4_WET_MX_DDR.F_CV

...etc.

Once you apply the changes and reopen the dialog, you should see relationships like Tag1 and Tag2; you can then route those relationships to the appropriate branch of the flow. In each branch, you may need multiple MergeContents, as @mclark describes above, to incrementally build up larger files. At the end of each branch, you should have a flow file full of entries with the same tag name.

An alternative is to use SplitTexts down to 1 flow file per line, then ExtractText to put the tag name in an attribute, then RouteOnAttribute to route the files, then the MergeContents to build up a single file with all the lines with the same tag name. This seems slower to me, so I'm hoping the other solution works.
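A minimal sketch of that alternative path, in plain Python rather than NiFi (the attribute name `tagname` and the regex are assumed examples): with one line per FlowFile, an ExtractText-style regex pulls the tag into an attribute, and routing and merging then key off that attribute.

```python
import re

# One line per FlowFile after the SplitTexts.
line = "ABC05.X4_WET_MX_DDR.F_CV,2/18/2015 12:18:00 AM,12,Good,NonSpecific,100"

# ExtractText-style capture of the first column into an attribute.
match = re.match(r"^([^,]+),", line)
attributes = {"tagname": match.group(1)}  # would be a FlowFile attribute

# RouteOnAttribute would then route on ${tagname}, and MergeContent
# would use "tagname" as its Correlation Attribute Name.
print(attributes["tagname"])
```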

Re: What is a good approach for splitting a 100 GB file into multiple files?

Super Collaborator

@Matt Burgess ,

If I have 5K tags, it would be difficult to write dynamic properties for all of them. I am trying to see how I can get the tag name dynamically from RouteText; the expression that I wrote above is not doing it.


Re: What is a good approach for splitting a 100 GB file into multiple files?

Master Guru
@Saikrishna Tarapareddy

You can actually cut the ExtractText processor out of this flow. I forgot that the RouteText processor generates a "RouteText.Group" FlowFile attribute. You can just use that attribute as the "Correlation Attribute Name" in the MergeContent processor.
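In plain-Python terms (a stand-in sketch, not MergeContent itself), correlating on "RouteText.Group" amounts to concatenating the FlowFiles that carry the same attribute value, with no separate ExtractText step needed:

```python
# Toy FlowFiles: an attribute map plus content, as RouteText would emit them.
flowfiles = [
    {"RouteText.Group": "ABC04.PI_B04_EX01_A_STPDATTRG.F_CV", "content": "row1\n"},
    {"RouteText.Group": "ABC05.X4_WET_MX_DDR.F_CV",           "content": "row2\n"},
    {"RouteText.Group": "ABC04.PI_B04_EX01_A_STPDATTRG.F_CV", "content": "row3\n"},
]

# Merge by the correlation attribute: same group value, same output file.
merged = {}
for ff in flowfiles:
    key = ff["RouteText.Group"]
    merged[key] = merged.get(key, "") + ff["content"]

print(len(merged))  # one merged file per tag
```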

8899-screen-shot-2016-10-26-at-33212-pm.png
