Member since: 06-09-2016
Posts: 48
Kudos Received: 10
Solutions: 0
09-02-2016
01:19 PM
Thanks @mclark for your input. As suggested, we implemented the data flow as described below:

1) The standalone flow (this is where the single source file arrives). The RPG in this flow refers to the cluster, i.e. the NCM URL.
2) The NCM of the cluster has the flow shown below.
3) With this approach we were facing the error shown below, and only a subset of the records was landing in HDFS.
4) To avoid that, we added a MergeContent processor to merge the flowfiles, since we were splitting them before loading them into HDFS.
5) We configured MergeContent as shown below.

Even after this change we are not getting all the records in HDFS. The source file has 10,000,000 records, and approximately 5,000,000 records should go to each HDFS directory, but we are getting only around 1,000,000 records in each target, together with the error shown below in the PutHDFS processors. It is the same error as in the snapshot attached to point 3 above.

Are we missing something intrinsic here? Is there something wrong with the design? We are using a 3-node cluster with an NCM and 2 slave nodes, and the source file arrives on a standalone server. Let me know if you need any other information; any inputs would be appreciated.

Regards,
Indranil Roy
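P.S. For reference, the MergeContent settings we are experimenting with look roughly like the sketch below. The property names are the standard MergeContent properties; the values shown here are illustrative placeholders rather than the exact values from the screenshot above.

MergeContent (illustrative values only)
    Merge Strategy            = Bin-Packing Algorithm   # pack many small splits into larger bundles before PutHDFS
    Merge Format              = Binary Concatenation
    Minimum Number of Entries = 10000                   # avoid writing lots of tiny files to HDFS
    Maximum Number of Entries = 100000
    Max Bin Age               = 5 min                   # flush a partially filled bin rather than holding it indefinitely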
08-31-2016
12:56 PM
@mclark We have a single large (TB-scale) flowfile arriving on a standalone node, and we want to distribute the processing. Is it a good approach to split the file into multiple smaller flowfiles using a SplitText processor so that the processing is distributed across the cluster nodes? In that case we are considering the flow given below:

In the NCM of the cluster:
Input -> RouteText -> PutHDFS

In the standalone instance that has the incoming flowfile:
ListFile -> FetchFile -> SplitText -> UpdateAttribute -> RPG (NCM URL)

Does this setup ensure that the processing is distributed?
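For completeness, the SplitText step is along the lines of the sketch below (property names from the standard SplitText processor; the split size is a placeholder we are still tuning, not a confirmed setting):

SplitText (illustrative values only)
    Line Split Count         = 100000   # each split becomes a separate flowfile, so site-to-site can spread them across the nodes
    Header Line Count        = 0
    Remove Trailing Newlines = true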
08-29-2016
03:42 PM
Hi @Pierre Villard,

We are using NiFi 0.6.0. To answer your second question: only a subset of the whole file reaches the ReplaceText processor, based on the condition in the RouteText processor. Say we have 100 records and 50 of them satisfy the condition; then those 50 records come to the processor.

The regular expression we are using is (.+)\|(.+)\|(.+)..., where (.+) is repeated n times based on the number of columns in the flowfile. So, as per your observation, we should be using ^(.+)\|(.+)\|(.+)...$ instead.

Any other suggestions to improve the performance?
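As a quick standalone check of the pattern itself, the snippet below compares the two variants on a made-up record (plain Python re; the 5-column sample line is only for illustration, our real file has more columns):

import re

# A made-up pipe-delimited record with 5 columns (illustration only).
line = "101|John|Smith|2016-08-29|NY"

# Current style: one greedy (.+) group per column, now anchored with ^ and $.
greedy = re.compile(r"^(.+)\|(.+)\|(.+)\|(.+)\|(.+)$")

# Alternative: each column matches "anything except a pipe", which avoids
# most of the backtracking the greedy version does on long lines.
per_column = re.compile(r"^([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)$")

for pattern in (greedy, per_column):
    print(pattern.match(line).groups())
    # both print ('101', 'John', 'Smith', '2016-08-29', 'NY')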
08-29-2016
01:23 PM
2 Kudos
We have a source file with pipe-delimited rows, and we need to extract specific columns from the flowfile. We are using a regular expression in a ReplaceText processor to extract the columns. The flow we are using is:

ListFile -> FetchFile -> RouteText -> ReplaceText -> PutFile

The source file has 21 columns and around 100,000 records; the file size is around 25 MB. As soon as I start the processor, the records queue up before the ReplaceText processor and the job runs indefinitely. In fact, even after stopping the job we are unable to empty the queue or even delete any processor.

The ReplaceText processor is configured as shown below. I have increased the Maximum Buffer Size to 10 MB (1 MB default), but it is still of no use. Considering there are only 100,000 records in the file (25 MB), this should not take so long. Is there anything wrong with the configuration or with the way we are using the flow? Any inputs would be very helpful.

The system we are using has 16 GB RAM and 4 cores.

Regards,
Indranil Roy
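P.S. To make the setup concrete, the ReplaceText configuration has roughly the shape sketched below. The Search Value uses one (.+) capture group per pipe-delimited column, and the Replacement Value shown here (keeping columns 1 and 3) is just an illustrative placeholder rather than the exact value from the screenshot.

ReplaceText (illustrative placeholders)
    Maximum Buffer Size = 10 MB
    Search Value        = (.+)\|(.+)\|(.+)\|...      # one (.+) group per column, repeated for all 21 columns in the real pattern
    Replacement Value   = $1|$3                      # e.g. keep only columns 1 and 3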
Labels:
Apache NiFi
08-25-2016
04:02 PM
@mclark
1) We are talking about a single file in the TB range.
2) There is a single file, and the processing should be distributed.
3) The file is in a local directory.
So is it a good idea?
08-25-2016
12:39 PM
Hi @mclark, thanks for the alternate approach you suggested; it could be helpful in my case. Say, in the scenario mentioned above, we have a single input file on the order of TBs in size. If we use ListSFTP/FetchSFTP processors in the way you mentioned to distribute the fetching of the data:

Do we need to establish an SFTP channel between every slave node of the cluster and the remote server that houses the source file for this approach to work?
Is it a good idea to use SFTP to fetch the file, considering its size will be in TBs?
What are the parameters on which the performance of the fetch using ListSFTP/FetchSFTP will depend?
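Just to confirm we understood the suggestion correctly, this is the layout we have in mind (our interpretation of the List/Fetch pattern, not a confirmed design):

On the instance that can list the remote directory:
    ListSFTP -> RPG (cluster URL)                  # emits one small listing flowfile per remote file
On each node of the cluster:
    Input port -> FetchSFTP -> rest of the flow    # each node fetches the files whose listings were routed to it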
08-25-2016
06:03 AM
Thanks @Bryan Bende for your input. As per your suggestion, I have laid out two different flows below:

Flow 1
=====
In the standalone instance:
ListFile -> FetchFile -> Output port

In the NCM of the cluster:
RPG (standalone NiFi instance) -> RouteText -> PutHDFS (this is the processing done in the main cluster)

Flow 2
=====
In the standalone instance:
ListFile -> FetchFile -> Input port of RPG (NiFi cluster URL of the NCM)

In the NCM of the cluster:
Input port -> RouteText -> Output port

Which of these, in your view, is the correct flow? I understand that fetching the source file cannot be distributed if the file is not shared, but which of the flows is apt for distributing the part of the processing done inside the cluster?

Regards,
Indranil Roy
08-24-2016
01:23 PM
We have a NiFi setup in which a NiFi cluster is installed on a Hadoop cluster and a standalone NiFi instance runs on another server. The input file is generated on the file system of the standalone instance's server. We fetch the file using ListFile/FetchFile processors on the standalone instance. Then, in the main cluster, we connect to the standalone instance using an RPG and send the output into the NiFi cluster (RPG) using site-to-site. As per my understanding, the part of the processing done inside the cluster will be distributed. Is this understanding correct? I would also like to know whether there is a way to distribute the fetching of the source file that we are doing on the standalone NiFi instance.

The flow we are using:

In the standalone instance:
ListFile -> FetchFile -> Output port

In the NCM of the cluster:
RPG (standalone NiFi instance) --> RPG (NiFi cluster)
Input port -> RouteText -> PutHDFS (this is the processing done in the main cluster)

Let me know if you need any other information. Any inputs will be appreciated.

Regards,
Indranil Roy
Labels:
Apache NiFi
08-11-2016
05:45 AM
Thanks a lot. It works.
08-10-2016
04:30 PM
It was in a stopped state. As I already pointed out, I was able to update other properties; this is specific to the "Search Value" property in the ReplaceText processor.