Member since: 06-09-2016
48 Posts
10 Kudos Received
0 Solutions
11-03-2016
01:54 PM
2 Kudos
Background: We are using an Apache NiFi data flow to move data from local to Hadoop-based file systems. We execute the NiFi processors by calling the NiFi REST API from a Groovy script, in which we use Groovy JSON builders to generate the JSON and then pass it to PUT methods to execute the processors.

NiFi version: 0.6.0

While planning to migrate to NiFi 1.0.0 with the same Groovy script, we are facing a few errors in the latest version of NiFi (1.0.0):

1) Having controller/revision in the nifi.get method does not return the response JSON; instead it throws a 404 (verified in the browser too). This works fine in NiFi 0.6.1.

Reference:

    resp = nifi.get(path: 'controller/revision')

2) The call below does not work either, since having "controller" in the path as a prefix to process-groups is no longer valid. It also returns a 404/bad request error. This works fine in NiFi 0.6.1.

Reference:

    resp = nifi.put(
        path: "controller/process-groups/$processGroup/processors/$processorId",
        body: builder.toPrettyString(),
        requestContentType: JSON
    )

PS: While trying to verify this in the browser, GET only responds for paths like /process-groups/{id} or /process-groups/{id}/processors, i.e. without the "controller" segment. This works fine in NiFi 0.6.1.

Reference: host://port/nifi-api/process-groups/root

3) The syntax below does not work in the script either. This works fine in NiFi 0.6.1.

    resp = nifi.put(
        path: "process-groups/$processGroup/processors/$processorId",
        body: builder.toPrettyString(),
        requestContentType: JSON
    )

Since the syntax above works perfectly fine in 0.6.0, I would like to know whether any changes were made in NiFi 1.0.0 to the REST API, or to the way the various HTTP requests are passed to methods like 'get' and 'put'. I could not find any such changes in the release notes or in the NiFi REST API documentation linked below:

https://nifi.apache.org/docs/nifi-docs/rest-api/

Please let me know if you need any other information.

Regards,
Indranil Roy
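For reference, this is roughly how we would expect the equivalent calls to look against the 1.0.0 endpoints if the resources have indeed moved to the top level (a sketch only, based on our reading of the new REST API docs; the base URL and processor id are placeholders, and the per-component revision handling is our assumption):

    import groovyx.net.http.RESTClient
    import static groovyx.net.http.ContentType.JSON
    import groovy.json.JsonBuilder

    // Placeholder base URL
    def nifi = new RESTClient('http://host:port/nifi-api/')

    // Root process group (instead of controller/process-groups/root in 0.6.x)
    def rootGroup = nifi.get(path: 'process-groups/root').data

    // Fetch the processor entity first; in 1.0.0 the revision appears to be
    // returned per component rather than from a global controller/revision endpoint
    def processorId = 'REPLACE-WITH-PROCESSOR-ID'   // placeholder
    def entity = nifi.get(path: "processors/$processorId").data

    // Echo the revision back on the update (our assumption of the 1.x contract)
    def builder = new JsonBuilder([
        revision : [version: entity.revision.version, clientId: 'groovy-client'],
        component: [id: processorId, state: 'RUNNING']
    ])

    def resp = nifi.put(
        path              : "processors/$processorId",
        body              : builder.toPrettyString(),
        requestContentType: JSON
    )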
Labels:
- Apache Hadoop
- Apache NiFi
09-09-2016
11:27 AM
@Bryan Bende Thanks for the input, it really helped a lot in our case. Say I have 2 rows in my table:

1|Indranil|ETL
2|Joy|Reporting

I want to convert them to JSON so that I am able to insert each row into multiple cells in a single HBase row. This is my converted JSON:

    {
        "Personal": [
            {
                "id": "1",
                "name": "Indranil",
                "Skill": "ETL"
            },
            {
                "id": "2",
                "name": "Joy",
                "Skill": "Reporting"
            }
        ]
    }

Is this JSON in the correct format to be consumed by PutHBaseJSON? My end goal is to insert all the values of a row into different cells. "Personal" refers to the "Column Family" and "id" refers to the "Row Identifier Field Name".
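For completeness, this is roughly how the conversion could be done in Groovy (a simplified standalone sketch, not our exact script; the id|name|Skill field order is assumed):

    import groovy.json.JsonBuilder

    // Sketch: build the "Personal" document from pipe-delimited rows (id|name|Skill assumed)
    def rows = ['1|Indranil|ETL', '2|Joy|Reporting']

    def personal = rows.collect { line ->
        def (id, name, skill) = line.split(/\|/)
        [id: id, name: name, Skill: skill]
    }

    def json = new JsonBuilder([Personal: personal])
    println json.toPrettyString()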
09-08-2016
05:37 PM
@Bryan Bende In such a scenario, since we want to store the rows with different row ids, is there a workaround possible? If I understand correctly, using PutHBaseJSON might help. So is there any processor available to convert the pipe-delimited source file into a JSON file that can be consumed by the PutHBaseJSON processor to insert multiple values?
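In case it helps frame the question, one workaround we are considering is an ExecuteScript processor running a small Groovy script along these lines (just a sketch under our assumptions: one pipe-delimited record per line in id|name|Skill order, output as a JSON array; whether PutHBaseJSON accepts an array or needs one document per FlowFile is part of what we are asking):

    // ExecuteScript (Groovy): convert pipe-delimited lines into JSON documents
    import groovy.json.JsonOutput
    import org.apache.nifi.processor.io.StreamCallback
    import java.nio.charset.StandardCharsets

    def flowFile = session.get()
    if (flowFile == null) return

    flowFile = session.write(flowFile, { inputStream, outputStream ->
        def lines = inputStream.getText(StandardCharsets.UTF_8.name()).readLines()
        def docs = lines.findAll { it.trim() }.collect { line ->
            def (id, name, skill) = line.split(/\|/)   // id|name|Skill order assumed
            [id: id, name: name, Skill: skill]
        }
        outputStream.write(JsonOutput.toJson(docs).getBytes(StandardCharsets.UTF_8))
    } as StreamCallback)

    session.transfer(flowFile, REL_SUCCESS)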
09-08-2016
04:55 PM
@Bryan Bende As you mentioned, PutHBaseCell is used to write a single cell to HBase and it uses the content of the FlowFile as the value of the cell. Now if my input FlowFile has, say, 50 lines of pipe-separated values, will it insert those rows into 50 cells with 50 different row ids, or will it put all the rows into the same row?
09-08-2016
01:34 PM
2 Kudos
I want to insert data into HBase from a FlowFile using NiFi. Does PutHBaseCell support HBase tables with multiple column families? Say I have created an HBase table with 2 column families, cf1 (column1, column2, column3) and cf2 (column4, column5). How do I specify the "Column Family" and "Column Qualifier" properties in the PutHBaseCell configuration? Where do I specify the mapping between the FlowFile (a text file with pipe-separated values) and the HBase table? The FlowFile will have pipe-separated columns, and I want to store a subset of the columns in each column family.

Regards,
Indranil Roy
Labels:
- Apache HBase
- Apache NiFi
09-07-2016
04:20 PM
@mclark Thanks for your inputs. The above solution worked perfectly fine in my case, both in terms of the error and the performance. But, as you already mentioned, in this situation we end up with a large number of files in HDFS. Even if I use a MergeContent processor in the flow, I am getting more than one file. From what I can understand by looking at the provenance, the MergeContent processor merges files in blocks. Say we have 100 flow files coming to the MergeContent processor in batches of 30, 30, 20, 20. It will not wait for all 100 files; instead it generates 4 output files by merging each batch. Is there a way we can control this behavior and force it to produce only 1 output file for each output path?

mergecontent.png This is the configuration of the MergeContent processor. Any inputs will be very helpful.

Regards,
Indranil Roy
09-06-2016
04:14 PM
@jwitt According to you, the flow should look like GetFile -> SplitText -> RouteText -> PutHDFS. Since we are using only a standalone instance, if we split the file into 5000 splits do we need to do an UpdateAttribute/MergeContent after the RouteText processor, or is the flow shown above fine? Also, do we need to set the "Number of Concurrent Tasks" on all the processors (GetFile, PutHDFS, SplitText) or only on the RouteText processor?

Regards,
Indranil Roy
09-06-2016
04:05 PM
@mclark Sure, I will try that option. I understand that I need to increase my split count in order to achieve better load balancing. But going back to the main issue I pointed out in the thread above: apart from the performance aspect, we were getting only a subset of the records in HDFS. It seems the process was trying to create the same file and overwrite it multiple times, hence giving the error shown below. When I use the SplitText processor, send the splits to the RPG and then merge them, I am getting the error shown (attached): 7300-afiuy.png

Just to be on the same page, my flow looks like below:

In the NCM of the cluster: flow.png
In the standalone instance: flow1.png

Does increasing the split count solve this problem? Also, is it necessary to use MergeContent/UpdateAttribute if we use SplitText? Can't we achieve this flow without using the MergeContent/UpdateAttribute processors along with the RPG?

Regards,
Indranil Roy
09-06-2016
01:36 PM
We have a data flow as shown above, wherein we have a single pipe-delimited source file of around 15 GB containing 50 million records. We are routing the rows into two different paths in HDFS based on the routing condition shown above in the RouteText configuration window. The whole process takes around 20 minutes to complete on a standalone server. The number of concurrent tasks is set to 10 for all the processors. Is this the best performance we can expect, or is there any way to improve it further on this standalone server, considering the server has 4 cores and 16 GB of RAM? Also, as far as I can observe, most of the processing time is consumed by the RouteText processor. Is this design suitable for this kind of use case, i.e. sending the records of a pipe-delimited file to different outputs based on conditions, given that RouteText processes records line by line? We are using NiFi 0.6.1.
Labels:
- Apache NiFi
09-02-2016
05:21 PM
@mclark Our RouteText processor is configured as shown below: 7231-condition.png

As it shows, records whose number (the first field is the record number) is <= 5000000 go in one direction and records whose number is >= 5100000 go in another. The SplitText processor is configured with the properties below: 7234-split-config-final.png

Just to give an overview of our requirement:

1) We have a single file as source coming to a standalone server.
2) We fetch the file, split it into multiple files, and then send them to the cluster in order to distribute the processing across all the nodes of the cluster.
3) In the cluster we route the files based on the condition, so that records with number <= 5000000 (the first field is the record number) go to one output directory in HDFS and records with number >= 5100000 go to another output directory in HDFS, as specified in the two PutHDFS processors.
4) But after executing the process we have around 1000000 records in each output directory, whereas ideally we should have approximately 5000000 records in each HDFS directory. We are also getting the below error in the PutHDFS processors: 7300-afiuy.png

Please let me know if you need any further information. Just to add, the above setup works perfectly fine when we use a standalone node with PutFile instead of PutHDFS, writing the files to a local path instead of Hadoop.

Regards,
Indranil Roy