About pvillard

pvillard · ‎09-02-2016

Could you share what you find in the application log file (./logs/nifi-app.log)?

pvillard · ‎09-02-2016

Thanks a lot! It works like a charm.

pvillard · ‎09-01-2016

Correct. As I said, you can see what is generated by starting one processor while leaving the next one stopped, so that flow files are generated but not consumed. Then list the queue and click the Info button to display information about a flow file. You can even view the content of the flow file or download it. GenerateFlowFile only generates what we call core attributes, such as the UUID (to uniquely identify a flow file), filename, path, etc. Regarding the ReplaceText processors, this is not true. Here are the configurations:

${now()}|17${now():toNumber():mod(9):toString()}.1.${now():toNumber():mod(25):toString()}.${now():toNumber():mod(255):toString()}|DE|${nextInt():mod(2):toString()}
${now()}|17${now():toNumber():mod(9):toString()}.1.${now():toNumber():mod(25):toString()}.${now():toNumber():mod(255):toString()}|ITA|${nextInt():mod(2):toString()}
${now()}|17${now():toNumber():mod(9):toString()}.1.${now():toNumber():mod(25):toString()}.${now():toNumber():mod(255):toString()}|USA|${nextInt():mod(2):toString()}
${now()}|17${now():toNumber():mod(9):toString()}.1.${now():toNumber():mod(25):toString()}.${now():toNumber():mod(255):toString()}|IND|${nextInt():mod(2):toString()}
${now()}|17${now():toNumber():mod(9):toString()}.1.${now():toNumber():mod(25):toString()}.${now():toNumber():mod(255):toString()}|FR|${nextInt():mod(2):toString()}

For the purpose of the tutorial we want to generate random logs from different countries, hence the multiple processors.
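To make the arithmetic in these expressions concrete, here is a rough Groovy sketch that mimics them outside NiFi. It is not NiFi code: the helper names simulatedIp and simulatedLogLine are hypothetical, a single timestamp is reused where NiFi evaluates each now() separately, the counter argument stands in for nextInt(), and the way the date is rendered may differ from what ${now()} produces.

```groovy
// Hypothetical helper, not NiFi code: rebuilds the IP-like field from one
// epoch-millisecond timestamp, mirroring the mod(9), mod(25) and mod(255) calls.
String simulatedIp(long epochMillis) {
    return "17${epochMillis % 9}.1.${epochMillis % 25}.${epochMillis % 255}".toString()
}

// Hypothetical helper: assembles one pipe-delimited line in the same field order
// as the expressions (timestamp | IP | country | 0 or 1).
// 'counter' is an assumed stand-in for the value returned by nextInt().
String simulatedLogLine(long epochMillis, long counter, String country) {
    return [new Date(epochMillis), simulatedIp(epochMillis), country, counter % 2].join("|")
}

// One line per country code used by the five ReplaceText processors
["DE", "ITA", "USA", "IND", "FR"].eachWithIndex { country, i ->
    println(simulatedLogLine(System.currentTimeMillis(), i as long, country))
}
```

Each printed line has four pipe-separated fields, and only the country code is fixed per processor, which is why the tutorial uses five ReplaceText processors.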

pvillard · ‎09-01-2016

Hi, this is now available in the left panel. This is because of the new multitenancy and ACL improvements. Hope this helps.

pvillard · ‎09-01-2016

Flow files are made of 'attributes' and 'content'. GenerateFlowFile generates random flow files, with or without content. It is generally used to generate data to start your flow, and mostly for demonstration and test purposes. The ReplaceText processor only replaces content and does not modify the attributes. Why five processors? Simply to generate the different parts of the simulated logs you want to process. Just have a look at the configuration of each processor. You can also start a processor without starting the next one in the flow. This will queue up flow files in the relationship. By right-clicking on the relationship and then going to the list, you will be able to see the properties of each flow file as well as its content. I'm sure this will help you understand the why and how.

pvillard · ‎09-01-2016

I'd recommend starting by reading the documentation about the philosophy behind NiFi, as well as the documentation of each processor you mention. It explains the concepts of flow files, repositories, flows, content vs attributes, etc. http://nifi.apache.org/docs.html

pvillard · ‎08-30-2016

OK, so I gave it a quick try. If any Groovy gurus have feedback, don't hesitate. I simulated the following data flow (template split-execute-script.xml):

GenerateFlowFile -> ReplaceText -> ExecuteScript -> PutFile

GenerateFlowFile and ReplaceText are only used to generate flow files matching your requirements. The ExecuteScript processor has the following script body:

    import java.nio.charset.StandardCharsets
    import org.apache.nifi.processor.io.StreamCallback

    // Take the next flow file from the incoming queue; nothing to do if there is none
    def flowFile = session.get()
    if (!flowFile) return

    // Rewrite the content line by line, keeping columns 0, 1, 7 and 8 of each pipe-delimited line
    flowFile = session.write(flowFile, { inputStream, outputStream ->
        inputStream.eachLine { line, count ->
            def columns = line.split("\\|")
            outputStream.write((columns[0] + "," + columns[1] + "," + columns[7] + "," + columns[8] + "\n").getBytes(StandardCharsets.UTF_8))
        }
    } as StreamCallback)

    session.transfer(flowFile, REL_SUCCESS)

There may be a better version of this code, but it does the job. I'll let you try it with your data to confirm, but I think it will meet your performance expectations. If you are dealing with huge files, you may want to first split your data into small chunks and then merge the data back, in order to leverage load balancing (in a cluster configuration) and multithreading. Let me know if you have any questions.
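If you want to check the column selection on a sample before wiring it into the processor, a standalone sketch along these lines can be run with plain Groovy, without NiFi. The helper names selectColumns and transform and the sample records are made up for illustration. The logic simply mirrors the script above: split on the pipe character and keep columns 0, 1, 7 and 8.

```groovy
import java.nio.charset.StandardCharsets

// Hypothetical helper (not part of the NiFi script): keeps columns 0, 1, 7 and 8
// of one pipe-delimited line and joins them with commas.
// Like the original script, it assumes every line has at least nine columns.
String selectColumns(String line) {
    def columns = line.split("\\|")
    return [columns[0], columns[1], columns[7], columns[8]].join(",")
}

// Applies the helper to every line of a stream, the same way the closure
// passed to session.write processes the flow file content.
void transform(InputStream inputStream, OutputStream outputStream) {
    inputStream.eachLine("UTF-8") { String line ->
        outputStream.write((selectColumns(line) + "\n").getBytes(StandardCharsets.UTF_8))
    }
}

// Made-up sample content: two records with nine columns each
def input = new ByteArrayInputStream("a|b|c|d|e|f|g|h|i\n1|2|3|4|5|6|7|8|9\n".getBytes(StandardCharsets.UTF_8))
def output = new ByteArrayOutputStream()
transform(input, output)
print(output.toString("UTF-8"))
```

A line with fewer than nine columns would throw an ArrayIndexOutOfBoundsException here, as it would in the script above, so real data may need a length check first.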

pvillard · ‎08-30-2016

@boyer In this case, you can set the 'Max Bin Age' property so that after a given amount of time the merging process occurs even if the group size condition is not met. Hope this helps.
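As a rough mental model of that rule (not MergeContent's actual implementation), the decision can be sketched in Groovy as below. The function name shouldMerge and its parameters are hypothetical, and the comparison in bytes and milliseconds is an assumption made for the example.

```groovy
// Hypothetical model of the rule described above, not MergeContent's source code:
// a bin is merged once it reaches the minimum group size, or once it has been
// open for longer than the configured maximum bin age, whichever comes first.
boolean shouldMerge(long binSizeBytes, long minGroupSizeBytes, long binAgeMillis, long maxBinAgeMillis) {
    boolean sizeReached = binSizeBytes >= minGroupSizeBytes
    boolean tooOld = binAgeMillis >= maxBinAgeMillis
    return sizeReached || tooOld
}

// Made-up values: a 2 KB bin with a 10 KB minimum, checked before and after a 60 s max age
println(shouldMerge(2048L, 10240L, 5000L, 60000L))
println(shouldMerge(2048L, 10240L, 61000L, 60000L))
```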

pvillard · ‎08-30-2016

@sam coderunner Your comments are correct. I was not expecting such a performance degradation, but you are probably right that the 21 columns are not helping. I will run some tests on my side to check whether performance can be improved. But clearly, I agree with you: in such a case I believe the ExecuteScript processor would be a better fit. It's really easy to write a few lines of Groovy (for example) to do what you are looking for. Let me know if you need any help with this.

pvillard · ‎08-29-2016

@Alvin Ji, this is correct with NiFi 0.x. Unless you implement your own MapCacheServer service and separate it from NiFi, I am not sure there is a solution. With NiFi 1.x (first version to be released in the coming days, RC vote in progress), this is solved with a zero-master clustering paradigm.

Last Visited	‎07-30-2024 08:59 AM

Member Since	‎04-11-2016 09:20 AM
Posts	471
Kudos received	325

Cloudera Community

Re: ValidateRecord doesn't maintain column order?

Re: For NiFi S2S, is it better to us load balancer...

Re: How to Limit Number of Threads for Each Proces...

Re: Once YARN queue is at capacity, running jobs s...

Re: putHiveQL error

Re: nifi -Ingesting a file from SFTP and insert i...

Re: Sqoop Hortonworks TDCH - Queryband option

Re: understanding the NIFI example project

Re: Cannot Import Templates into NiFi 1.0.0

Re: understanding the NIFI example project

Re: understanding the NIFI example project

Re: FlowFiles getting queued before NIFI ReplaceTe...

Re: How to set the minimum group size of MergeCont...

Re: FlowFiles getting queued before NIFI ReplaceTe...

Re: nifi detect duplicate MapCacheServer Error