Member since: 04-11-2016
Posts: 471
Kudos Received: 325
Solutions: 118

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2075 | 03-09-2018 05:31 PM |
| | 2640 | 03-07-2018 09:45 AM |
| | 2535 | 03-07-2018 09:31 AM |
| | 4398 | 03-03-2018 01:37 PM |
| | 2468 | 10-17-2017 02:15 PM |
09-06-2017
11:45 AM
1 Kudo
Another option, without modifying your current workflow, is to configure your ConvertAvroToORC processor to use parallel threads. To do that, change the "Concurrent Tasks" parameter on the "Scheduling" tab of the processor configuration.
09-06-2017
11:42 AM
2 Kudos
Hi @Aaron Dunlap, Depending on the HDF version you are using, you could leverage the record-oriented processors to perform the CSV-to-Avro conversion in a much more efficient way. I assume you're then converting to ORC format to query the data with Hive. If that's the case, a common pattern is to let Hive do the conversion: what I usually do is send the data as Avro into a landing folder in HDFS, then use a PutHiveQL processor to execute a few queries (one to create a temporary external table on top of the Avro data using the corresponding Avro schema, one to insert-select the data from the temporary table into the final table, which is stored as ORC, and one to drop the temporary table; see the sketch below), and finally a DeleteHDFS processor to delete the data used by the temporary table (because the DROP TABLE statement does not delete the data of an external table). There is an ORC reader/writer on the roadmap that will replace all of that (you'll be able to convert directly from CSV to ORC using record-oriented processors), but that's not ready yet. Hope this helps.
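To illustrate the pattern, here is a minimal HiveQL sketch of those three statements (the table names, HDFS paths and schema URL are placeholders, adjust them to your environment):

```sql
-- 1. Temporary external table on top of the Avro landing folder
CREATE EXTERNAL TABLE tmp_mydata
STORED AS AVRO
LOCATION '/landing/mydata'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/mydata.avsc');

-- 2. Insert-select into the final ORC table (assumed to exist already);
--    Hive performs the Avro to ORC conversion here
INSERT INTO TABLE mydata_orc SELECT * FROM tmp_mydata;

-- 3. Drop the temporary table (being external, its files stay on HDFS
--    and are removed afterwards by the DeleteHDFS processor)
DROP TABLE tmp_mydata;
```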
06-22-2017
09:12 PM
3 Kudos
Hi @Raj B, I'd certainly recommend using multiple successive MergeContent processors instead of a single one. If your trigger is size (you want to end up with a 100 MB file), then I'd use a first MergeContent to merge the small files into 10 MB files, and then another one to merge those into a single 100 MB file. That's a typical approach with MergeContent and SplitText processors to avoid this kind of issue. Hope this helps.
06-22-2017
10:05 AM
1 Kudo
Hi @regie canada, The second message is probably due to the fact that the processor cannot be started. You should have more details about the "why" in the nifi-app.log file. I suspect the port is already in use on the host. I see you are talking about ListenTCP although your screenshots show ListenSyslog; are you sure you don't have multiple ListenX processors listening on the same port? 10k events per second should not be an issue at all (it depends on the size of the events, obviously, but I guess we are talking about logs, so you should be good). Hope this helps.
06-21-2017
08:04 PM
1 Kudo
If it's LDAP, then you should use SIMPLE and you can ignore the TLS properties.
06-20-2017
08:17 PM
The XML path must follow these requirements: http://commons.apache.org/proper/commons-configuration/userguide/howto_hierarchical.html
I think that's doable. I'm not sure it's the best approach if you have hundreds of input directories, though. If you have one input directory per output directory, is there a way to compute the destination directory from the path of the input directory? It could be easier to use expression language on the input directory to define the output one (see the example below).
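For example (purely illustrative, the directory names are assumptions), the Directory property of the PutFile/PutHDFS processor could be derived from the absolute.path attribute that ListFile/GetFile set on each flow file, along these lines:

```
${absolute.path:replace('/data/input', '/data/output')}
```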
06-20-2017
07:58 PM
Looks like your LDAP configuration is incorrect. Is it LDAPS or LDAP? It seems to be an error related to SSL/TLS parameters.
06-20-2017
03:39 PM
1 Kudo
First of all, you don't need to use both GetFile and FetchFile. GetFile is fine, but if you want to use FetchFile, it must be used in combination with ListFile (see the article about the List/Fetch pattern). Then you want to send the path in the flow file attributes, not in the content, and there is a slash missing at the beginning of your XPath expression. Now I realize that I misunderstood what you are trying to achieve: I didn't understand that you have two different files, one of them containing the destination path; I thought it was a single file. So what I suggested is not going to work. But just in case, here is a template with what I had in mind: xpath.xml
Now let's focus on your use case. 🙂 You want to use a Lookup controller service that points to your configuration file. Then you can reference that controller service in a LookupAttribute processor, which will extract the value from your configuration file and set it as an attribute of your flow file. The flow then becomes: ListFile, FetchFile, LookupAttribute, PutFile. Here is a template that should fulfill your requirements (just change the paths as needed). Don't forget that controller services are defined at the process group level. Also note, if I'm correct, that this template requires the latest version of NiFi to work. xmllookup.xml
06-20-2017
03:19 PM
2 Kudos
Hi MB, Yes, ZooKeeper is used by a lot of components (for high availability purposes), not only HBase. It is a mandatory and vital component.
06-20-2017
01:57 PM
1 Kudo
Hi @Pavan Challa, I'd recommend using the EvaluateXPath processor: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.EvaluateXPath/index.html You can use the following XPath expression: /config/path Extract the value into an attribute of your flow file, and you can then use it however you want in the following steps (see the example below). Hope this helps.
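For example, given an input document shaped like the sketch below (the value inside is just a placeholder), the /config/path expression selects the path element:

```xml
<config>
  <path>/data/output/target-directory</path>
</config>
```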