Member since: 11-16-2015
Posts: 905
Kudos Received: 665
Solutions: 249
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 427 | 09-30-2025 05:23 AM |
|  | 764 | 06-26-2025 01:21 PM |
|  | 657 | 06-19-2025 02:48 PM |
|  | 847 | 05-30-2025 01:53 PM |
|  | 11380 | 02-22-2024 12:38 PM |
11-03-2016
03:08 PM
1 Kudo
How do you know which delimiter is used for a particular file? If you can determine that from the content, you might be able to use RouteContent to send all \u0001-delimited files to one ConvertCSVToAvro (using the technique I describe above), all \u0002-delimited files to another, and so on. Likewise, if you can somehow extract the delimiter into an attribute, you can use RouteOnAttribute rather than RouteContent. Why would you like to avoid ReplaceText? The content of the flow files will be altered when converting to Avro, so you won't have the original input at that point anyway. If it is a performance issue, do you think my suggestion above would work for your use case?
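If the delimiter really can be inferred from the content, the detection step boils down to something like the following standalone sketch (in Python; the candidate set, the first-line heuristic, and the function name are my assumptions, not anything NiFi provides out of the box):

```python
# Sketch: guess a file's delimiter by counting candidate characters in its
# first line. Candidates and heuristic are illustrative assumptions.
def detect_delimiter(first_line, candidates=(u"\u0001", u"\u0002", u",")):
    counts = {c: first_line.count(c) for c in candidates}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

The detected character could then drive the routing decision (e.g. written into an attribute for RouteOnAttribute).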
11-02-2016
07:07 PM
1 Kudo
On a Mac, I found a useful procedure here for enabling the pasting of a Unicode character into the current text box. Using this, I opened the ConvertCSVToAvro processor dialog, then the CSV Delimiter property value dialog. Then, using the procedure, I selected character \u0001, which pastes it into the property (although it is not a visible character, so you won't see it on the screen). Click OK, then Apply, and the delimiter should be set to \u0001. I tried this with a simple example and it worked. On Windows I think you can use the Character Map or something similar; the idea is to either have some utility copy a Unicode character to the clipboard for pasting into the property value dialog, or perhaps it will paste it for you (like the Mac utility). Once NIFI-2369 is resolved, there might be a way to use Expression Language to make this more visible, like ${literal('\u0001')} or something. Alternatively, you could use a scripting processor like ExecuteScript and do the split in code (JavaScript or Groovy, e.g.)
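If the scripting route looks easier, the heart of such a script is just a split on the \u0001 character. A minimal standalone sketch of that logic (here in Python, which ExecuteScript also supports via Jython; in a real script you would read and write the content through the NiFi session API):

```python
# Minimal sketch: split \u0001-delimited content into fields, line by line.
# The content is an inline string here purely for illustration.
content = u"a\u0001b\u0001c\nd\u0001e\u0001f"
rows = [line.split(u"\u0001") for line in content.splitlines()]
```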
11-02-2016
05:34 PM
1 Kudo
In your snippet above, the property set to "KERBEROS" is "hive.server2.authentication", not "hadoop.security.authentication". If "hadoop.security.authentication" is set to "kerberos" in your core-site.xml, ensure the path to your core-site.xml is included in the Hive Configuration Resources property. That property accepts a comma-separated list of files, so you can include your hive-site.xml (as you've done in your screenshot above) as well as the core-site.xml file (which has the aforementioned property set).
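For reference, the relevant core-site.xml entry looks like this (the value shown is what the processors expect; the surrounding file is your own):

```xml
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
```

The Hive Configuration Resources property would then contain something along the lines of a comma-separated list such as `/etc/hive/conf/hive-site.xml,/etc/hadoop/conf/core-site.xml` (those paths are examples; use your own locations).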
11-02-2016
04:45 PM
2 Kudos
The Hive processors share some code with the Hadoop processors (in terms of Kerberos, etc.), so they expect "hadoop.security.authentication" to be set to "kerberos" in your configuration file(s) (core-site.xml, hive-site.xml, e.g.)
11-02-2016
04:39 PM
2 Kudos
QueryDatabaseTable does indeed treat DECIMAL and NUMERIC types as strings in the outgoing Avro; there is a Jira case (NIFI-2624) to improve the handling of these types. In the meantime, you might be able to use ConvertAvroSchema, but you won't be able to support BigDecimal values there either; it only supports conversion to/from int, long, double, and float. If your values fit in a double, that might work for now.
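If you go the double route, it may be worth checking up front whether your actual values survive the conversion. A quick standalone sketch (in Python; the helper name is mine, not a NiFi API):

```python
from decimal import Decimal

def fits_in_double(decimal_str):
    # True if the decimal string survives a round-trip through a double
    return Decimal(decimal_str) == Decimal(repr(float(decimal_str)))
```

A value like 123.45 round-trips cleanly, while a 25-digit value will silently lose precision on the way through a double.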
10-28-2016
05:00 PM
2 Kudos
Do they need to be separate fetches? If you use a single ExecuteSQL with JOINs for the foreign keys, you can get a single result set (in Avro), then use ConvertAvroToJSON to convert to a single JSON object. If they must be in different flow files, there is currently no "MergeJSON" processor, although that would be a great contribution if you're interested in writing a full processor. An alternative is to use ExecuteScript or InvokeScriptedProcessor. In either case, keep in mind that NiFi employs a flow-based paradigm, so merging arbitrary incoming flow files can be tricky. This is done in some splitting processors (such as SplitText) by setting "fragment.id", "fragment.count", and "fragment.index" attributes on the flow files, so a downstream "merging" processor can handle these micro-batches by merging together all files with the same fragment.id. I've got an example of this kind of merging processor as a pull request for NIFI-2735. If you're using a scripting processor and just want to solve this one specific issue, you could assume that you will only get 3 incoming flow files and merge them accordingly. This is fragile but could work for your use case.
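To illustrate the fragment.* convention, here is a rough standalone sketch of the merging logic (in Python; the attribute names follow the description above, and a "flow file" is modeled as a simple (attributes, content) pair for illustration):

```python
from collections import defaultdict

# Sketch: group incoming "flow files" by fragment.id and merge each group
# in fragment.index order, but only once all fragment.count parts arrived.
def merge_fragments(flow_files):
    groups = defaultdict(dict)
    counts = {}
    for attrs, content in flow_files:
        fid = attrs["fragment.id"]
        groups[fid][int(attrs["fragment.index"])] = content
        counts[fid] = int(attrs["fragment.count"])
    return {fid: "".join(parts[i] for i in range(counts[fid]))
            for fid, parts in groups.items()
            if len(parts) == counts[fid]}
```

In a real processor the incomplete groups would stay queued until their remaining fragments arrive, rather than being dropped.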
10-27-2016
01:07 AM
1 Kudo
You've got a single SplitText in your example; you might find better performance with multiple SplitTexts that reduce the size incrementally. Maybe 100,000 lines is too many; perhaps split into 10,000, then 100 (or 1, etc.). If you have a single file coming from FetchFile, you won't see performance improvements with multiple tasks unless you break down the huge file as described. Otherwise, with such a large single input file, you might see a bottleneck at some point due to the vast amount of data moving through the pipeline, and multi-threading will not help with a single SplitText if you have a single input. With multiple SplitTexts (and the "downstream" ones having multiple threads/tasks), you may find some improvement in throughput.
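To make the incremental idea concrete, here is a standalone sketch (Python; the function names and stage sizes are illustrative) of what a chain of SplitTexts effectively does to one large batch of lines:

```python
# Sketch: split a batch of lines in stages, the way a chain of SplitText
# processors would (each stage's batches feed the next, smaller stage).
def split_lines(lines, n):
    return [lines[i:i + n] for i in range(0, len(lines), n)]

def staged_split(lines, stages=(100000, 10000, 100)):
    batches = [lines]
    for size in stages:
        batches = [chunk for b in batches for chunk in split_lines(b, size)]
    return batches
```

The payoff in NiFi is that each downstream stage sees many small flow files instead of one huge one, so its multiple tasks actually have work to parallelize.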
10-26-2016
06:11 PM
Use multiple SplitTexts just to get the size of each flow file down to a manageable number of lines (not 1 as I suggested above, but not the whole file either), then RouteText with the Grouping Regular Expression the way you have it, then multiple dynamic properties (similar to your TagName above), each with a value of what you want to match:

- Tag1 with value ABC04.PI_B04_EX01_A_STPDATTRG.F_CV
- Tag2 with value ABC05.X4_WET_MX_DDR.F_CV
- ...etc.

Once you Apply the changes and reopen the dialog, you should see relationships like Tag1 and Tag2; you can then route those relationships to the appropriate branch of the flow. In each branch, you may need multiple MergeContents like @mclark describes above, to incrementally build up larger files. At the end of each branch, you should have a flow file full of entries with the same tag name. An alternative is to use SplitTexts down to 1 flow file per line, then ExtractText to put the tag name in an attribute, then RouteOnAttribute to route the files, then MergeContents to build up a single file with all the lines having the same tag name. This seems slower to me, so I'm hoping the other solution works.
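Logically, what RouteText does here is bucket each line by whatever the Grouping Regular Expression captures and compare that against each dynamic property's value. A rough standalone sketch (Python; the comma-separated line format and the regex are my assumptions about the data):

```python
import re
from collections import defaultdict

# Dynamic properties as described above: relationship name -> tag to match.
tags = {
    "Tag1": "ABC04.PI_B04_EX01_A_STPDATTRG.F_CV",
    "Tag2": "ABC05.X4_WET_MX_DDR.F_CV",
}

def route_lines(lines, grouping_regex=r"^([^,]+),"):
    routed = defaultdict(list)  # relationship name -> matching lines
    for line in lines:
        m = re.match(grouping_regex, line)
        if not m:
            continue
        for relationship, tag in tags.items():
            if m.group(1) == tag:
                routed[relationship].append(line)
    return routed
```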
10-26-2016
05:11 PM
Also, you should use a series of SplitText processors in a row rather than one: the first could split into 100,000 rows or so, then the next into 1,000, then the next into 1. Those numbers (and the number of SplitTexts) can be tuned for your dataset, but this approach should prevent any single processor from hanging or running out of memory.
10-26-2016
12:52 PM
As @Bryan Bende has said, it isn't possible with those processors and/or the framework. However, you could emulate this part of the flow with something like ExecuteScript, but you'd be responsible for all the work (reading in the JSON, splitting it, getting the fields out into attributes). Groovy, for example, has a JsonSlurper which reads the JSON into an object; at that point you could access the array (using object notation, not JSON path), call each(), then further access the members (again using object notation) and set flow file attributes accordingly.
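The Groovy flow described above maps onto any JSON library; here is the same shape in Python (ExecuteScript also supports Jython), with the field names and attribute keys invented purely for illustration:

```python
import json

# Parse the JSON, walk the array with a loop (the analogue of Groovy's
# each()), and collect the values you would set as flow file attributes.
doc = json.loads('{"items": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]}')
attribute_sets = []
for item in doc["items"]:
    attribute_sets.append({"item.id": str(item["id"]),
                           "item.name": item["name"]})
```

In a real script, each entry in `attribute_sets` would become a new flow file created from the incoming one, with those attributes applied.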