About mburgess

mburgess · ‎04-08-2019

Since you want to change the array of values into key/value pairs, you'll need to put them in an object inside the variables array, so I'm guessing you want a single-element array "variables" containing an object that has the key value pairs. If that's correct, you can use JoltTransformJSON with the following spec, it adds keys for each value in the array based on its order: [ { "operation": "shift", "spec": { "variables": { "0": "variables[0].username", "1": "variables[0].active", "2": "variables[0].temperature", "3": "variables[0].age" }, "*": "&" } } ] This gave me the following output: { "id" : 123456, "ip" : "*", "t" : -12.9, "T" : -23.8, "variables" : [ { "username" : "user1", "active" : 0, "temperature" : 12.97, "age" : 23 } ] }

mburgess · ‎04-03-2019

I believe the flow file is entering the processor and just taking a very long time to process. In the meantime it will show up in the connection on the UI (although if you try to remove it while it's being processed, you will get a message that zero flow files were removed). The indicator that the flow file is being processed is the grid of light/dark dots on the right side of the processor. While that is shown, the processor is executing, ostensibly on one or more flow files from the incoming queue. For your script, I think the reason for the long processing (which I would think would be followed by errors on the processor and in the log?) is because you're reading the entire file into a String, then calling PDDocument.load() on the String, when there is no method for that (you need byte[] or InputStream). The very unfortunate part here is that Groovy will try to print out the value of your String, and for some unknown reason when you call toString() on a PDDocument, it gives the entire content, which for large PDFs you can imagine is quite cumbersome. Luckily you can skip the representation as a String altogether, since the ProcessSession API gives you an InputStream and/or OutputStream, which you can use for load() and save() methods on a PDDocument. I took the liberty of refactoring your script above, mine's not super sophisticated (especially in terms of error handling) but should give you the gist of the approach: import org.apache.pdfbox.pdmodel.PDDocument import org.apache.pdfbox.multipdf.Splitter flowFile = session.get() if(!flowFile) return def flowFiles = [] as List<FlowFile> try { def document session.read(flowFile, {inputStream -> document = PDDocument.load(inputStream) } as InputStreamCallback) def splitter = new Splitter() splitter.setSplitAtPage(2) try { def forms = splitter.split(document) forms.each { form -> newFlowFile = session.write(session.create(flowFile), {outputStream -> form.save(outputStream) } as OutputStreamCallback) flowFiles << newFlowFile form.close() } } catch(e) { log.error('Error writing splits', e) throw e } finally { document?.close() } session.transfer(flowFiles, REL_SUCCESS) } catch(Exception e) { log.error('Error processing incoming PDF', e) session.remove(flowFiles) } session.remove(flowFile)

mburgess · ‎04-02-2019

You have specified the SQL Statement property but haven't supplied any values. I recommend replacing PutSQL with PutDatabaseRecord with a Statement Type of INSERT, this should do what you are trying to do.

mburgess · ‎04-02-2019

You can use MergeContent or MergeRecord for this, it can take flow files each with a single record and combine them together to make a flow file containing many Avro records, then you can use ConvertAvroToParquet or PutParquet.

mburgess · ‎03-29-2019

That's a great idea, thanks! I've been meaning to update it, hopefully sooner than later 🙂

mburgess · ‎03-26-2019

Once a flow file has been created in a session, it must be removed or transferred before the session is committed (which happens at the end of ExecuteScript). Since your try is outside the loop that creates new flow files, you'll want to remove all the created ones, namely the flowFiles list. You can do that with simply: session.remove(flowFiles) rather than the loop you have in your catch statement.

mburgess · ‎03-21-2019

Instead of extracting the content into attributes (which will effectively turn the JSON objects into Strings), you should be able to add the core attributes using UpdateRecord or JoltTransformJSON/JoltTransformRecord. For the record processors, it looks like you could use the following schema for the reader: { "namespace": "nifi", "name": "myRecord", "type": "record", "fields": [ {"name": "headers","type": {"type": "map","values": "string"}}, {"name": "info","type": {"type": "map","values": "string"}}, {"name": "payload","type": {"type": "map","values": "string"}} ] } For the subflow to add core attributes, you'd want the writer schema to include the additional fields such as kafka.partition, then in UpdateRecord for example, you could add a user-defined property /kafka.partition with value 1 (and Replacement Strategy "Literal Value"). If in the core attribute subflow you want to remove the payload field, you can just remove it from the writer schema, and it won't be included in the output. For example: { "namespace": "nifi", "name": "myRecord", "type": "record", "fields": [ {"name": "headers","type": {"type": "map","values": "string"}}, {"name": "info","type": {"type": "map","values": "string"}}, {"name": "uuid","type": "string"}, {"name": "kafka.topic","type": "string"}, {"name": "kafka.partition","type": "string"}, {"name": "kafka.offset","type": "long"} ] } If I understand correctly, your headers+info+payload content would remain the same as the original. If instead you are trying to collapse all the key-value pairs from headers, info, etc. and possibly add core attributes, then JoltTransformJSON is probably the best choice. If that's what you meant please let me know and I'll help with the Jolt spec to do the transformation.

mburgess · ‎02-25-2019

It sounds like you're using ExecuteSQL as a source processor, meaning there are no incoming connections. In that case no flow file will be created. Instead, use GenerateFlowFile upstream from ExecuteSQL. You can schedule GenerateFlowFile for 10 minutes and set the content of the flow file to the SQL statement you want to execute. Then in ExecuteSQL you can schedule for 0 sec (as fast as possible, since the upstream processor runs every 10 mins) and remove the SQL Select Query property. At that point ExecuteSQL will expect the flow file content to contain SQL (which it will) and will execute that. If an error occurs, the flow file should be routed to failure, and you can use a PutEmail processor or something downstream for notification.

mburgess · ‎02-20-2019

I left an answer to your SO post, basically there is a bug or two preventing BLOB transfer from source to target DBs, but I presented a possible workaround.

mburgess · ‎02-20-2019

Is this a standard format (I know it's not proper JSON but is it some other standard format)? If not, you could write a ScriptedReader to parse your records. If all the data looks like the above though, you could use ReplaceText to replace ( with [ and ) with ], also remove the "u" prefix from some strings (like the address field), thereby converting it to proper JSON so you can use components like JsonTreeReader for example.

Online	Offline
Last Visited	‎01-16-2026 01:45 PM

Member Since	‎11-16-2015 02:21 PM
Last Visited	‎01-16-2026 01:45 PM
Posts	911
Kudos received	662

Cloudera Community

Re: Compare data within the JSON using NIFI

Re: how to join three csv files like sql on condit...

Re: How to see the Data Provenance and Lineage in ...

Re: Apache NiFi - RouteText has no matches

Re: Nifi Building error when creating a brand new ...

Re: Format JSON file using schema

Re: How can I use a Groovy ExecuteScript to split ...

Re: exeutesql to putsql

Re: Can I store more than one record in a parquet ...

Re: Creating multiple flow files and ExecuteScript...

Re: Creating multiple flow files and ExecuteScript...

Re: Split JSON flow file and recompose to new flow...

Re: How to test NIFI DBCPConnectionPool periodical...

Re: How can I get the Blob file using NiFi?

Re: How to convert the sensor data which is in unu...