Member since: 02-22-2016
Posts: 60
Kudos Received: 71
Solutions: 27
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4410 | 07-14-2017 07:41 PM |
| | 1304 | 07-07-2017 05:04 PM |
| | 5014 | 07-07-2017 03:59 PM |
| | 905 | 07-06-2017 02:59 PM |
| | 2869 | 07-06-2017 02:55 PM |
06-20-2017
02:21 PM
1 Kudo
@Arsalan Siddiqi You might get some mileage out of running NiFi in cluster mode (you don't say whether it's a single instance or not). I can't imagine being able to pull 22M records over the Spark receiver in a timely manner otherwise. Also, I'd use a cascade of Splits, breaking down the file in steps (22M -> 1M -> 100k -> 1); this will help a lot with memory utilization. And if you haven't seen it before, there's a trick to re-balance FlowFiles across a cluster: do a self site-to-site transmission with an RPG pointing at the current cluster. That said, for this kind of workload you'll be better served by delivering the data to Kafka and using the Kafka receiver (sketched below); you might still need a NiFi cluster, however. Last, make sure that you choose an appropriate batch interval on the Spark Streaming side: start with something large, e.g., 10 seconds, and work down from there as you tune your app.
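For reference, here's a minimal sketch of the Spark Streaming side using the receiver-based Kafka API; the `nifi-events` topic, local ZooKeeper quorum, and `spark-app` consumer group are assumptions for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("NifiKafkaStream")
val ssc = new StreamingContext(conf, Seconds(10)) // start with a large batch interval, then tune down

// Receiver-based stream of (key, value) pairs from the topic
val stream = KafkaUtils.createStream(
  ssc,
  "localhost:2181",        // ZooKeeper quorum (assumption)
  "spark-app",             // consumer group id (assumption)
  Map("nifi-events" -> 2)  // topic -> number of receiver threads
)

stream.map(_._2).count().print() // e.g., just count records per batch

ssc.start()
ssc.awaitTermination()
```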
06-20-2017
04:13 AM
3 Kudos
@Alvaro Muir Unfortunately InvokeHTTP isn't currently built to transfer the chunks of a chunked transfer-encoded response as individual FlowFiles. In lieu of that, I think you have a few options:

- Use ExecuteProcess with the Batch Duration set, along with curl or httpie, and just capture the output of the shell command.
- Create a scripted or custom processor. The trick here is that you'll need another thread reading the chunks off the long-running request and feeding them to a process-local queue; the processor's job is then just to check that queue and transfer whatever chunks it sees at the time. To be specific, your @OnScheduled method could connect to the HTTP endpoint, read individual chunks, and push them onto a LinkedBlockingQueue; your onTrigger method could then do a poll() or take() to see if any chunks are available, creating a new FlowFile out of each chunk and doing a session.transfer() for it (see the sketch after this list). The GetTwitter processor is the prototypical example of this pattern: its Hosebird client is set up in @OnScheduled to feed eventQueue, and onTrigger then polls eventQueue for the Tweets.
- Just use curl or httpie to create files and have NiFi pick those up with GetFile. It's pretty silly, but `http --stream <URL> | split -l 1` will actually create an individual file out of each chunk.
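To make the custom-processor option concrete, here's a rough sketch of the queue pattern, written in Scala for brevity. A real NiFi processor would be built on the Java processor API, would create and transfer FlowFiles from a ProcessSession instead of printing, and would read raw chunks off the response stream rather than lines:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.URL
import java.util.concurrent.LinkedBlockingQueue

object ChunkQueuePattern {
  // Bounded queue so a slow consumer applies backpressure to the reader thread
  val chunkQueue = new LinkedBlockingQueue[String](10000)

  // The "@OnScheduled" half: open the long-running request once and feed the queue
  // (lines stand in for chunks here to keep the sketch short)
  def startReader(endpoint: String): Thread = {
    val reader = new Thread(() => {
      val in = new BufferedReader(new InputStreamReader(new URL(endpoint).openStream()))
      Iterator.continually(in.readLine())
        .takeWhile(_ != null)
        .foreach(chunkQueue.put) // blocks when the queue is full
    })
    reader.setDaemon(true)
    reader.start()
    reader
  }

  // The "onTrigger" half: drain whatever chunks happen to be available right now
  def drainAvailableChunks(): Unit = {
    Iterator.continually(chunkQueue.poll())
      .takeWhile(_ != null)
      .foreach { chunk =>
        // In a real processor: session.create() a FlowFile, write the chunk to it,
        // and session.transfer() it to a success relationship
        println(chunk)
      }
  }
}
```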
06-14-2017
03:30 PM
2 Kudos
@Timothy Spann Try minifying the Avro schema source (i.e., removing all the spaces and newlines), or at least stripping the leading spaces from each line. It should work then.
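For example, a pretty-printed schema like this hypothetical two-field record should be accepted once collapsed to a single line:

```json
{"type":"record","name":"User","fields":[{"name":"id","type":"long"},{"name":"name","type":"string"}]}
```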
03-22-2017
02:20 PM
3 Kudos
@Alex Woolford There are a few things you can try (none of which are really NiFi concerns; see the sketch after this list):

- iptables port redirection
- Run something like HAProxy to forward TCP traffic from 514 to the selected port in NiFi
- Use the cap_net_bind_service capability available in more recent Linux kernels to allow the JVM to bind to privileged ports without running as root
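As a rough sketch of the first and third options (the unprivileged port 10514 and the java binary path are assumptions for illustration):

```sh
# iptables: redirect the privileged syslog port to an unprivileged port NiFi listens on
iptables -t nat -A PREROUTING -p tcp --dport 514 -j REDIRECT --to-port 10514
iptables -t nat -A PREROUTING -p udp --dport 514 -j REDIRECT --to-port 10514

# or grant the JVM the capability to bind privileged ports without running as root
setcap 'cap_net_bind_service=+ep' /usr/lib/jvm/java-8-openjdk/bin/java
```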
11-22-2016
05:19 PM
1 Kudo
@Raj B I laid out some of the options in a recent mailing list discussion [1]. These include both of the suggestions above, along with a link to a work-in-progress implementation of the ControllerService approach.

[1] http://mail-archives.apache.org/mod_mbox/nifi-dev/201611.mbox/%3c31b8fe4f-95f6-4419-80a9-f9a728a9cb7c@me.com%3e
11-08-2016
06:31 PM
1 Kudo
@Gerard Alexander sliding() keeps track of the partition index, which in this case corresponds to the ordering of the unigrams. Compare `rdd.mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect()` and `rdd.sliding(2).mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect()` to help with the intuition.
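As a quick illustration (assuming a spark-shell session where `sc` is the SparkContext and the sliding() implicits from spark-mllib are available), note how the windows cross the partition boundary in order:

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._ // brings sliding() into scope

val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2) // 2 partitions: (a, b) and (c, d)

rdd.sliding(2).collect()
// Array(Array(a, b), Array(b, c), Array(c, d))
// The (b, c) window spans the partition boundary, preserving unigram order.
```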
11-08-2016
05:24 PM
@Ankit Jain There are a few ways you can do this:

- ReplaceText with a regex matching the lines that start with PID
- SplitText -> RouteText, matching lines that start with PID (see the sketch after this list)

What you can't do, though, is extract the PID first and then run it through ExtractHL7Attributes: ExtractHL7Attributes requires a full, valid HL7 message. If you want to do that, your best bet is to run ExtractHL7Attributes first and then (re)create a new message from the created PID attribute values.
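A rough sketch of the SplitText -> RouteText configuration (the dynamic property name `pid` is an arbitrary choice; matching lines are routed to a relationship of the same name):

```
SplitText
  Line Split Count:   1

RouteText
  Routing Strategy:   Route to each matching Property Name
  Matching Strategy:  Starts With
  pid:                PID
```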
11-08-2016
05:24 PM
1 Kudo
@Ankit Jain In AttributesToJSON, do you have the Destination property set to "flowfile-content"? If not, it puts the JSON into a JSONAttributes attribute and leaves the FlowFile content unchanged, in this case an HL7 document. An HL7 document of course isn't JSON and starts with MSH, so this is exactly the error you'd see with Destination set to "flowfile-attribute" (the default) rather than "flowfile-content".
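Schematically, the two settings behave like this:

```
Destination = flowfile-attribute (default)
  content:    MSH|^~\&|...           <- unchanged HL7
  attribute:  JSONAttributes = {...}

Destination = flowfile-content
  content:    {...}                  <- the JSON replaces the HL7
```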
11-08-2016
03:00 PM
@Raj B Is there any chance you can share an example of the JSON message and any stack trace that is output in nifi-app.log?
10-21-2016
10:03 PM
@Jeeva Jeeva You're probably best off posting that as another question (both to get it answered and to make it more searchable). I don't have anything in hand at the moment. Best.