09-09-2016 01:06 PM
Hi, I noticed that the only configurable part of the PutHiveQL processor is the Hive Database Connection Pooling Service name, and this needs to be set up as a controller service. So I'm wondering where to specify the actual INSERT statement or query that will do the job. Does this imply that there has to be another processor preceding it that generates INSERT queries specific to where and how we want the data in Hive, which are then fed to this processor as flow files?
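For context, PutHiveQL executes whatever HiveQL arrives as the flow file's content, so an upstream processor (ReplaceText, or a script) has to produce the statement. A minimal Groovy sketch of turning one delimited record into an INSERT statement, such as an ExecuteScript-style helper might do — the table and column names here are hypothetical, not from the original post:

```groovy
// Build a HiveQL INSERT from one delimited record.
// Table and column names are illustrative only.
def buildInsert(String table, List<String> columns, String line, String delim = ',') {
    // Quote each field, escaping embedded single quotes.
    def values = line.split(delim).collect { "'${it.trim().replace("'", "\\'")}'" }
    assert values.size() == columns.size()
    "INSERT INTO ${table} (${columns.join(', ')}) VALUES (${values.join(', ')})"
}

def stmt = buildInsert('sales', ['id', 'region', 'amount'], '42, EMEA, 1250')
println stmt
// INSERT INTO sales (id, region, amount) VALUES ('42', 'EMEA', '1250')
```

A flow file whose content is the resulting statement can then be routed straight into PutHiveQL.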
Labels:
- Apache Hive
- Apache NiFi
08-30-2016 10:08 AM
@Pierre Villard Yes please, we do need help with how to do it in Groovy, because sooner or later we will have to implement this in as performance-friendly a way as possible. We are looking at files that come in at GBs or more, and there is no way we can live with a disappointingly slow dataflow like the current one. Any help is appreciated.
08-30-2016 08:55 AM
@Pierre Villard, my thoughts exactly; I am really doubtful about the regex processing there. So far I have browsed through the community and mail archives and learned that compilation of a regular expression in a ReplaceText processor (or any other processor) occurs every time a flow file is pulled from the queue. Since in this case the evaluation mode is set to "Line-by-Line", the regex is compiled first, checked against each line, and then replaced with the selected $1, $2, ... expression in the Replacement Value. This could go unnoticed with a small file of 10-100 records and 5-10 columns, but when the file has 21 columns and 1000+ records, I do believe the processing can hit a performance wall really hard, hence the discernible bog-down in speed we can see at the RouteText --> ReplaceText stage. Correct me if I'm wrong, but isn't this more of an ETL-specific operation (by that I mean the selective column splitting with regex) than a dataflow-specific operation? Is it not entirely wrong to presume that NiFi, being a DFM tool, will behave like this if ETL-heavy operations are forced into it? Are there other efficient ways to do this in NiFi, or is it a good idea to delegate the splitting task to an external script using the ExecuteScript processor? Please hit us with some insights.
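To illustrate the compilation concern in plain Groovy (the two-column regex here is illustrative, not the actual 21-column pattern): compiling the pattern once and reusing it across lines avoids the repeated Pattern.compile cost, which is the difference being discussed.

```groovy
import java.util.regex.Pattern

// Naive: the regex is recompiled on every call (analogous to compile-per-pull).
def extractNaive(List<String> lines, String regex) {
    lines.collect { line ->
        def m = (line =~ regex)          // compiles the pattern each time
        m.matches() ? "${m.group(2)},${m.group(1)}" : line
    }
}

// Better: compile once up front, reuse the Pattern for every line.
def extractCompiled(List<String> lines, Pattern p) {
    lines.collect { line ->
        def m = p.matcher(line)
        m.matches() ? "${m.group(2)},${m.group(1)}" : line
    }
}

def lines = ['a,1', 'b,2']
def regex = /([^,]+),([^,]+)/
// Both produce the same output; only the compilation count differs.
assert extractNaive(lines, regex) == extractCompiled(lines, Pattern.compile(regex))
```

With thousands of lines per flow file, the precompiled version does one compile instead of one per file pull, which is the kind of saving an ExecuteScript approach could exploit.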
08-23-2016 01:06 PM
Thanks for the confirmation @Bryan Bende. Like you said, if the source directory is not meant to be a shared location, then listing should always happen on the primary node. So I tried the same with the ListFile processor: I changed the source directory to a path that is not a shared location, and in the scheduling settings I set it to run on "Primary node only". The problem is that it throws an error like "Can't find path", meaning it is not able to locate the file on the primary node. Surprisingly, if I take the next FetchFile processor out of the process group (and out of load-balanced mode) and put it directly after ListFile, the error stops popping up, which also means that FetchFile will no longer work in distributed mode. Is this expected behavior, or am I missing something?
08-22-2016 12:46 PM
Hi all, in reference to the post https://community.hortonworks.com/questions/52015/fetch-file-from-the-ncm-of-a-nifi-cluster.html and according to what @Simon Elliston Ball explained there, I have simulated the scenario in my own cluster. I have three nodes, among which Node 1 is configured as the NCM and the other two are added as primary and secondary slaves in a simple master-slave mode. Against a shared directory location, I have used a ListFile processor and then fed the flow files site-to-site back into the same cluster, using input ports to load-balance a FetchFile processor followed by the rest of the flow. I have attached a .png file for reference.

Q1. Can anyone please confirm whether the described model should work correctly in terms of processing the load in a node-distributed fashion?

Q2. Will the above model still hold if the origin directory location can't be shared due to a business restriction and will only reside on the primary node? What would the desired site-to-site model be then?
Labels:
- Apache NiFi
08-17-2016 09:33 AM
@Matt Burgess It worked! I tried the grape install command from the command line and it popped up some messages stating that the jars were found in the library. Now I can see a .groovy/grapes folder under my home directory which has the appropriate jar files! I ran the scripts thereafter and they successfully updated the processor properties as expected. I guess it now resolves the missing jar file paths on its own since the install. However, I also noticed that whether the @Grab part is active or not no longer matters to the code. Much appreciated, and thanks a ton to you and Bryan for the most valuable inputs!
08-16-2016 11:22 AM
Hi Matt, thanks for your comments. I followed what you said: I downloaded the Apache Ivy jar "ivy-2.4.0.jar" from http://ant.apache.org/ivy/download.cgi, put it under the $NIFI_HOME/lib folder, and restarted the NiFi service. As you highlighted, I should be able to use the @Grab annotation now:

@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.7.1')

So in the script I have un-commented the @Grab part and ran it from the command line with groovysh. Now I'm getting an error after executing the script; the error is in the attached file error.txt. Is some configuration still missing? I read your blog post, which suggests it has something to do with the Grape cache and that we need to configure it correctly. I can see that there is no /grapes folder under the /groovy directory.
08-12-2016 04:58 AM
Hi, I've been trying to run a processor-update script for NiFi using Groovy. I am reusing the script below from GitHub.

import groovy.json.JsonBuilder
import groovyx.net.http.RESTClient
import static groovy.json.JsonOutput.prettyPrint
import static groovy.json.JsonOutput.toJson
import static groovyx.net.http.ContentType.JSON

//@Grab(group='org.codehaus.groovy.modules.http-builder',
//      module='http-builder',
//      version='0.7.1')

def processorName = 'Save File'
def host = 'localhost'
//def host = 'agrande-nifi-1'
def port = 9090
def nifi = new RESTClient("http://$host:$port/nifi-api/")

println 'Looking up a component to update...'
def resp = nifi.get(
    path: 'controller/search-results',
    query: [q: processorName]
)
assert resp.status == 200
assert resp.data.searchResultsDTO.processorResults.size() == 1
// println prettyPrint(toJson(resp.data))

def processorId = resp.data.searchResultsDTO.processorResults[0].id
def processGroup = resp.data.searchResultsDTO.processorResults[0].groupId
println "Found the component, id/group: $processorId/$processGroup"

println 'Preparing to update the flow state...'
resp = nifi.get(path: 'controller/revision')
assert resp.status == 200

// stop the processor before we can update it
println 'Stopping the processor to apply changes...'
def builder = new JsonBuilder()
builder {
    revision {
        clientId 'my awesome script'
        version resp.data.revision.version
    }
    processor {
        id "$processorId"
        state "STOPPED"
    }
}
resp = nifi.put(
    path: "controller/process-groups/$processGroup/processors/$processorId",
    body: builder.toPrettyString(),
    requestContentType: JSON
)
assert resp.status == 200

// create a partial JSON update doc
// TIP: don't name variables same as json keys, simplifies your life
builder {
    revision {
        clientId 'my awesome script'
        version resp.data.revision.version
    }
    processor {
        id "$processorId"
        config {
            properties {
                'Directory' '/tmp/staging'
                'Create Missing Directories' 'true'
            }
        }
    }
}
println "Updating processor...\n${builder.toPrettyString()}"
resp = nifi.put(
    path: "controller/process-groups/$processGroup/processors/$processorId",
    body: builder.toPrettyString(),
    requestContentType: JSON
)
assert resp.status == 200
println "Updated ok."
// println "Got this response back:"
// print prettyPrint(toJson(resp.data))

println 'Bringing the updated processor back online...'
builder {
    revision {
        clientId 'my awesome script'
        version resp.data.revision.version
    }
    processor {
        id "$processorId"
        state "RUNNING"
    }
}
resp = nifi.put(
    path: "controller/process-groups/$processGroup/processors/$processorId",
    body: builder.toPrettyString(),
    requestContentType: JSON
)
assert resp.status == 200
println 'Ok'

Except I have commented out the Maven dependency (@Grab) part. I have directly downloaded http-builder-0.6.jar and put it into Groovy's designated library directory to make the classes available. I have also downloaded and put json-lib.jar in the library. I'm still getting the exception below at runtime:

Caught: java.lang.NoClassDefFoundError: net/sf/json/JSONObject
        at groovyx.net.http.HTTPBuilder.<init>(HTTPBuilder.java:175)
        at groovyx.net.http.HTTPBuilder.<init>(HTTPBuilder.java:194)
        at groovyx.net.http.RESTClient.<init>(RESTClient.java:79)
        at Groovy_Sample.run(Groovy_Sample.groovy:13)
Caused by: java.lang.ClassNotFoundException: net.sf.json.JSONObject

I can't trace what is possibly going wrong, even with the jars in the library. I have read that setting the right version in the Maven dependency might help, but I don't use Maven; is there no way to do this without it? Or am I missing something with the jars and libraries?
Labels:
- Apache NiFi