Member since
07-30-2019
333
Posts
356
Kudos Received
76
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 9632 | 02-17-2017 10:58 PM |
| | 2193 | 02-16-2017 07:55 PM |
| | 7779 | 12-21-2016 06:24 PM |
| | 1695 | 12-20-2016 01:29 PM |
| | 1202 | 12-16-2016 01:21 PM |
11-19-2015
05:45 PM
There's no single place to make this change, because many components in the HDP stack have their own log locations, all defaulting to /var/log. The good news is that most of the components already create a subdirectory for their logs under /var/log. You could try symlinking /var/log/<component> to /var/log/hdp/<component> post-install (see the sketch below), but the downside is having to apply the change on every node and deviating from the defaults.
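If you do go the symlink route, here is a minimal sketch of the idea; the component name and the /var/log/hdp layout are illustrative examples, not HDP defaults, and it would need to run on every node after install but before services start:

```groovy
import java.nio.file.*

def component = 'hadoop'                          // illustrative component name
def oldDir = Paths.get("/var/log/$component")
def newDir = Paths.get("/var/log/hdp/$component")

Files.createDirectories(newDir.parent)
// move existing logs aside (skip if already a symlink from a previous run)
if (Files.isDirectory(oldDir, LinkOption.NOFOLLOW_LINKS)) {
    Files.move(oldDir, newDir)
}
if (Files.notExists(oldDir, LinkOption.NOFOLLOW_LINKS)) {
    Files.createSymbolicLink(oldDir, newDir)      // default path keeps working
}
```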
11-19-2015
03:26 PM
Yes, you could do that, as long as ports don't conflict.
11-18-2015
10:24 PM
2 Kudos
No, and I'm not sure you should. Two NCMs managing two clusters (nodes 1-3 and 4-6) is perfectly OK. The NCM (NiFi Cluster Manager) is a lightweight process and can be colocated with any of the processing nodes. You can also set up a site-to-site channel to have the two clusters communicate over a centralized data plane.
11-17-2015
04:53 PM
3 Kudos
JP, go for 2-3 typical enterprise-class servers to meet these numbers. A rough starting point is around 200 MB/s per node.
11-11-2015
01:40 PM
For a transparent resume feature, follow these JIRAs:

- https://issues.apache.org/jira/browse/NIFI-1149
- https://issues.apache.org/jira/browse/NIFI-1150
- https://issues.apache.org/jira/browse/NIFI-1151
11-11-2015
12:53 PM
1 Kudo
Jonas, this is a great fit for NiFi, for the following reasons:

- As you correctly assumed, network issues could be common. Building new systems around the ingest layer to handle retries, load balancing, security, compression, and encryption (because you want to transfer this data over an encrypted channel, don't you?) is far more than most teams want to take on. NiFi has those covered out of the box; link the nodes/clusters with the site-to-site protocol. See, e.g., some discussions at http://community.hortonworks.com/topics/site2site.html and the reference docs.
- Moving this data around usually comes with a requirement to maintain the chain of custody and keep track of the file at every step of the flow, which is, again, an inherent feature of NiFi.
11-10-2015
07:55 PM
11 Kudos
Today we will show how to interact with a NiFi instance to modify a flow at runtime via the REST API.

Pre-requisites
- NiFi installed and running on localhost: https://nifi.apache.org/download.html
- Groovy, because its JSON builders and REST DSLs are great. If you are on a Mac, the easiest way is to run brew install groovy. To install Homebrew (the superb Mac package manager), visit http://brew.sh/
- The full script is available on GitHub: https://github.com/aperepel/nifi-rest-api-tutorial
- NiFi REST API docs: https://nifi.apache.org/docs/nifi-docs/rest-api/index.html

Here's the test flow we will be working with today.

Prepare the test flow:
1. Add a PutFile processor to the canvas.
2. Rename the processor to Save File (right-click -> Configure -> Settings -> Name field). We will use this name to look up the processor later via the API.
3. Add a GetHTTP processor and create a connection from GetHTTP to Save File. GetHTTP settings can be ignored for now; the Save File processor simply needs an input connection.
4. Set the Save File properties as below (these settings will be modified programmatically next).
5. Start the Save File processor (no need to start GetHTTP for our purposes).

Note: for a more complex flow, one would use templates: https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#templates

Working with the API

Next, we will update the Save File processor to use a different directory (/tmp/staging) and set Create Missing Directories to true. High-level script flow:
1. Search the data flow for a component to operate on. The lookup term is 'Save File'. This is the same API used by the Search field in the UI.
2. Validate that only one processor is returned; we want to make sure we're modifying the expected one.
3. Sync up with the framework state: get the latest version field value, which will be used in the update statement next. This is a classic Optimistic Locking pattern implementation.
4. Build a small JSON document containing only the state changes.
5. Perform a partial update via a PUT operation.
6. Repeat steps 4-5 to stop the processor, update its configuration (change the directory and missing dirs properties), and start it again.

For the impatient among us, execute the script directly (clone/checkout the GitHub repo if you want to play with the code later):

```
groovy https://raw.githubusercontent.com/aperepel/nifi-rest-api-tutorial/master/reconfigure.groovy
```

You will see output similar to this:

```
Looking up a component to update...
Found the component, id/group: c35f1bb7-5add-427f-864a-bdd23bb4ac7f/f1a2c4e8-b106-4877-97d9-9dbca868fc16
Preparing to update the flow state...
Stopping the processor to apply changes...
Updating processor...
{
  "revision": {
    "clientId": "my awesome script",
    "version": 309
  },
  "processor": {
    "id": "c35f1bb7-5add-427f-864a-bdd23bb4ac7f",
    "config": {
      "properties": {
        "Directory": "/tmp/staging",
        "Create Missing Directories": "true"
      }
    }
  }
}
Updated ok.
Bringing the updated processor back online...
Ok
```

If you check the NiFi processor again, you will see the updated Directory and Create Missing Directories properties. Additionally, every step has been captured and recorded in the flow history. When you see a warning message in the UI, simply hit the Refresh link right next to it; I will explain the concurrency controls at the end of this article.

Code Walkthrough

First, we pull in a dependency, the http-builder RESTClient ( https://github.com/jgritman/httpbuilder/wiki/RESTClient ). It is available in a public Maven repository and is fetched automatically:

```groovy
@Grab(group='org.codehaus.groovy.modules.http-builder',
      module='http-builder',
      version='0.7.1')
```
This allows us to use a nice REST DSL like this:

```groovy
nifi.get(
    path: 'controller/search-results',
    query: [q: processorName]
)

nifi.put(
    path: "controller/process-groups/$processGroup/processors/$processorId",
    body: builder.toPrettyString(),
    requestContentType: JSON
)
```
Next, we use Groovy's JSON builder to construct a JSON document for a partial PUT update, i.e. we only specify the properties we want to change:

```groovy
builder {
    revision {
        clientId 'my awesome script'
        version resp.data.revision.version
    }
    processor {
        id "$processorId"
        config {
            properties {
                'Directory' '/tmp/staging'
                'Create Missing Directories' 'true'
            }
        }
    }
}
```
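The builder here is a plain groovy.json.JsonBuilder. The stop/start round-trips of step 6 reuse the same document shape with a state field instead of config. A hedged sketch, reusing the names from the lookup sketch above and assuming the 0.x ProcessorDTO accepted a top-level state of RUNNING/STOPPED:

```groovy
import groovy.json.JsonBuilder
import static groovyx.net.http.ContentType.JSON

def builder = new JsonBuilder()
builder {
    revision {
        clientId 'my awesome script'
        version currentVersion        // latest value from controller/revision
    }
    processor {
        id "$processorId"
        state 'STOPPED'               // 'RUNNING' brings it back online
    }
}

nifi.put(
    path: "controller/process-groups/$processGroup/processors/$processorId",
    body: builder.toPrettyString(),
    requestContentType: JSON
)
```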
Those dot-notation variables navigate the JSON document tree from a previous response. To understand how to structure the update, start by issuing a GET request against your processor, which will fetch a complete state document.

Tip: the UI does everything through the REST API, so it's a great interactive learning tool in itself. One note, though: the UI interchangeably leverages both PUT and POST (form) requests, so choose whichever is more convenient. In this write-up we use PUT with JSON.

Finally, the clientId and version business is explained in the next section.

Optimistic Locking in NiFi

The diagram below describes the concept. Supplying a clientId is required for update operations to avoid running into consistency issues (otherwise the API responds with a 409 Conflict status code, which can be really confusing if a developer doesn't know about this attribute). controller/revision returns, among other things, the clientId of the user who last modified the flow. This is NOT always your id; the best practice is to supply your own unique value to identify the client. It's actually a free-form value; a UUID is just the default the framework generates for you if one is missing.
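To make the pattern concrete, here's a hedged retry sketch. The 409 status and the controller/revision endpoint are as described above, while the updateProcessor closure is a hypothetical stand-in for any of the PUT calls from the walkthrough:

```groovy
import groovyx.net.http.HttpResponseException

// On a 409 Conflict someone else changed the flow between our read and write:
// re-read the latest revision and retry the update once with the fresh version.
def updateWithRetry(nifi, Closure updateProcessor) {
    def version = nifi.get(path: 'controller/revision').data.revision.version
    try {
        updateProcessor(version)
    } catch (HttpResponseException e) {
        if (e.statusCode != 409) throw e
        version = nifi.get(path: 'controller/revision').data.revision.version
        updateProcessor(version)
    }
}
```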
11-09-2015
01:51 PM
It will vary based on the nature of the data, the queries, and the node specs. Generally speaking, the HiveServer2 endpoint can now be clustered and scaled out horizontally, and users can be load-balanced across this farm. It's not so much about creating a perfectly sized single instance, but rather about having a good starting point (e.g. from that article), employing consistent ways of monitoring the process and the node, and experimenting with a specific cluster deployment.
11-09-2015
12:59 PM
I strongly advise against running everything as a single user. There are separate accounts for controlling the infrastructure and for data access; mashing them together only widens the attack surface and basically throws security out the window. If the drive was to 'simplify' deployment and side-step corporate policies (and the process) of creating new accounts, please reconsider.
11-06-2015
07:39 PM
2 Kudos
You need Kerberos if you're serious about security. AD/LDAP will cover only a fraction of the components; many other systems (for example Storm, Kafka, Solr, Spark) will require Kerberos for identity. One can still keep users in LDAP, but the first line in the infrastructure will be Kerberos.