Member since: 01-09-2017
Posts: 33
Kudos Received: 0
Solutions: 3
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1226 | 03-22-2019 01:45 PM |
 | 308 | 08-22-2018 03:00 PM |
 | 355 | 03-12-2018 04:45 PM |
03-22-2019
01:45 PM
If you need atomic sequencing and still want to use a parallel system, you are going to have to push that sequencing off onto a system capable of atomic sequencing. Probably the easiest way is to write a stored procedure with a transaction in an RDBMS and call it with an ExecuteSQL in your flow. Don't use a cache; as Matt Clarke says, caches are not designed for transactional, atomic work. Only use a cache for actual caching (you can get the value, but it's expensive, so it's cheaper to store it for a bit in the cache).
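To make that concrete, here is a rough sketch of the kind of atomic read-and-increment the stored procedure (or single statement) would wrap. The psycopg2 driver, the DSN, and the flow_sequence table are placeholders for illustration; in the flow itself the statement would be issued from ExecuteSQL.

```python
# Rough sketch (not the NiFi flow itself) of the atomic increment the
# RDBMS performs. psycopg2, the DSN, and the flow_sequence table are
# placeholder assumptions.
import psycopg2

conn = psycopg2.connect("dbname=flows user=nifi")  # placeholder DSN


def next_sequence_value():
    # One transaction: psycopg2 commits on clean exit, rolls back on error.
    with conn:
        with conn.cursor() as cur:
            # A single atomic statement: increment and return the new value.
            cur.execute(
                "UPDATE flow_sequence SET last_value = last_value + 1 "
                "RETURNING last_value"
            )
            return cur.fetchone()[0]


print(next_sequence_value())
```

The point is that the read-and-increment happens in one statement inside one transaction, so the database, not NiFi, guarantees the ordering.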
01-31-2019
07:31 PM
Hey Matt, yeah, that's what I said 🙂 Adam, I think your approach is wrong. If you are trying to get one flowfile to appear on each node, just have each node get the flowfile. If you want to send flowfiles between nodes, use S2S or rebalancing. You are reinventing the wheel here.
01-31-2019
01:44 AM
One thing I tell people is to always put a ControlRate in front of PutEmail. If you don't, you will eventually send yourself 100,000 emails in one second and get your NiFi boxes blacklisted from your SMTP server 🙂. Actually, the best solution in this case might be to feed PutEmail with a MonitorActivity. This is a tough problem, and I am skeptical that there is a practically generalizable solution. Right now I feel that monitoring and alerting need to be flow-specific, but I am interested to see what others are doing.
01-30-2019
09:46 PM
I never got a good answer to this, but after talking to other folks it seems like one solution is to have parallel dev, test, and prod clusters for each upgrade, which is not very fun, especially with frequent HDF/NiFi releases. Another solution would be to have a prod-QA environment and test there first, then upgrade prod, test, and dev in that order; that way you are never pushing newer NiFi flows to an older NiFi version.
01-30-2019
09:28 PM
You'd have to look at the source, but it's been my experience that ${hostname()} gives you the same value as if you ran the Unix command 'hostname'. Whether NiFi is listening there just depends on how you have configured other things. I imagine that the node name you get back from the API is the nifi.web.http[s].host NiFi property. But it sounds like you are trying to do something that would be better served by just running a completely independent flow on each node.
01-22-2019
07:09 PM
@Matt Clarke laid it all out correctly, but I would add that usually the external system you are connecting to (it sounds like you are talking about Kafka) has its own authentication and authorization concepts, so that is where the permissioning happens.
01-03-2019
03:54 PM
We have multiple NiFi users and separate dev, test, and prod NiFi instances. Our users develop flows in dev and move them to test and prod via a central flow registry instance. If we upgrade the lower environments before the higher ones, a user can develop flows that will not instantiate on the higher environment because they were developed with a newer version of NiFi. But we want to upgrade the lower environments first so issues can surface in dev and test before we upgrade prod. My questions are: What are the actual between-version flow compatibility goals of NiFi? Are we guaranteed backward compatibility but not forward? What is a good strategy for handling NiFi upgrades in a traditional dev -> test -> prod configuration, while maintaining availability and enabling a mostly painless SDLC?
11-26-2018
05:29 PM
No! If you are trying to escape input to generate a SQL query, you should never roll your own sanitization. Unless you fully trust the input, THIS IS VULNERABLE TO SQL INJECTION! You should be using the '?' parameter substitution in your PutSQL processor.
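To illustrate the difference, here is a minimal sketch in plain Python DB-API terms; sqlite3 and the users table are just stand-ins for whatever database you are actually hitting. PutSQL's '?' placeholders work the same way, bound from the sql.args.N.type / sql.args.N.value flowfile attributes.

```python
# Minimal sketch of why '?' binding matters; sqlite3 and the users table
# are stand-ins for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

user_input = "Robert'); DROP TABLE users;--"

# DANGEROUS: concatenation lets the input rewrite the statement itself.
# query = "INSERT INTO users (name) VALUES ('" + user_input + "')"

# Safe: the driver binds the value, so it can never become SQL syntax.
conn.execute("INSERT INTO users (name) VALUES (?)", (user_input,))
conn.commit()

print(conn.execute("SELECT name FROM users").fetchall())
```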
10-23-2018
07:46 PM
I am going to try the "Component ID to Exclude" property of SiteToSiteProvenanceReportingTask and just list the UUID of every processor in the part of my flow that writes the events to HDFS. It will just be a hassle to update all the UUIDs if I end up with a big flow.
10-23-2018
06:00 PM
We have a need to archive provenance data indefinitely; a simple file-based archive in HDFS would meet our needs. From reading questions on this site, I know we can use the NiFi site-to-site reporting task to send provenance events as flowfiles via NiFi site-to-site. The obvious but probably wrong solution would be to point the reporting task at its own NiFi cluster, catch the flowfiles, and do a MergeContent -> PutHDFS. This is probably wrong because that flow itself would generate more provenance events... which would generate more provenance events... forever. I'd really like to avoid the administrative burden of running another NiFi instance, even a MiNiFi instance. Has anyone come up with a good solution for archiving provenance without using another NiFi cluster?
10-04-2018
02:52 PM
Maybe I should make a new question for this, but in my organization I have a lot of people asking me if they can start/stop their processors via the REST API. Of course they can, but I view that as an antipattern. Once you have things modifying your flow out of band, your NiFi flow is no longer the only place where logic is stored. Also, mass start/stop of a process group suddenly becomes dangerous, since some processors were perhaps intended to stay stopped. On a tiny single-purpose NiFi instance this is probably not a problem, but in a large multi-tenant environment it can be. You can get run-once or ad hoc functionality by signalling your flow to start with an in-band semaphore file or message or something like that. You can get scheduling through processor settings. You can get appropriate throttling of flows during downstream failures by sizing your queues appropriately. In my mind there is no reason to use out-of-band REST API start/stop.
08-22-2018
03:00 PM
I finally found the policy by looking at https://github.com/apache/nifi/pull/2703/files. It is /provenance-data/<component-type>/<component-UUID>.
08-22-2018
02:37 PM
The HDF 3.2 release notes mention that provenance and data access policies have been separated in NiFi 1.7. The release notes do not mention what resource identifiers should be entered in Ranger to give users access to provenance, and the NiFi release notes are similarly unhelpful. My Ranger NiFi resource identifiers for data look like /data/process-groups/<uuid>, but /provenance/process-groups/<uuid> appears not to work. I would like to give people access to view any provenance events associated with components underneath a certain process group. What Ranger NiFi resource identifier should I use?
07-12-2018
07:50 PM
@Matt Clarke Thanks for that info!
07-11-2018
06:20 PM
@Matt Clarke If I have disconnected a node (say, to drain it for maintenance), I don't want my upstream sources to keep posting data to it. S2S ports already have this behavior and close on disconnect. If a node is disconnected, I also don't want web UI user sessions to hit that node. At best the users will be very confused, and at worst they will make changes to the flow on that node that will cause problems when I reconnect it.
07-11-2018
04:17 PM
You might want a load balancer if: you use any of the ListenTCP/ListenHTTP/etc. processors; you want to spread your user interface activity across nodes (you'll need to pin sessions to nodes for this); or you don't want to hardcode all your node hostnames into RPGs (though there are still issues with this). A big problem with configuring your load balancer is that when a node is disconnected it continues to listen on the NiFi web UI / REST API port. You will have to write some external healthcheck that authenticates to NiFi and gets the actual node status.
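As a rough sketch of such a healthcheck: the hostnames, credentials, and CA bundle path below are placeholders, and it assumes username/password token login (POST /access/token) plus the GET /controller/cluster endpoint.

```python
# Rough sketch of an external healthcheck the load balancer could call.
# Hostnames, credentials, and the CA bundle path are placeholders.
import sys

import requests

NIFI = "https://nifi-node-1.example.com:9443"
NODE_ADDRESS = "nifi-node-1.example.com"
CA_BUNDLE = "/etc/pki/tls/certs/ca-bundle.crt"

# Authenticate and grab a bearer token.
token = requests.post(
    f"{NIFI}/nifi-api/access/token",
    data={"username": "healthcheck", "password": "secret"},
    verify=CA_BUNDLE,
).text

# Ask the cluster for the status of every node.
cluster = requests.get(
    f"{NIFI}/nifi-api/controller/cluster",
    headers={"Authorization": f"Bearer {token}"},
    verify=CA_BUNDLE,
).json()

# Exit non-zero unless this node reports CONNECTED, so the load balancer
# pulls it out of rotation even though the port is still listening.
node = next(n for n in cluster["cluster"]["nodes"] if n["address"] == NODE_ADDRESS)
sys.exit(0 if node["status"] == "CONNECTED" else 1)
```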
07-09-2018
01:53 PM
The solution we use is not perfect. Every project (tenant group) gets an input port that admins create on the root flow, which we route into their PG. To prevent one tenant from accidentally writing onto someone else's input port, we recommend they add a secret value attribute to their outgoing flowfiles and check for it via RouteOnAttribute upon receiving flowfiles in their PG.
06-21-2018
09:29 PM
I am having a hard time understanding how to build a flow for a simple use case: doing an incremental fetch from a REST API. I just need to store the last index I have retrieved and use that index as the starting index for my next fetch. I can think of a few ways of doing this, but all of them seem to have problems. Here are my ideas and thoughts; how are other people tackling this use case?

1. UpdateAttribute stored state. Use UpdateAttribute with stored state to store my index. Problem: UpdateAttribute only stores state locally, so if my node goes down or there is a primary node switch, that's a problem.

2. Store state in flowfiles. Loop the output of my GetHTTP through some UpdateAttribute stages that do something like ${new_beginning}=${last_end}. Problem: the node with my state-storing flowfiles could go down, and there is no way to "drain" a node if it's got these long-lived flowfiles.

3. Store state in an external RDBMS. Just store my index in some external database (roughly as in the sketch below). Problem: none really, just an extra burden and dependency.

4. Distributed cache controllers. Talk to a controller service that stores state. Problem: from researching, I find that DistributedMapCache isn't actually cluster-wide; do I understand that correctly?

Am I missing anything? How are others solving this use case, which I imagine is very common?
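For reference, here is the rough shape of option 3; the table, column names, DSN, and use of psycopg2 are all made up for illustration, with the equivalent statements issued from the flow (e.g., via ExecuteSQL).

```python
# Rough sketch of option 3 (state in an external RDBMS); table, columns,
# DSN, and psycopg2 are placeholder assumptions.
import psycopg2

conn = psycopg2.connect("dbname=flows user=nifi")  # placeholder DSN


def get_last_index(feed):
    # Read the stored index for this feed, defaulting to 0 on first run.
    with conn, conn.cursor() as cur:
        cur.execute("SELECT last_index FROM fetch_state WHERE feed = %s", (feed,))
        row = cur.fetchone()
        return row[0] if row else 0


def save_last_index(feed, index):
    # Upsert the new index inside a transaction.
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO fetch_state (feed, last_index) VALUES (%s, %s) "
            "ON CONFLICT (feed) DO UPDATE SET last_index = EXCLUDED.last_index",
            (feed, index),
        )


start = get_last_index("my_rest_feed")
# ... fetch from the REST API starting at `start`, then record where we stopped:
save_last_index("my_rest_feed", start + 100)
```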
05-29-2018
08:13 PM
The background: I have multi-tenant clusters. My organization has delivered me a combined truststore and keystore JKS file. I will have users who need to hit various external-to-NiFi services using SSL/TLS. It is tempting to create an SSLContext controller service at the NiFi root and let all my users use this service when they need SSL. One problem I see with this approach is that if I let all my users use the host's certs (keystore and truststore), they could just use a GetHTTP processor and talk to the NiFi REST API with full privileges (or at least whatever privileges the node has). So to prevent this, I figure I should get my root CA certs into a separate truststore that is not password protected, and only use that. The question: Should I create an SSLContext service at the root flow with only a truststore and ask all my users to use the same controller service? Or should I just tell my users the path to the truststore and have them create controllers within their process groups as needed? What are the pros and cons of each approach?
05-09-2018
05:31 PM
thanks @Matt Clarke, what would we do without you!?
05-09-2018
04:59 PM
Thanks @Matt Clarke, but if processor state alone cannot be used to handle primary node changes, how do processors like GenerateTableFetch work without a DistributedMapCache service? Both ListSFTP and GenerateTableFetch mention in their docs that they store cluster-scoped state, but only ListSFTP can also make use of a cache service. What am I missing here?
05-09-2018
04:36 PM
I just noticed that ListSFTP can use a distributed cache controller. This is confusing to me because I thought we were supposed to run ListSFTP only on the primary node and rebalance filenames via an S2S RPG. In addition to the distributed cache, it also seems to store state, which is confusing: if we use a distributed cache controller, why would ListSFTP need to store state? What is the current best practice for resilient, parallelized SFTP? If I use a distributed cache, does that mean I can just schedule my ListSFTP to run on all nodes? Can someone help me understand what is going on here? Thanks!
03-12-2018
04:45 PM
@MattClarke answered this question in https://community.hortonworks.com/questions/176292/how-to-configure-managed-ranger-authorizer-for-nif.html
02-20-2018
03:45 PM
HDF 3.1 includes NiFi 1.5, and the release notes mention that external LDAP groups can now be used in NiFi security policies in Ranger. It seems that we need to use org.apache.nifi.authorization.ManagedRangerAuthorizer in authorizers.xml, but I cannot find any documentation on this. Has anyone successfully used LDAP groups for Ranger NiFi policies? And is there any documentation? Thanks. PS: I see @Yolanda M. Davis in the NiFi git history for this feature; perhaps she can help?
01-29-2018
03:48 PM
I ended up just using two ReplaceText processors to escape special chars with \, then replace 0x01 with commas. I'd still like to know if there is a way to enter hex characters though.
01-24-2018
05:29 PM
I have CSV files that are delimited with ASCII 0x01 bytes. I want to use a CSVReader controller to read these records and convert them to other formats. My problem is that I don't know how to enter a 0x01 byte in the CSVReader "Value Separator" property. Is there a standard way to do hex escapes in NiFi properties? This property does not support the NiFi expression language. Thanks!
11-27-2017
04:27 PM
I will accept the answer because it seems this might be an issue on my side. Thank you. I will open up a ticket with Hortonworks support if further troubleshooting is needed after I take a look at the logs. Thanks again.
11-20-2017
06:30 PM
@Ashutosh Mestry Thanks Ashutosh. When I run the query "hive_table where db.name like '*_final'" I get an error in the web UI: Gremlin script execution failed: L:{def r=(([]) as Set);def f1={GremlinPipeline x->x.as('a0').out('__hive_table.db').as('__res') [0..<25].select(['a0', '__res']).fill(r)};f1(g.V().has('__typeName','hive_table'));f1(g.V().has('__superTypeNames','hive_table'));r._().as('__tmp').transform({((Row)it).getColumn('a0')}).as('a0').back('__tmp').transform({((Row)it).getColumn('__res')}).as('__res').filter({it.'Asset.name'.matches('.*_final')}).back('a0') [0..<25].toList()} We are running Atlas 0.8.0.2; perhaps 'like' clauses are unsupported in our version? I can use an equals sign and successfully retrieve tables in a certain database. Do you know of a way to get the same information with a basic query?
11-17-2017
04:19 PM
I wish to programmatically query Atlas for a list of Hive tables that are in certain Hive databases. I only want to see Hive tables that are in databases containing a certain string in their name. In the hive_table Atlas type, the db property is a reference to an entity of type hive_db, so I cannot use a simple where clause. For example, pretend I have many Hive databases, some ending with '_temp' and some with '_final'. Each database may have several tables. I want to generate a list of all Hive tables in databases that end with '_final'. I would also like to exclude Hive tables that have been deleted. I have been experimenting with the /api/atlas/discovery/search/dsl REST endpoint, but I have had no success. There is documentation for the DSL at http://atlas.apache.org/Search.html, but it is very esoteric and I cannot figure out how to use it. Does anyone have examples of returning lists of entities in Atlas based on properties of referred-to entities? Is there a more user-friendly or complete source of documentation for the Atlas query DSL? Also note that I do not wish to query the Hive metastore directly; I wish to use Atlas. Thank you for any help!
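For reference, here is roughly the shape of the call I have been attempting; the host, credentials, and the DSL string itself are placeholders, and I am not sure the query is even valid.

```python
# Rough shape of the Atlas DSL call being attempted; host, credentials,
# and the DSL string are placeholders and may well be wrong.
import requests

ATLAS = "http://atlas.example.com:21000"
AUTH = ("admin", "admin")  # placeholder credentials

# Attempted DSL: hive_table entities whose database name ends in '_final'.
dsl = "hive_table where db.name like '*_final'"

resp = requests.get(
    f"{ATLAS}/api/atlas/discovery/search/dsl",
    params={"query": dsl},
    auth=AUTH,
)
resp.raise_for_status()

for entity in resp.json().get("results", []):
    print(entity)
```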
09-26-2017
03:20 PM
I also am unable to browse hortonworks.jira.com and would like to be able to.