Member since: 07-30-2019
Posts: 333
Kudos Received: 356
Solutions: 76
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 9880 | 02-17-2017 10:58 PM |
| | 2301 | 02-16-2017 07:55 PM |
| | 7987 | 12-21-2016 06:24 PM |
| | 1759 | 12-20-2016 01:29 PM |
| | 1235 | 12-16-2016 01:21 PM |
10-05-2015
01:31 PM
1 Kudo
It's a known issue with Ambari 2.1.1. When performing this rolling upgrade of HDP, make sure you're using Ambari 2.1.2.
10-03-2015
03:19 PM
Thanks Joe. As I understand it, in this scenario we could leave the provenance and flowfile repositories on local disks (regular application-server sizing), but for content we could mount a big fat SAN/NAS/you-name-it and configure HDF to point to that. Are expiration policies configurable per repository in that case?
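For reference, repository locations are set independently in nifi.properties, so the split described above (provenance/flowfile on local disks, content on a big mount) can be expressed directly; a sketch, with hypothetical mount paths:

```properties
# nifi.properties -- illustrative paths only, adjust to your environment
# FlowFile and provenance repositories stay on local disks
nifi.flowfile.repository.directory=./flowfile_repository
nifi.provenance.repository.directory.default=./provenance_repository
# Content repository points at the large SAN/NAS mount
nifi.content.repository.directory.default=/mnt/bigstore/content_repository
```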
10-03-2015
01:10 PM
Hi, what are the recommended approaches for handling the following scenario? NiFi is ingesting lots of files (say, pulling from a remote system into the flow), and we care about each file as a whole only, so the flowfile content is the file itself; no further splits or row-by-row processing. File sizes can vary from a few MBs to GBs, which is not the problem, but what happens when millions of files are ingested this way? Say they end up in HDFS in the dataflow. Given that file content is recorded in the content repository to enable data provenance, disk space may become an issue. Is there any way to control this purge/expiration at a more fine-grained level than the instance-wide journal settings?
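For context, the instance-wide retention knobs in question live in nifi.properties; a hedged sketch of the content-archive settings (values are examples only):

```properties
# nifi.properties -- example values
# Archived content is purged once EITHER limit below is reached
nifi.content.repository.archive.enabled=true
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
```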
Labels:
- Apache NiFi
- Cloudera DataFlow (CDF)
10-01-2015
11:31 PM
1 Kudo
Are you referring to the hadoop-policy section in core-site and hdfs-site? Those settings do not control security the way you'd expect. For proper ACLs on HDFS, do one of the following:
- Secure (Kerberize) your cluster; Ambari automates this.
- Add Ranger and enable HDFS policies.
- If accessing via the REST API (WebHDFS), restrict direct DataNode access with a firewall and allow access only via Knox. Knox, in turn, can map an incoming user to an actual role (full control with auditing will still require adding Ranger).

Andrew
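As an illustration of the WebHDFS-via-Knox pattern, a request routed through the gateway might look like this (gateway host, topology name, and credentials are placeholders):

```shell
# List /tmp via WebHDFS, going through the Knox gateway
# instead of hitting the NameNode/DataNodes directly
curl -ku guest:guest-password \
  "https://knox-host:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS"
```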
10-01-2015
01:46 PM
1 Kudo
Hi All, The use case is a Banana dashboard working with a SolrCloud instance/cluster. If we follow the default steps, we end up using the 'data_driven_schema' in Solr, which makes it easy for Solr to accept any random data and try to index it. The problem comes later, though: Banana's table widget can't sort on many columns because Solr complains that those fields are multi-valued. In fact they are single-valued and unique (checked via the admin section), but they are declared multi-valued. What is the approach to address this, ideally without having to specify a completely new schema for the Solr index? Can one keep the benefit of flexible fields but default to non-multi-valued, maybe?
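One way to keep the data-driven config while correcting individual fields is the Schema API; for example, redefining a guessed field as single-valued (the collection name, field name, and type here are hypothetical):

```shell
# Replace a guessed multi-valued field definition with a single-valued one
curl -X POST -H 'Content-type:application/json' \
  "http://localhost:8983/solr/mycollection/schema" \
  -d '{"replace-field":{"name":"price","type":"tdouble","multiValued":false}}'
```

Note that documents already indexed under the old definition would generally need to be reindexed for the change to take effect.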
Labels:
- Apache Solr
09-29-2015
05:16 PM
Thanks Mark, that was a great and exhaustive answer (I'm thinking of how to express it in a slide for an advanced-level deck). I guess control-plane HA for the NCM itself is the next bastion; it will probably require some changes on the client side (e.g. specifying a list of failover NCM nodes to cycle through) as well as some UI updates to support it. Is my understanding correct?
09-29-2015
04:39 PM
2 Kudos
I highly recommend Knox's shell, which uses a DSL for those operations: http://knox.apache.org/books/knox-0-6-0/user-guide.html#WebHDFS It's a great way to programmatically interact with a cluster in a controlled and audited manner (a simpler DSL and a secured gateway endpoint, with no need to open every node's ports). BTW, it's a Groovy DSL, which makes it trivial to run from any Java program.
09-29-2015
04:08 PM
Hi, I'd like to understand how the receiving end of the site-to-site protocol works. The sending side drops a remote process group on the canvas and is mostly done. The receiving side is simple in the case of a single NiFi node. In a cluster, though, we still need to specify an FQDN to connect to. What is the best practice there? If we put a load balancer in front, would it break batch affinity (when site-to-site batches sends for you)?
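For reference, the receiving side enables site-to-site per node in nifi.properties; a sketch with placeholder host/port values:

```properties
# nifi.properties on each receiving node -- example values
nifi.remote.input.socket.host=node1.example.com
nifi.remote.input.socket.port=10000
nifi.remote.input.secure=true
```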
Labels:
- Apache NiFi
- Cloudera DataFlow (CDF)
09-29-2015
01:49 PM
3 Kudos
The HDFS Balancer can run in the background, and the bandwidth it consumes is controllable. In general, on a large cluster it can run continuously, but running it after adding new nodes is a must for a healthy system. Note that on large clusters a single convergence run can take a full day or more (that shouldn't scare you away, though); let it run. Also, some customers reported a more stable experience when adding nodes in small batches of a few at a time instead of, for example, adding a full rack at once.
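The bandwidth cap and the balancer run mentioned above look roughly like this on the command line (the threshold and bandwidth values are examples, not recommendations):

```shell
# Cap balancer traffic per DataNode (value is in bytes/sec; here ~10 MB/s)
hdfs dfsadmin -setBalancerBandwidth 10485760
# Run the balancer until DataNode utilization is within 10% of the cluster average
hdfs balancer -threshold 10
```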
09-28-2015
09:45 PM
3 Kudos
I would highly recommend against re-using another ZK quorum for this purpose. The risk of network partitioning is too high, and the benefits aren't clear. As David mentions above, the NN doesn't put a high load on ZK for leader election. Have each NN HA pair (cluster, for that matter) talk to its own ZK quorum within the same network segment.
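Concretely, each cluster's failover controllers are pointed at their own quorum via core-site.xml (hostnames below are placeholders):

```xml
<!-- core-site.xml: a dedicated ZK quorum for this cluster's NN HA -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```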