12-27-2016
05:42 PM
4 Kudos
Properly Size Index

Understanding what to index often requires deep business domain expertise on the data. This yields a better indexing strategy and more accurate search results. Not all data needs to be indexed, but an organization that is acquiring brand new data has to index all of it until it understands what value the data brings to the business. This means the data will later need to be re-indexed, so it is good practice to store the raw data somewhere cheap, often in HDFS or in cloud object storage.

Tuning for Speed

This starts with monitoring QTime. It provides performance metrics on how fast the request was received, parsed and searched. It does not include the time it took to send the response back to the client, which depends heavily on how big the payload is and how fast the network I/O is.

Sorting works best with short-valued properties like price, age, etc., but not on tokenized values like date, long field type values and others.

For range queries, use trie field types. Otherwise, avoid them.

For near-realtime search, soft commits are recommended since they make recently indexed data available in memory. A good soft commit interval is 15 seconds. A hard commit, on the other hand, is more for durability: the index goes to disk first, then memory. A hard commit interval of 60 seconds is a good value to keep the transaction logs from getting out of hand. The longer the commit interval, the slower the server restart will be, because more of the transaction log has to be replayed.

Parallel SQL is very slow and should only be used for batch-type searching. It works very similarly to how MapReduce works in Hadoop. The only value it brings is in querying index data from multiple collections using SQL syntax.

Oversharding can be used for performance reasons, where all machines have shards for a specific replica. This can be very helpful with a very large data set.
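The QTime monitoring described above can be sketched roughly as follows. This is only an illustration: run_query is a hypothetical callable standing in for whatever HTTP client issues the request, and the canned response is made up. The only Solr detail relied on is that the JSON response carries QTime (in milliseconds) inside responseHeader.

```python
import json
import time
from typing import Callable


def timed_query(run_query: Callable[[], str]) -> dict:
    """Run a query and compare Solr's QTime with the wall-clock time.

    run_query is a hypothetical callable that returns the raw JSON body
    of a Solr response (for example, the text of an HTTP GET on /select).
    """
    start = time.perf_counter()
    body = run_query()
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    # Solr reports QTime (milliseconds) inside the responseHeader object.
    qtime_ms = json.loads(body)["responseHeader"]["QTime"]

    return {
        "qtime_ms": qtime_ms,
        "elapsed_ms": elapsed_ms,
        # What is left over is mostly response writing and network transfer.
        "overhead_ms": max(elapsed_ms - qtime_ms, 0.0),
    }


# Canned response standing in for a real Solr reply (illustrative values).
sample = (
    '{"responseHeader": {"status": 0, "QTime": 12}, '
    '"response": {"numFound": 0, "docs": []}}'
)
stats = timed_query(lambda: sample)
print(stats["qtime_ms"], stats["overhead_ms"])
```

A large gap between elapsed_ms and qtime_ms points at payload size or network I/O rather than at the search itself.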
If the size of the index is smaller than the available memory of the Solr cluster, it is possible to load all of it into the OS cache by reading every index file recursively (for example with cat redirected to /dev/null; a touch command on its own only updates timestamps and does not read file contents).

Sizing Hardware

There are several factors that influence the hardware configuration:

# of documents
frequency of data updates
# of requests per second
average size of document
# of features that impact heap consumption

# of documents: Storage bound first, then memory. Depending on how many fields are scored, this can consume a lot of memory.

frequency of data updates: CPU bound first, then I/O. The CPU impact is big due to the deserialization cost. It affects memory management as well. It will require a decent heap size.

# of requests per second: CPU bound.

average size of document: Storage bound. The # of terms also incurs heap overhead. A rule of thumb is that the raw to index ratio is typically 5:1.

# of features that impact heap consumption: Memory bound. Heavy usage of facets and sorts will require a good amount of memory. A single facet query can bring a cluster down. Facets can drive the cost of the hardware. This requires a further understanding of how facets are used so that queries can be designed properly.

Having the right data always matters more than the scoring algorithm. Get hits first, then score.
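As a back-of-the-envelope illustration of the 5:1 rule of thumb, here is a small sketch. It reads the ratio as raw size divided by index size, which is an assumption about the intended meaning; the function names, the heap reservation and the sample numbers are all hypothetical.

```python
def estimate_index_gb(num_docs: int, avg_doc_kb: float,
                      raw_to_index_ratio: float = 5.0) -> float:
    """Estimate on-disk index size from document count and average size.

    Assumes the rule of thumb means raw_size / index_size == ratio.
    """
    raw_gb = num_docs * avg_doc_kb / (1024 * 1024)  # KB -> GB
    return raw_gb / raw_to_index_ratio


def fits_in_os_cache(index_gb: float, total_ram_gb: float,
                     heap_gb: float) -> bool:
    """Check whether the index could sit in memory left over after the JVM heap."""
    return index_gb <= total_ram_gb - heap_gb


# Illustrative numbers only: 100 million documents of about 2 KB each.
index_gb = estimate_index_gb(100_000_000, 2.0)
print(round(index_gb, 1))
print(fits_in_os_cache(index_gb, total_ram_gb=128, heap_gb=16))
```

Real ratios vary with stored fields, analyzers and the number of indexed terms, so a measured sample index is a better basis than any rule of thumb.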
12-08-2015
01:07 AM
5 Kudos
Q: What are the use cases for Centrify?
A: Integrating AD with Linux where no local users are deployed, similar to how SSSD is configured, and where the cluster and/or the number of domains is big enough that it is hard to manage with SSSD. Centrify greatly simplifies the management of this type of environment. Centrify ldapproxy also abstracts the complexity around integrating with other filers/appliances like Isilon, NetApp, etc. ldapproxy fronts Hadoop and doesn't expose the internals of AD; it only gives information about a zone. It is usually used for machine-to-machine authentication.

Q: If there are multiple domains in a forest, how does Centrify know which domain controller to use to authenticate a user?
A: Centrify walks the forest tree and figures out which domain controller to use to authenticate the user. It utilizes the Domain Controller Service to perform this action. It doesn't use the krb5.conf file. The Centrify agent knows which forest or domain controller it belongs to. It is PAM and site aware and its base authentication mechanism is Kerberos. It builds an index of Domain Controllers and DNS Servers and tags them based on their response time. Based on this information, the agent knows if a particular DC or DNS server has issues and will not use it. DNS must be set up properly, including reverse lookup, for Centrify to work. The agents support authenticating the same user that exists across different domains.

Q: What happens when Centrify agents fail?
A: Centrify does not store AD information and there's no such thing as a policy server. It completely leverages the AD infrastructure to scale out. Centrify DirectControl (CDC) watches all Centrify agents and restarts them when they fail. See below diagram for reference. If all else fails, Centrify can fall back to NTLM if it needs to. For example, some users don't have Kerberos enabled on their laptops due to inherent issues and have to resort to NTLM.

Q: What's the best practice for laying out the Centrify policies?
A: The basic building block of Centrify policies is the zone. A zone is how Centrify organizes the data inside of AD. A zone is a unit of cluster. The data is essentially user information, unix group information, unix computer information, role-based access control and more. The reason this is done is the Service Connection Point. A Service Connection Point is a multi-diag object that's been available since Windows 2003, and Centrify links it back to the real AD object. This provides flexibility on naming conventions for zones and on which objects to link them to in AD. The Service Connection Point can be seen from the "Active Directory Users and Computers" window as shown below. These service points are what PAM will use to authenticate users.

The image below describes the best practice layout for creating policies in Centrify.

All users are defined under Zones->UNIX Data->Users. Remember, all users and groups are created in AD. What shows up here are just pointers to the AD user objects. These users will eventually be inherited by the Child Zones.

The Hadoop cluster is the boundary for Centrify policies. No Hadoop node should belong to multiple zones. The only exception is when an RDBMS is used for the Hadoop components that need it, i.e. Ambari, Oozie, Hive.

Centrify agents support multiple domains where the same user exists across domains. Hadoop jobs pick up the real AD user.

It is best to name the child zones in lower case, and the name must match the Hadoop cluster name. In the sample policy above, "smesecurity" is the name of the child zone and it's also the name of the Hadoop cluster, with case matching. Only the nodes within this cluster should exist in Zones->Global->Child Zones->Computers.

The global users are not automatically pushed down to child zones. They have to be explicitly added. For users to successfully log in to Linux machines, they have to have a complete profile: UID, GID, and a Role Assignment. The Role Assignment grants the access.
There will be users that exist in the child zones but not in the parent zone. These are normally the service accounts that live only in the child zone.

The OU structure has to line up with how the zones are structured. This is the best practice.

It is possible to redefine the same user in the child zone with different properties, basically overriding what's defined globally for that user.

For large cluster installations, it's easier to use VPA, part of the DirectControl component, which automates the creation of user profiles in Centrify by just dropping the user into AD Groups. This is done through PowerShell or the Linux/Unix command interface.

All the policy information entered in Centrify is stored in AD. See below. The green box shows everything that was defined in Centrify, including the "smesecurity" child zone. This is also replicated across active directories for redundancy purposes.

Q: Centrify creates service principals for nfs and http. Will this create issues with Kerberizing HDP?
A: Yes. Centrify has its own Kerberos module for nfs and http. When Kerberizing clusters with Ambari, it automatically generates principals for the nfs and http services and this clashes with Centrify. To prevent issues, update the file /etc/centrifydc/centrifydc.conf on all machines and look for the property adclient.krb5.service.principals. Remove the "nfs" and "http" entries. It should look like this.
adclient.krb5.service.principals: ftp cifs
If for some reason the nfs and http entries were not removed and the Kerberos wizard in Ambari was run, NFS Gateway, DataNode and other components that depend on http will fail. To resolve this, update centrifydc.conf on all machines and remove nfs and http as described above. Also remove the http and nfs SPNs from AD. Then, on all machines, run the following commands.
# adreload
# service centrifydc restart

Q: Centrify ldapproxy won't start using TLS. Certificate cannot be found.
A: ldapproxy failing to start is normally caused by certificate names and casing not matching between AD and Centrify. Check that the certificate names in /var/centrify/net/certs/ match. Make sure that the file /etc/centrifydc/openldap/slapd.conf has entries for the Centrify certificate. See sample below.
# Centrify specific
TLSCACertificateFile /var/centrify/net/certs/auto_ComputerForLdaps_CA.pem
TLSCertificateFile /var/centrify/net/certs/auto_ComputerForLdaps.cert
TLSCertificateKeyFile /var/centrify/net/certs/auto_ComputerForLdaps.key
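
A quick way to verify the slapd.conf entries above is sketched here. The helper names are hypothetical and this is not a Centrify tool. It assumes one directive per line in the form shown in the sample, matches directive names without regard to case (an assumption), and checks each path exactly as written, so a casing mismatch in a file name shows up as a missing file.

```python
import os
from typing import Callable

REQUIRED_DIRECTIVES = (
    "TLSCACertificateFile",
    "TLSCertificateFile",
    "TLSCertificateKeyFile",
)


def tls_entries(conf_text: str) -> dict:
    """Collect the TLS certificate directives found in slapd.conf text."""
    wanted = {name.lower(): name for name in REQUIRED_DIRECTIVES}
    entries = {}
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        parts = stripped.split(None, 1)
        if len(parts) == 2 and parts[0].lower() in wanted:
            entries[wanted[parts[0].lower()]] = parts[1].strip()
    return entries


def tls_problems(conf_text: str,
                 path_exists: Callable[[str], bool] = os.path.exists) -> list:
    """List missing directives and directives that point to missing files."""
    entries = tls_entries(conf_text)
    problems = []
    for directive in REQUIRED_DIRECTIVES:
        if directive not in entries:
            problems.append(f"{directive} is not set")
        elif not path_exists(entries[directive]):
            problems.append(
                f"{directive} points to a missing file: {entries[directive]}"
            )
    return problems


sample_conf = """# Centrify specific
TLSCACertificateFile /var/centrify/net/certs/auto_ComputerForLdaps_CA.pem
TLSCertificateFile /var/centrify/net/certs/auto_ComputerForLdaps.cert
TLSCertificateKeyFile /var/centrify/net/certs/auto_ComputerForLdaps.key
"""

# The existence check is stubbed out so the sketch runs on any machine.
print(tls_problems(sample_conf, path_exists=lambda path: True))
```

On a real node, calling tls_problems with the contents of /etc/centrifydc/openldap/slapd.conf and the default path_exists checks the actual files on disk.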

Q: How do Centrify computer roles play into Hadoop clusters?
A: Computer roles allow you to define a set of rights for a logical group of computers. Ambari, Oozie and Hive Metastore all use RDBMS systems and there's a growing trend of organizations preferring to use Oracle and SQL Server. For example, an Oracle Admin and Oracle Server(s) are defined in computer roles and the admin rights are applied to these servers regardless of location. These servers can be used by multiple Hadoop clusters. The provisioning of computer role assignments can be done at the zone level or at the node level. There's a concept of delegating zone control from within a zone, computers and users, that can be used to specify which group has admin rights to it (not root rights but AD rights - see image below).

Q: When a new AD is added to the forest, how does Centrify pick it up?
A: There are configurations that allow Centrify agents to automatically walk the tree of AD domains and discover new AD servers within the forest. The discovery process is time based and can be changed. The agents also keep track of which AD controllers are up or down. There are PTR records in the AD DNS Manager, as shown below, that are used by Centrify agents to discover Domain Controllers and Global Catalog servers.

Q: Linux servers have their own DNS services and AD has its own built-in directory services. It's a painful process to point the Linux servers to AD and build PTR records for them. How does Centrify make this more seamless?
A: Centrify supports integrating with two different DNS environments (i.e. hortonworks.net and hortonworks.com) through a feature called "alias". Though possible and supported, it is not recommended to set up Centrify and Hadoop to deal with this type of configuration.

Q: What's the behavior of Centrify when a user logs in to machines using ssh?
A: If the user provides a password to log in, a Kerberos ticket will be automatically generated.
If an ssh key is used, the ticket will not be generated automatically and the user has to kinit. When forwardable tickets are turned on in Windows Kerberos systems, the user does not have to kinit again.

Q: How does Centrify sync with the latest AD changes?
A: Centrify has a utility called adflush to pull down the changes from AD. It can be an expensive process depending on what information is being pulled down. adflush is a perfect tool for developers in POC mode.

Q: How can I blacklist users in Centrify?
A: You can enter the users that you want to block in the file /etc/centrifydc/users.ignore.

Q: How do you safely snapshot Centrify?
A: If you snapshot machines with Centrify agents and roll back to the latest version and the keytab file has changed, the machines won't be able to authenticate with AD. Make sure that the keytabs are the same when rolling back to a specific snapshot version.

Q: With a very large cluster (in the thousands of nodes), how do you scale with Centrify and AD?
A: It is recommended to deploy the Domain Controller in the same rack space as the Hadoop nodes. You want your AD to be replicated. Hadoop will hammer AD with requests and you want to make sure that AD can handle it. Centrify is agent based, so there are no issues with scaling. The agents know which domain controllers to go to and which ones they can connect to faster.
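
For the nfs and http principal cleanup described in the Kerberizing HDP answer earlier in this post, the edit to centrifydc.conf could be scripted along these lines. This is a sketch, not a Centrify utility: drop_principals is a hypothetical helper, the sample starting value is made up, and it assumes the property sits on a single line in the "name: value value" form shown in that answer.

```python
PROPERTY = "adclient.krb5.service.principals"


def drop_principals(conf_text: str, unwanted=("nfs", "http")) -> str:
    """Return the config text with unwanted service principals removed.

    Only an uncommented line that starts with the property name is
    rewritten; every other line is passed through untouched.
    """
    out = []
    for line in conf_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(PROPERTY + ":"):
            values = stripped.split(":", 1)[1].split()
            kept = [v for v in values if v.lower() not in unwanted]
            line = f"{PROPERTY}: {' '.join(kept)}"
        out.append(line)
    trailing_newline = "\n" if conf_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline


# Illustrative input; the real file has many other properties.
before = "# sample\nadclient.krb5.service.principals: ftp cifs nfs http\n"
after = drop_principals(before)
print(after)
```

The result would still have to be written back to /etc/centrifydc/centrifydc.conf on every machine, followed by the adreload and service restart commands from that answer.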
10-02-2015
08:01 PM
7 Kudos
Here are some key things that will help keep an HDInsight cluster manageable and performing well. The following best practices should be noted.

Do not use only one storage account for a given HDInsight cluster. For a 48 node cluster, Microsoft recommends 4-8 storage accounts. This is not because of storage space but because each storage account provides additional networking bandwidth, which opens up the pipe as wide as possible for the compute nodes to finish their jobs faster.

Make the naming convention of the storage accounts as random as possible, with no prefix. This is to reduce the chances of hitting storage bottlenecks or common mode failures across all storage accounts at the same time. This type of storage partitioning in WASB is meant to avoid storage throttling.

Use D13 for head nodes and D12 for worker nodes.

When containers are created, make sure to have only one container per storage account. This yields better performance.

The Hive metastore that comes by default when HDInsight is deployed is transient. When the cluster is deleted, the Hive metastore gets deleted as well. Use Azure DB to store the Hive metastore so that it persists even when the cluster is blown away. Azure DB is basically SQL Server under the hood. Azure DB is not needed only if the cluster is created brand new every time and won't create the same tables.

When scaling down the cluster, some services stop and have to be started manually. As much as possible, scaling should be done when there are no jobs running.

The HDFS namespace recognizes both local storage and WASB storage. It is recommended not to change the Data Node directory in the HDFS configuration (which points to the local SSD storage).

NameNodes are not exposed from HDInsight, so distcp cannot be used to transfer data from a remote cluster to HDInsight. Use the WASB driver as much as possible to transfer data from an on-premise cluster to an HDInsight cluster since it yields better performance.

One thing to note is that only Hadoop services can be stopped. VMs are not exposed and cannot be paused. If the goal is to reduce the cost of a running environment, it's better to delete the cluster and recreate it when needed.
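
The advice on random, prefix-free storage account names can be illustrated with a short sketch. The function name is hypothetical. It relies on Azure storage account names being 3 to 24 characters of lowercase letters and digits, and it does not check whether a generated name is already taken.

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits


def random_storage_account_names(count: int, length: int = 24) -> list:
    """Generate distinct random names with no shared prefix.

    Names use only lowercase letters and digits, within the 3 to 24
    character limit for storage account names.
    """
    if not 3 <= length <= 24:
        raise ValueError("length must be between 3 and 24")
    names = set()
    while len(names) < count:
        names.add("".join(secrets.choice(ALPHABET) for _ in range(length)))
    return sorted(names)


# For example, six accounts for a 48 node cluster (within the 4-8 range).
for name in random_storage_account_names(6):
    print(name)
```

Availability of each name still has to be confirmed when the accounts are actually created.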