Member since: 10-02-2017
Posts: 112
Kudos Received: 71
Solutions: 11

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3098 | 08-09-2018 07:19 PM |
 | 3900 | 03-16-2018 09:21 AM |
 | 4038 | 03-07-2018 10:43 AM |
 | 1156 | 02-19-2018 11:42 AM |
 | 4030 | 02-02-2018 03:58 PM |
07-15-2018
01:20 PM
3 Kudos
Key takeaways:
1. For a HiveServer2 client, the connection time observed is the total time spent interacting with AD (TGT + ZooKeeper service ticket + HiveServer2 service ticket) + ZooKeeper + HiveServer2 (MySQL + YARN allocation).
2. If your AD is slow, the Hive connection will take a long time.
3. time beeline -u "zookeeper connection string" -e "select 1" can be used to measure how long beeline takes; a JDBC-based equivalent is sketched below.
4. In general it takes 4 to 10 seconds for a connection to be established.
5. Neither AD, ZooKeeper nor HiveServer2 should ever deny a connection; the connection time can be longer, but ideally the connection is never denied.
6. Clients can only time out (a configurable parameter in most clients such as HUE, SAS, Alation and health-check scripts), as neither ZooKeeper nor HS2 ever denies a connection.
7. HiveServer2 will try to allocate resources in YARN before acknowledging that it has accepted the connection. If your queue is full, the connection time will be impacted.
8. Kindly set hive.server2.tez.initialize.default.sessions=true on HS2 if you want a connection to be accepted without waiting for YARN allocation (as YARN resources are then already allocated).
9. If you mention a queue name in your JDBC string, the connection will be accepted only after resources are allocated in YARN.

Reasons under which the connection time is high:
1. AD is slow.
2. ZooKeeper has too many connections or ZooKeeper is slow.
3. HiveServer2's interaction with MySQL is slow.
4. Heavy GC is happening within HiveServer2 or ZooKeeper.
5. HS2 can deny a connection if it has exhausted all its handler threads.
6. ZooKeeper can deny a connection if it has reached its max rate limit from a host: https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
7. MySQL slowness directly impacts HiveServer2.
8. MySQL is reaching its max_connection limit.
9. The network is slow.
10. HiveServer2 does a lot of retries for every service it talks to (Atlas, Solr, Kafka, MySQL, DataNode, NameNode, RM); keep an eye on any retries that are happening.

The ways to find the time taken by the individual steps are:
1. Run beeline in debug mode: https://community.hortonworks.com/content/supportkb/150574/how-to-enable-debug-logging-for-beeline.html
2. strace -t beeline -u "Zookeeper JDBC string" -e "select 1"
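Below is a minimal sketch (assuming the Hive JDBC driver is on the classpath and the ZooKeeper hosts and namespace in the URL are replaced with your own) of the same check as the time beeline command above, measuring only the connection-establishment step from a JVM client:

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class HiveConnectTimer {
    public static void main(String[] args) throws Exception {
        // Hypothetical discovery URL; substitute your ZooKeeper quorum and namespace.
        String url = "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;"
                   + "serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        long start = System.nanoTime();
        try (Connection conn = DriverManager.getConnection(url)) {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // 4-10 seconds is typical; anything much larger points at AD, ZooKeeper,
            // the metastore RDBMS or YARN queue capacity, as listed above.
            System.out.println("Connection established in " + elapsedMs + " ms");
        }
    }
}
```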
... View more
Labels:
03-16-2018
09:21 AM
It looks like there are no free resources available in YARN and the job state is not being promoted from ACCEPTED to RUNNING. Can you please verify whether free containers are available in the queue?
... View more
03-07-2018
10:43 AM
1 Kudo
Please go through the basic understanding of InputFormats and RecordReaders: http://bytepadding.com/big-data/map-reduce/how-records-are-handled-map-reduce/
Example of a custom InputFormat: http://bytepadding.com/big-data/spark/combineparquetfileinputformat/
A few pointers:
1. Start with a basic understanding of splits, InputFormats, RecordReaders, file formats and compression.
2. Go through the code of TextInputFormat: http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.1.0-pre7/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java
3. FileInputFormat is the abstract base class for all file-based input formats; go through its basic functionality: http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.1.0-pre7/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#FileInputFormat
4. Decide what the logical record for your InputFormat is and what the splitting strategy is; depending on this, extend FileInputFormat and override/implement the getSplits() and getRecordReader() methods. The important FileInputFormat methods are getSplits(), which determines, for each split (one per task), the start and end byte index, and getRecordReader(), which decides how the bytes of the split being read are turned into records. A skeleton is sketched below.
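Here is a minimal skeleton (using the newer mapreduce API, so createRecordReader rather than getRecordReader; the class name is hypothetical) of a custom input format that keeps FileInputFormat's default splitting and plugs in LineRecordReader:

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Return false if a logical record must never be split across two splits
        // (for example whole-file records or unsplittable compression codecs).
        return true;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // Delegate to LineRecordReader: a record is the bytes between two newlines.
        // Swap in your own RecordReader to define a different logical record.
        return new LineRecordReader();
    }
}
```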
... View more
02-25-2018
04:02 PM
1. What is the file format?
2. Kindly provide the job conf parameters as used in the code: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.7.0/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java We need to know the values used for:
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job);
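For reference, a minimal sketch (stock Hadoop 2.x configuration keys; the block size value is just an example) of how those two values combine with the HDFS block size into the actual split size, using the same formula as FileInputFormat.computeSplitSize():

```java
import org.apache.hadoop.conf.Configuration;

public class SplitSizeCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        long blockSize = 128L * 1024 * 1024;   // example: the dfs block size of the file
        long minSize = conf.getLong("mapreduce.input.fileinputformat.split.minsize", 1L);
        long maxSize = conf.getLong("mapreduce.input.fileinputformat.split.maxsize", Long.MAX_VALUE);

        // splitSize = max(minSize, min(maxSize, blockSize))
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        System.out.println("split size = " + splitSize + " bytes");
    }
}
```

So raising split.minsize above the block size produces fewer, larger splits (fewer mappers), while lowering split.maxsize below the block size produces more, smaller splits.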
... View more
02-23-2018
09:30 PM
3 Kudos
How HDFS Applies Ranger Policies

Apache Ranger: Apache Ranger™ is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform. The vision with Ranger is to provide comprehensive security across the Apache Hadoop ecosystem. With the advent of Apache YARN, the Hadoop platform can now support a true data lake architecture. Enterprises can potentially run multiple workloads in a multi-tenant environment. Data security within Hadoop needs to evolve to support multiple use cases for data access, while also providing a framework for central administration of security policies and monitoring of user access.

Ranger Goals Overview
Apache Ranger has the following goals:
1. Centralized security administration to manage all security-related tasks in a central UI or using REST APIs.
2. Fine-grained authorization to do a specific action and/or operation with a Hadoop component/tool, managed through a central administration tool.
3. Standardized authorization method across all Hadoop components.
4. Enhanced support for different authorization methods: role-based access control, attribute-based access control, etc.
5. Centralized auditing of user access and administrative (security-related) actions within all the components of Hadoop.

Ranger maintains various types of rule mappings; the general layout looks like:
1. User -> groups -> policy -> actual resource (HDFS path, Hive table) access/deny/allowed/read/write
2. User -> policy -> actual resource (HDFS path, Hive table) access/deny/allowed/read/write

Key takeaways about Ranger:
1. Ranger is not an identity management system; it is a service which holds the policy mappings.
2. Ranger is not concerned with the actual relationship between user names and group names.
3. You can create a dummy group and attach it to a user; Ranger is not bothered whether this relationship exists in LDAP or not.
4. Ranger users and groups are synced from the same LDAP which powers the rest of the Hadoop cluster.
5. It is this common LDAP shared between Ranger and the Hadoop cluster which enables them to see the same users.
6. Nowhere does Ranger claim that it knows all the users present on the cluster; it is the job of the Ranger user sync to sync users and groups into Ranger.
NameNode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

Key takeaways:
1. The NameNode is where the metadata of a file in HDFS is maintained.
2. While reading or writing a file, HDFS clients interact with the NameNode to get the locations of the file blocks on the various DataNodes, and eventually interact with the DataNodes themselves.
3. All file permission checks for HDFS happen at the NameNode.
4. The NameNode maintains POSIX-style permissions (user : group : other) but also supports fine-grained access by applying Hadoop ACLs. Please follow the following link for an interesting perspective on HDFS compared to Linux ext3.
5. dfs.namenode.acls.enabled = true enables ACLs on the NameNode.
6. To know more about Hadoop ACLs, follow the link.
7. Hadoop POSIX permissions are not sufficient to express all possible permissions applicable to a given file or directory.
8. For setting and unsetting ACLs, use hdfs dfs -setfacl and hdfs dfs -getfacl; the equivalent FileSystem API calls are sketched below.
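A minimal sketch (assuming an HDFS client classpath with the cluster configs on it, a caller with sufficient rights, dfs.namenode.acls.enabled=true as noted above, and hypothetical path and group names) of the same ACL operations through the Java FileSystem API:

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

public class HdfsAclExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/tmp/testDir");   // hypothetical path

        // Equivalent of: hdfs dfs -setfacl -m group:analysts:r-x /tmp/testDir
        AclEntry entry = new AclEntry.Builder()
                .setScope(AclEntryScope.ACCESS)
                .setType(AclEntryType.GROUP)
                .setName("analysts")           // hypothetical group
                .setPermission(FsAction.READ_EXECUTE)
                .build();
        fs.modifyAclEntries(dir, Arrays.asList(entry));

        // Equivalent of: hdfs dfs -getfacl /tmp/testDir
        System.out.println(fs.getAclStatus(dir));
    }
}
```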
How NameNode and Ranger Interact

HDFS permission checks happen when an HDFS client interacts with the NameNode. The NameNode has both its own ACLs and Ranger policies to apply. Evaluation starts with the Ranger policies and then falls back to the Hadoop ACLs.
How it all works (doAs=true, impersonation enabled):
1. Ranger policies are fetched by the NameNode and maintained in a local cache. Do realize the HDFS Ranger plugin is not a separate process, but a library which is executed inside the NameNode.
2. The user authenticates to the NameNode using one of the specified authentication mechanisms: simple or Kerberos.
3. The NameNode gets the username during the authentication phase. Do remember that even with Kerberos authentication, groups available in the ticket are never used.
4. Based on how core-site.xml is configured, the NameNode either looks up LDAP to fetch the groups of the authenticated user, or it does a lookup from the underlying OS (NSS -> SSSD -> LDAP) to fetch the groups.
5. Once groups are fetched, the NameNode has the user-to-groups mapping of the authenticated user.
6. The HDFS Ranger plugin has the mapping of user -> groups -> policy; the groups that were fetched by the NameNode are now used to select the Ranger policies and enforce them.
7. Do realize Ranger might provide a relation of one user -> mapped to 3 groups -> 3 groups mapped to 3 policies. Not all policies mapped to those three groups will be applied by default.
8. The NameNode fetches the groups at its own end (via LDAP or the OS), and only the groups overlapping with the Ranger group rules are used while enforcing the policies.
9. The NameNode (HDFS Ranger plugin library) writes audit logs locally, which are eventually pushed to the Ranger audit store (Solr).
10. If for some reason groups are not fetched by the NameNode for the authenticated user, all the Ranger policies mapped to those groups will not be applied.
11. Sometimes mapping users to policies directly helps mitigate issues when LDAP is not working correctly.
12. Do realize all the mappings here are in terms of group names and not GIDs, as there can be scenarios where a GID is available on the OS but no group name.
13. If there are no Ranger policies for the user, the Hadoop ACLs are applied and the appropriate permission is enforced.

Config (hdfs-site.xml):
dfs.namenode.inode.attributes.provider.class = org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer
RangerHdfsAuthorizer => calls checkPermission => which internally gets the groups of the authenticated user using the UserGroupInformation class.
Code flow (HDFS authorization):
RangerHdfsAuthorizer: https://github.com/apache/ranger/blob/master/hdfs-agent/src/main/java/org/apache/ranger/authorization/hadoop/RangerHdfsAuthorizer.java#L64
checkPermission (from the username, get the groups and check privileges): https://github.com/apache/ranger/blob/master/hdfs-agent/src/main/java/org/apache/ranger/authorization/hadoop/RangerHdfsAuthorizer.java#L200
String user = ugi != null ? ugi.getShortUserName() : null;
Set<String> groups = ugi != null ? Sets.newHashSet(ugi.getGroupNames()) : null;
Getting groups from the username: https://github.com/apache/ranger/blob/master/hdfs-agent/src/main/java/org/apache/ranger/authorization/hadoop/RangerHdfsAuthorizer.java#L208
UserGroupInformation, the core class to authenticate users and get their groups (Kerberos authentication, LDAP, PAM): http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.action/0.2.0/org/apache/hadoop/security/UserGroupInformation.java#221
Groups, which (if nothing is configured in core-site.xml) invokes a shell to get the groups for the user: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-common/2.7.0/org/apache/hadoop/security/Groups.java#Groups.getUserToGroupsMappingService%28org.apache.hadoop.conf.Configuration%29
ShellBasedUnixGroupsMapping, the default implementation: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-common/2.7.0/org/apache/hadoop/security/ShellBasedUnixGroupsMapping.java#ShellBasedUnixGroupsMapping
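The two lines quoted from checkPermission are easy to try out directly. A minimal sketch (Hadoop client classpath and the cluster's core-site.xml assumed, username taken from the command line) that resolves groups the same way the plugin does, server-side from the username rather than from the Kerberos ticket:

```java
import java.util.Arrays;
import org.apache.hadoop.security.UserGroupInformation;

public class GroupLookupCheck {
    public static void main(String[] args) throws Exception {
        String user = args.length > 0 ? args[0] : "hdfs";   // hypothetical default user
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser(user);

        // These are the groups Ranger will intersect with its own user -> group -> policy
        // mapping. An empty list here usually means the LDAP/SSSD/OS lookup is broken,
        // which is exactly the failure mode described in point 10 above.
        System.out.println(ugi.getShortUserName() + " -> " + Arrays.toString(ugi.getGroupNames()));
    }
}
```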
... View more
Labels:
02-22-2018
03:56 PM
2 Kudos
How HiveServer2 Applies Ranger Policies

Apache Ranger: Apache Ranger™ is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform. The vision with Ranger is to provide comprehensive security across the Apache Hadoop ecosystem. With the advent of Apache YARN, the Hadoop platform can now support a true data lake architecture. Enterprises can potentially run multiple workloads in a multi-tenant environment. Data security within Hadoop needs to evolve to support multiple use cases for data access, while also providing a framework for central administration of security policies and monitoring of user access.

Ranger Goals Overview
Apache Ranger has the following goals:
1. Centralized security administration to manage all security-related tasks in a central UI or using REST APIs.
2. Fine-grained authorization to do a specific action and/or operation with a Hadoop component/tool, managed through a central administration tool.
3. Standardized authorization method across all Hadoop components.
4. Enhanced support for different authorization methods: role-based access control, attribute-based access control, etc.
5. Centralized auditing of user access and administrative (security-related) actions within all the components of Hadoop.

Ranger maintains various types of rule mappings; the general layout looks like:
1. User -> groups -> policy -> actual resource (HDFS path, Hive table) access/deny/allowed/read/write
2. User -> policy -> actual resource (HDFS path, Hive table) access/deny/allowed/read/write

Key takeaways about Ranger:
1. Ranger is not an identity management system; it is a service which holds the policy mappings.
2. Ranger is not concerned with the actual relationship between user names and group names.
3. You can create a dummy group and attach it to a user; Ranger is not bothered whether this relationship exists in LDAP or not.
4. Ranger users and groups are synced from the same LDAP which powers the rest of the Hadoop cluster.
5. It is this common LDAP shared between Ranger and the Hadoop cluster which enables them to see the same users.
6. Nowhere does Ranger claim that it knows all the users present on the cluster; it is the job of the Ranger user sync to sync users and groups into Ranger.
How HiveServer2 and Ranger Interact

How it all works (doAs=true, impersonation enabled):
1. Ranger policies are fetched by HiveServer2 and maintained in a local cache. Do realize the Hive Ranger plugin is not a separate process, but a library which is executed inside HiveServer2.
2. The user authenticates to HiveServer2 using one of the specified authentication mechanisms: LDAP, Kerberos, PAM, etc.
3. HiveServer2 gets the username during the authentication phase. Do remember that even with Kerberos authentication, groups available in the ticket are never used.
4. Based on how core-site.xml is configured, HiveServer2 either looks up LDAP to fetch the groups of the authenticated user, or it does a lookup from the underlying OS (NSS -> SSSD -> LDAP) to fetch the groups.
5. Once groups are fetched, HiveServer2 has the user-to-groups mapping of the authenticated user.
6. The HiveServer2 Ranger plugin has the mapping of user -> groups -> policy; the groups that were fetched by HiveServer2 are now used to select the Ranger policies and enforce them.
7. Do realize Ranger might provide a relation of one user -> mapped to 3 groups -> 3 groups mapped to 3 policies. Not all policies mapped to those three groups will be applied by default.
8. HiveServer2 fetches the groups at its own end (via LDAP or the OS), and only the groups overlapping with the Ranger group rules are used while enforcing the policies.
9. HiveServer2 (the Ranger plugin library) writes audit logs locally, which are eventually pushed to the Ranger audit store (Solr).
10. If for some reason groups are not fetched by HiveServer2 for the authenticated user, all the Ranger policies mapped to those groups will not be applied.
11. Sometimes mapping users to policies directly helps mitigate issues when LDAP is not working correctly.
12. Do realize all the mappings here are in terms of group names and not GIDs, as there can be scenarios where a GID is available on the OS but no group name.

Config (hiveserver2-site.xml):
hive.security.authorization.manager = org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizerFactory
RangerHiveAuthorizerFactory => creates RangerHiveAuthorizer => which internally calls the checkPrivileges() method, which subsequently gets the groups of the authenticated user using the UserGroupInformation class.
Code flow (Ranger authorization):
RangerHiveAuthorizerFactory: https://github.com/apache/ranger/blob/master/hive-agent/src/main/java/org/apache/ranger/authorization/hive/authorizer/RangerHiveAuthorizerFactory.java
RangerHiveAuthorizer: https://github.com/apache/ranger/blob/master/hive-agent/src/main/java/org/apache/ranger/authorization/hive/authorizer/RangerHiveAuthorizer.java
checkPrivileges (from the username, get the groups and check privileges): https://github.com/apache/ranger/blob/master/hive-agent/src/main/java/org/apache/ranger/authorization/hive/authorizer/RangerHiveAuthorizer.java#L225
UserGroupInformation ugi = getCurrentUserGroupInfo(); https://github.com/apache/ranger/blob/master/hive-agent/src/main/java/org/apache/ranger/authorization/hive/authorizer/RangerHiveAuthorizer.java#L220
Getting groups from the username:
Set<String> groups = Sets.newHashSet(ugi.getGroupNames());
UserGroupInformation mUgi = userName == null ? null : UserGroupInformation.createRemoteUser(userName); https://github.com/apache/ranger/blob/master/hive-agent/src/main/java/org/apache/ranger/authorization/hive/authorizer/RangerHiveAuthorizerBase.java#L65
getCurrentUserGroupInfo(): https://github.com/apache/ranger/blob/master/hive-agent/src/main/java/org/apache/ranger/authorization/hive/authorizer/RangerHiveAuthorizerBase.java#L92
UserGroupInformation, the core class to authenticate users and get their groups (Kerberos authentication, LDAP, PAM): http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.action/0.2.0/org/apache/hadoop/security/UserGroupInformation.java#221
Groups, which (if nothing is configured in core-site.xml) invokes a shell to get the groups for the user: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-common/2.7.0/org/apache/hadoop/security/Groups.java#Groups.getUserToGroupsMappingService%28org.apache.hadoop.conf.Configuration%29
ShellBasedUnixGroupsMapping, the default implementation: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-common/2.7.0/org/apache/hadoop/security/ShellBasedUnixGroupsMapping.java#ShellBasedUnixGroupsMapping
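To make steps 6 to 8 of the flow concrete, here is a deliberately simplified sketch (not Ranger's actual implementation; the user, group and policy names are made up) of how only the policies whose groups overlap with the groups HiveServer2 itself resolved end up being enforced:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.security.UserGroupInformation;

public class PolicyOverlapSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Ranger-side mapping: group name -> policy name.
        Map<String, String> groupToPolicy = new HashMap<>();
        groupToPolicy.put("analysts", "hive_read_sales");
        groupToPolicy.put("etl", "hive_write_staging");
        groupToPolicy.put("admins", "hive_all_access");

        // Groups as HiveServer2 resolves them (LDAP or OS lookup), never from the ticket.
        String user = args.length > 0 ? args[0] : "alice";   // hypothetical user
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser(user);
        Set<String> resolvedGroups = new HashSet<>(Arrays.asList(ugi.getGroupNames()));

        // Only policies mapped to groups that HiveServer2 actually resolved are applied;
        // if group resolution fails, resolvedGroups is empty and no group policy fires.
        for (Map.Entry<String, String> e : groupToPolicy.entrySet()) {
            if (resolvedGroups.contains(e.getKey())) {
                System.out.println("applying policy " + e.getValue() + " via group " + e.getKey());
            }
        }
    }
}
```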
... View more
Labels:
02-22-2018
12:55 PM
Users and Groups significance in HDFS

Before we even start, let's take a look back at how users and groups are handled in Linux. The key takeaways from the previous articles were the various relationships that exist:
1. Every group has a group id.
2. Every user has a user id.
3. In Linux it is not possible to have a user without a group id (by default, when a user is created it gets a group with the same name).
4. A user can have one primary group and multiple secondary groups.
5. A group can have multiple users.
6. Authentication is done based on username and password.
7. Authorization is done based on groups, as Unix follows POSIX permissions for user : group : others.
8. A user cannot exist without a group.
9. A group can exist without a user.
10. A file can only have usernames and groups which are part of the Linux OS (local or from a remote service).
11. A file's ownership can never be changed to a non-existent user (create a file and try chown XXXXXX fileName).
12. Linux applies the authorization policy not only while reading a file but also while creating it.
13. In a Linux system there can be no resource which is owned by a random user the OS is not aware of.
14. The OS maintains (locally or in LDAP) a table of users and groups, and will never allow a user outside of this mapping to create, delete or own a file.
Let's try creating a file on HDFS:
1. Change the current user to hdfs locally with sudo su hdfs.
2. hdfs is the superuser for the HDFS filesystem, like root is the superuser on the Linux file system.
3. Create a directory in HDFS: hadoop dfs -mkdir /tmp/testDir
4. Change the ownership of /tmp/testDir to a random user and group: hadoop dfs -chown XXXX:YYYY /tmp/testDir
5. Doing an ls on "/tmp" with hadoop dfs -ls /tmp | grep testDir will display drwxr-xr-x - XXXX YYYY 0 2018-02-20 11:00 /tmp/testDir

Key takeaways:
1. The hdfs user is the superuser in HDFS.
2. HDFS has no strict policy regarding users and groups, unlike your Linux OS.
3. You interact with HDFS through the hdfs client; the hdfs client takes the username of the user it was run as on the Linux OS.
4. HDFS always checks permissions while reading a file; while creating or chown-ing, it does not check who is creating or owning the file.
5. Your Linux OS users are, in a way, related to the users on HDFS, as the hdfs client picks up the Linux user it was run as.
6. HDFS provides two kinds of security mappings, POSIX permissions and ACLs, and it is for ACLs that it requires the user-to-group mapping to be made available to it.
7. In the HDFS file system, users and groups are not as tightly coupled as in Linux.
8. User identity is never maintained within HDFS; the user identity mechanism is extrinsic to HDFS itself. There is no provision within HDFS for creating user identities, establishing groups, or processing user credentials. The same experiment through the Java FileSystem API is sketched below.
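A minimal sketch (HDFS client classpath and cluster configs assumed, run as the hdfs superuser) reproducing the experiment above through the Java FileSystem API, showing that HDFS happily records an owner and group it has never heard of:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChownToUnknownUser {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());  // picks up core-site/hdfs-site
        Path dir = new Path("/tmp/testDir");

        fs.mkdirs(dir);
        // "XXXX" and "YYYY" are deliberately made-up names; unlike Linux chown,
        // HDFS does not verify that the user or the group exists anywhere.
        fs.setOwner(dir, "XXXX", "YYYY");

        FileStatus status = fs.getFileStatus(dir);
        System.out.println(status.getOwner() + ":" + status.getGroup());   // prints XXXX:YYYY
    }
}
```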
HDFS: The Hadoop Distributed File System (HDFS) implements a permissions model for files and directories that shares much of the POSIX model. Each file and directory is associated with an owner and a group. The file or directory has separate permissions for the user that is the owner, for other users that are members of the group, and for all other users. For files, the r permission is required to read the file, and the w permission is required to write or append to the file. For directories, the r permission is required to list the contents of the directory, the w permission is required to create or delete files or directories, and the x permission is required to access a child of the directory. In contrast to the POSIX model, there are no setuid or setgid bits for files, as there is no notion of executable files. For directories, there are no setuid or setgid bits either, as a simplification. The sticky bit can be set on directories, preventing anyone except the superuser, directory owner or file owner from deleting or moving the files within the directory. Setting the sticky bit for a file has no effect. Collectively, the permissions of a file or directory are its mode. In general, Unix customs for representing and displaying modes will be used, including the use of octal numbers in this description. When a file or directory is created, its owner is the user identity of the client process, and its group is the group of the parent directory (the BSD rule). HDFS also provides optional support for POSIX ACLs (Access Control Lists) to augment the file permissions with finer-grained rules for specific named users or named groups. Each client process that accesses HDFS has a two-part identity composed of the user name and the groups list. Whenever HDFS must do a permissions check for a file or directory foo accessed by a client process: if the user name matches the owner of foo, then the owner permissions are tested; else if the group of foo matches any member of the groups list, then the group permissions are tested; otherwise the other permissions of foo are tested. If a permissions check fails, the client operation fails. See Hadoop Groups Mapping for details. A tiny sketch of this check order follows.
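To make the check order concrete, a tiny self-contained sketch (plain Java, not HDFS's actual FSPermissionChecker) of owner bits, then group bits, then other bits:

```java
import java.util.Collections;
import java.util.Set;

public class PosixCheckSketch {
    /** mode is e.g. 0755; action is a bitmask: 4 = read, 2 = write, 1 = execute. */
    static boolean permitted(String user, Set<String> userGroups,
                             String owner, String group, int mode, int action) {
        int bits;
        if (user.equals(owner)) {
            bits = (mode >> 6) & 7;     // owner permissions are tested
        } else if (userGroups.contains(group)) {
            bits = (mode >> 3) & 7;     // else group permissions are tested
        } else {
            bits = mode & 7;            // otherwise "other" permissions are tested
        }
        return (bits & action) == action;
    }

    public static void main(String[] args) {
        // /tmp/testDir owned by XXXX:YYYY with mode 0755: others may read but not write.
        Set<String> groups = Collections.singleton("analysts");   // hypothetical groups
        System.out.println(permitted("alice", groups, "XXXX", "YYYY", 0755, 4)); // true
        System.out.println(permitted("alice", groups, "XXXX", "YYYY", 0755, 2)); // false
    }
}
```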
... View more
Labels:
02-22-2018
12:16 PM
3 Kudos
Linux Authentication and Authorization Mechanisms

Security plays a very important role in making any software enterprise-ready. Linux has a robust security architecture, yet it allows pluggable modules to connect to any external identity manager for authenticating and authorizing users. The two most important modules that provide the security features at the OS level are:
1. PAM
2. NSS

PAM: The Pluggable Authentication Module allows integration of various authentication technologies such as standard UNIX, RSA, DCE, LDAP etc. into system services such as login, passwd, rlogin, su, ftp, ssh etc. without changing any of these services. First implemented by Sun Solaris, PAM is now the standard authentication framework of many Linux distributions, including RedHat and Debian. It provides an API through which authentication requests are mapped into technology-specific actions (implemented in the so-called PAM modules). This mapping is done by the PAM configuration files, which basically specify, for each service, the authentication mechanisms to use. In our case, the pam_ldap module, implemented in the shared library pam_ldap.so, allows user and group authentication using an LDAP service. Each service that needs an authentication facility can be configured through the PAM configuration files to use different authentication methods. This means that it is possible, using the PAM configuration files, to write a custom list of requirements that a user must satisfy to obtain access to a resource.

NSS: Once a user is authenticated, many applications still need access to user information. This information is traditionally contained in text files (/etc/passwd, /etc/shadow, and /etc/group) but can also be provided by other name services. As a new name service (such as LDAP) is introduced, it can be implemented either in the C library (as it was for NIS and DNS) or in the application that wants to use the new name service. This can be avoided by using a common, general-purpose name service API and by delegating to a set of libraries the task of retrieving this information through technology-specific operations. This solution was adopted in the GNU C Library, which implements the Name Service Switch, a method originating from the Sun C library that makes it possible to obtain information from various name services through a common API. NSS uses a common API and a configuration file (/etc/nsswitch.conf) in which the name service providers for every supported database are specified. The databases currently supported by NSS [2] are:
aliases: mail aliases.
ethers: Ethernet numbers.
group: groups of users.
hosts: host names and numbers.
netgroup: network-wide lists of hosts and users.
network: network names and numbers.
protocols: network protocols.
passwd: user passwords.
rpc: remote procedure call names and numbers.
services: network services.
shadow: shadow user passwords.
Using the nss_ldap shared library it is possible to implement the maps above using LDAP; however, here I'll focus only on the LDAP implementation of the shadow, passwd and group databases, though all the maps above can be implemented. For most of the other maps it is even inadvisable to store them in LDAP, as they tend not to change too often, so it is not a problem to have them locally as files, and storing them in LDAP would cause some minor performance loss.

Key takeaways:
1. NSS can look up the local configs /etc/group and /etc/passwd for user, uid, group and gid mappings, or it can look them up in LDAP.
2. PAM can look up /etc/passwd to retrieve the username and password for authentication, or look them up in a third source.
3. As both modules can talk to an external source, there can be uids and gids visible on the Linux host which are not present in the local configs /etc/passwd and /etc/group.
4. This provides the ability to manage all users, groups and passwords in LDAP, with both PAM and NSS pointing to LDAP (the remote source).
5. In that setup, all the uids and gids visible on the OS are fetched from LDAP and have no presence in the local configs.
6. The most important thing to appreciate is that applications are transparent to where the users and groups come from: the application makes a call to PAM or NSS and eventually hits the remote source (LDAP).
7. This provides an opportunity to maintain all the users and groups in LDAP and not locally on the machine.
8. Do remember both PAM and NSS have logic to resolve name conflicts: if a uid is present in /etc/passwd, a gid in /etc/group, and the same uid exists in LDAP, the conflict is resolved based on the configuration. Generally the local config wins over the remote source.
9. An application can also make a call to LDAP independently; there is no restriction that it always goes through PAM or NSS to reach LDAP.
As users are maintained in a remote service, making lookups every time becomes expensive, and PAM and NSS configurations eventually started becoming very complex; hence tools like SSSD and VASD came into being.

SSSD: The sssd daemon (running locally on the Linux OS) acts as the spider in the web, controlling the login process and more. The login program communicates with the configured PAM and NSS modules, which in this case are provided by the SSSD package. These modules communicate with the corresponding SSSD responders, which in turn talk to the SSSD monitor. SSSD looks up the user in the LDAP directory, then contacts the Kerberos KDC for authentication and to acquire tickets. (PAM and NSS can also talk to LDAP directly using pam_ldap and nss_ldap respectively; however, SSSD provides additional functionality.) Of course, a lot of this depends on how SSSD has been configured; there are lots of different scenarios. For example, you can configure SSSD to do authentication directly with LDAP, or to authenticate via Kerberos. The sssd daemon does not actually do much that cannot be done with a system that has been "assembled by hand", but it has the advantage that it handles everything in a centralised place. Another important benefit of SSSD is that it caches the credentials, which eases the load on the servers and makes it possible to go offline and still log in. This way you don't need a local account on the machine for offline authentication. In a nutshell, SSSD is able to provide what nss_ldap, pam_ldap, pam_krb and nscd used to provide, in a seamless way. Let's look at how PAM and NSS integrate with SSSD.

Key takeaways:
1. PAM, NSS and SSSD/VASD are present locally on your Linux OS.
2. Any call made to the OS for authentication or authorization results in a call to PAM/NSS, eventually to SSSD, and eventually to AD or LDAP.
3. SSSD can integrate with LDAP, AD or a KDC.
4. These layers are completely transparent to the OS applications.
5. SSSD/VASD maintains a cache locally on the OS.
6. SSSD/VASD will look up both the external source and the local files to resolve user -> password, username -> uid, uid -> username, group name -> gid, gid -> group name, etc.
7. The getent passwd and getent group commands show the source from which the info is fetched; a JVM-side check is sketched below.
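For a JVM application such as the NameNode or HiveServer2, the default Hadoop group mapping ultimately asks the OS, so whatever NSS/SSSD (and hence nsswitch.conf) resolves is what the application sees. A minimal sketch (Hadoop client classpath assumed, username taken from the command line) of that lookup:

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.Groups;

public class OsGroupLookup {
    public static void main(String[] args) throws Exception {
        String user = args.length > 0 ? args[0] : "hdfs";   // hypothetical default user

        // With no hadoop.security.group.mapping override in core-site.xml, this uses the
        // default OS-backed mapping (JNI or shell based), i.e. the lookup that walks the
        // NSS -> SSSD -> LDAP chain described above. If the user exists only in LDAP,
        // a non-empty list here shows the chain is working end to end.
        List<String> groups = Groups.getUserToGroupsMappingService(new Configuration()).getGroups(user);
        System.out.println(user + " -> " + groups);
    }
}
```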
... View more
Labels:
02-21-2018
11:32 PM
The two most important aspects of security are:
1. Authentication
2. Authorization

Authentication: the process of ascertaining that somebody really is who he claims to be (who are you?).
Authorization: the process of verifying that you can access something (are you allowed to access the resource?).

Let's take the example of a user logging into a Linux machine (ssh / terminal login). One needs to authenticate oneself with a username and password, thus verifying that he is the person he claims to be. The same user might not be authorized to access a file, as he does not have enough permissions to read/write the file. The idea extends to other services as well: one can log in (authenticate) to booking.com if he has a profile, but is not authorized to change the prices of the flights; only admins are allowed to do that. Hence authentication and authorization play a key role in determining the security aspects of a service.

Let's see how authentication and authorization are implemented in the Linux OS. The three most important files from a security perspective are:
1. /etc/passwd
2. /etc/group
3. /etc/shadow

An excerpt from /etc/passwd:
username : x : uid : gid : user info : home dir : shell to use
Things to know:
1. x denotes that the encrypted password is saved in the /etc/shadow file.
2. The gid present here is the primary group id of the user. A user can be part of multiple groups, but the one present in /etc/passwd is his primary group.

An excerpt from /etc/group:
group name : password : gid : group list

Things to know:
1. A password is generally not used, but we can have a password for a group too.
2. The group list refers to the list of user names. These users have this group as a secondary group.

Let's look at the various relationships that exist:
1. Every group has a group id.
2. Every user has a user id.
3. In Linux it is not possible to have a user without a group id (by default, when a user is created it gets a group with the same name).
4. A user can have one primary group and multiple secondary groups.
5. A group can have multiple users.
6. Authentication is done based on username and password.
7. Authorization is done based on groups, as Unix follows POSIX permissions for user : group : others.
Some important Linux commands:
1. sudo adduser username: adds a user with the group name equal to the user name. In Linux a user cannot exist without a group.
2. id username: uid=1001(foobar) gid=1001(foobar) groups=1001(foobar),4201(security) shows the groups of a user (/etc/passwd holds the primary group info). For user foobar, group foobar (gid 1001) is the primary group and security (4201) is a secondary group.
3. groups username: lists all the groups the user belongs to (/etc/group has this info).
4. To change the primary group of a user: sudo usermod -g groupname username
5. getent passwd and getent group can also be used to look up this info; they also show the source from which the info is looked up.

The Linux OS security architecture is very restrictive. The various aspects are:
1. A user cannot exist without a group.
2. A group can exist without a user.
3. A file can only have usernames and groups which are part of the Linux OS (local or from a remote service).
4. A file's ownership can never be changed to a non-existent user (create a file and try chown XXXXXX fileName).
5. Linux applies the authorization policy not only while reading a file but also while creating it.
6. In a Linux system there can be no resource which is owned by a random user the OS is not aware of.
A sketch tying the /etc/passwd and /etc/group fields back to the id output above follows.
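A small illustrative sketch (the example lines are made up; nothing is read from the real system) that parses /etc/passwd- and /etc/group-style records to recover exactly these relationships: the gid field in passwd gives the primary group, and the membership lists in group give the secondary groups:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PasswdGroupParser {
    public static void main(String[] args) {
        // username : x : uid : gid : user info : home dir : shell
        String passwdLine = "foobar:x:1001:1001:Foo Bar:/home/foobar:/bin/bash";
        // group name : password : gid : group list (members for whom this is a secondary group)
        List<String> groupLines = Arrays.asList(
                "foobar:x:1001:",                // primary group, same name as the user
                "security:x:4201:foobar,alice"   // secondary group listing foobar as a member
        );

        String[] pw = passwdLine.split(":");
        String user = pw[0];
        String primaryGid = pw[3];

        List<String> secondary = new ArrayList<>();
        for (String line : groupLines) {
            String[] gr = line.split(":");
            String members = gr.length > 3 ? gr[3] : "";
            if (Arrays.asList(members.split(",")).contains(user) && !gr[2].equals(primaryGid)) {
                secondary.add(gr[0] + "(" + gr[2] + ")");
            }
        }
        // Mirrors: id foobar -> uid=1001(foobar) gid=1001(foobar) groups=...,4201(security)
        System.out.println(user + ": primary gid=" + primaryGid + ", secondary groups=" + secondary);
    }
}
```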
... View more
Labels:
02-20-2018
04:25 PM
Let's start with a quick understanding of a file saved on a Linux OS:
1. vim /etc/fstab to know the filesystem of the disk.
2. Create and save a file using vim.
3. filefrag -v filename

Facts to appreciate:
1. The filefrag command gives you an idea of how many blocks (OS blocks formed by grouping sectors) the file occupies. A block is the minimum amount of data that can be read by the OS; data is fetched from the hard disk in multiples of blocks.
2. The blocks might not be contiguous (can you correlate something with HDFS?).
3. The filesystem has no idea about the records saved in the file or the format of the file. For the filesystem, a file is a sequence of bytes.
4. The number of blocks occupied on the Linux filesystem is file size / filesystem block size, which mostly is file size / 4096 for ext3/4.
5. A record is a logical entity which only the reader and writer of the file can understand. The filesystem has no clue about it.
6. We have standardized the process of identifying a record with the file format.
7. Example: in a file of "text" format, a record is the sequence of bytes contained between two \n characters.
8. The editors we use have this logic built in. For example, vim is a text editor which has this notion of \n determining records as part of its code base.
9. A record can be spread across two blocks, as the filesystem does not consider the notion of records while dividing the file into blocks.

The file in the Hadoop world. When a file is stored in the Hadoop filesystem, which is distributed in nature, the following facts need to be appreciated:
1. HDFS is distributed in nature.
2. HDFS uses the OS filesystem on the individual nodes to store data. On an individual node, an HDFS block is saved as multiple OS blocks on the hard disk. The HDFS block is itself a higher-level abstraction: one HDFS block (128 MB) on a given node comprises multiple OS blocks (4 KB) which are made of sectors (512 bytes) on the hard disk.
3. The file is divided into HDFS blocks by: num blocks = file size / HDFS block size.
4. The file's blocks might reside on the same node or be distributed across multiple nodes.
5. HDFS has no idea of the notion of records or file formats. For HDFS, a file is just a sequence of bytes that needs to be stored.
6. As there is no notion of records, it is very much possible that a record is split across two blocks.
7. To view the blocks of a file in HDFS, use hadoop fsck filePath -files -blocks.

The question arises: who takes care of honoring the notion of records and file formats in Hadoop?
1. InputFormats.
2. RecordReaders.

An InputFormat is the code which knows how to read a specific file, hence this is the code which helps you read a specific file format. Just like we use vim to open text files and a PDF reader to open PDF files, we use TextInputFormat to read text files saved in HDFS and SequenceFileInputFormat to read sequence files. The InputFormat holds the logic of how the file has been split and saved, and which RecordReader will be used to read the records in the splits. Once we have the InputFormat, the next logical question is which component decides how the sequence of bytes read is converted into records. The RecordReader is the code which understands how to logically form a record from the stream of read bytes.

Let's take the example of a text file and try to understand the concepts in depth. Let there be a text file: the format itself says a record is formed by all the bytes between \n or \r, and the individual bytes in the record are encoded in UTF-8.
Let's assume this big text file is saved onto HDFS. The facts to understand are:
1. The file is broken into HDFS blocks and saved. The blocks can be spread across multiple nodes.
2. A record can be split between two HDFS blocks.
3. TextInputFormat uses LineRecordReader.
4. TextInputFormat uses the logic of file size / HDFS block size (the getSplits functionality) to find the number of splits the file consists of.
5. For each file split, one can find the start byte and end byte index, which is provided as input to the record reader.
6. A record reader knows how to read from byte index X to byte index Y with certain conditions:
1. If the starting byte index X == 0 (start of the file), then include all the bytes up to the first \n in the first record.
2. If the starting byte index X != 0 (all the splits except the first), then discard the bytes up to the first \n.
3. If byte index Y == file size (end of file), then do not read any further records.
4. If byte index Y != file size (all the splits except the last), then go ahead and read one extra record from the next block.

What all these conditions ensure is:
1. The record reader which reads the first split of the file consumes the first line.
2. All other record readers always skip the initial bytes until the first \n occurs.
A simplified sketch of these rules is given below.
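A simplified sketch (local file and plain Java I/O, not Hadoop's actual LineRecordReader, which additionally handles compression, \r\n and split metadata) of exactly those boundary rules: every reader except the first skips up to the first newline, and every reader except the last reads one extra record past its end offset:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class SplitLineReaderSketch {

    /** Print the records "owned" by the split [start, end) of a local text file. */
    static void readSplit(String file, long start, long end, long fileSize) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file, "r")) {
            long pos = start;
            in.seek(pos);
            if (start != 0) {
                // Rule 2: not the first split, so discard bytes up to the first '\n';
                // that partial (or whole) first line belongs to the previous reader.
                while (pos < fileSize && in.read() != '\n') { pos++; }
                pos++;
            }
            StringBuilder record = new StringBuilder();
            while (pos < fileSize) {
                int b = in.read();
                pos++;
                if (b == '\n') {
                    System.out.println("record: " + record);
                    record.setLength(0);
                    // Rules 3 and 4: keep going while records still start at or before
                    // 'end'; the last record read may spill past the split boundary.
                    if (pos > end) { break; }
                } else {
                    record.append((char) b);
                }
            }
            if (record.length() > 0) { System.out.println("record: " + record); }
        }
    }

    public static void main(String[] args) throws IOException {
        String file = args[0];
        long size = new java.io.File(file).length();
        long mid = size / 2;                       // pretend the file has two "blocks"
        System.out.println("-- split 1 --");
        readSplit(file, 0, mid, size);             // keeps the first line (rule 1)
        System.out.println("-- split 2 --");
        readSplit(file, mid, size, size);          // skips up to the first '\n' (rule 2)
    }
}
```

Run it against any multi-line text file: every line shows up exactly once across the two splits, even when a line straddles the midpoint.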
... View more
Labels: