01-08-2017
06:44 PM
2 Kudos
In HDP 2.5, Zeppelin Notebook supports the ability to impersonate a user's security context when accessing a data set. This is critical to allow multi-tenant, fine-grained access using Ranger authorization policies. In order to support this, the user will first need to authenticate to the Zeppelin UI. (In Part 2, we harden this configuration to use LDAPS).
Zeppelin provides a few different authentication mechanisms (based on what the underlying Apache Shiro project supports); we'll use OpenLDAP in this walkthrough.
In this article, we'll assume you already have an OpenLDAP server installed in your environment, containing the users to whom you want to provide access to the Zeppelin UI (I plan to create a supporting article regarding OpenLDAP installation). We'll further assume that these users exist as local OS users; setting that up is outside the scope of this article, but a solution like SSSD can be used.
Our organizational structure is as follows:
With this structure, our search base will be restricted to dc=field,dc=hortonworks,dc=com and our user DN template will be cn={0},ou=People,dc=field,dc=hortonworks,dc=com, as the People OU contains the users.
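Before wiring this into Zeppelin, you can confirm the directory layout matches the template with a quick ldapsearch; a minimal sketch, assuming anonymous binds are permitted (otherwise add -D and -W), with an illustrative user:
# The returned DN should match the userDnTemplate pattern.
ldapsearch -x -H ldap://sslkdc.field.hortonworks.com:389 \
  -b ou=People,dc=field,dc=hortonworks,dc=com "(cn=user1)" dn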
The Zeppelin configuration in question can be found in the shiro_ini_content sub-section of the Advanced zeppelin-env section on the Configs tab in Ambari. The configuration should look like the below, with the search base, user DN template, and URL specific to your environment (my OpenLDAP server lives at sslkdc.field.hortonworks.com):
[main]
ldapRealm = org.apache.zeppelin.server.LdapGroupRealm
ldapRealm.contextFactory.environment[ldap.searchBase] = dc=field,dc=hortonworks,dc=com
ldapRealm.userDnTemplate = cn={0},ou=People,dc=field,dc=hortonworks,dc=com
ldapRealm.contextFactory.url = ldap://sslkdc.field.hortonworks.com:389
ldapRealm.contextFactory.authenticationMechanism = SIMPLE
sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager
securityManager.sessionManager = $sessionManager
securityManager.sessionManager.globalSessionTimeout = 86400000
shiro.loginUrl = /api/login
securityManager.realm = $ldapRealm
[roles]
admin = *
[urls]
/** = authc
Make sure you specify securityManager.realm = $ldapRealm, as this line isn't part of the sample shiro.ini included with Zeppelin. After these changes are saved, make sure to restart Zeppelin.
Now, navigate to the Zeppelin UI. With authentication enabled, no notebooks should be accessible until a user logs in. I'll now log in as user1 (full DN cn=user1,ou=People,dc=field,dc=hortonworks,dc=com, based on the userDnTemplate value above) with the password stored in OpenLDAP. After authenticating, I can navigate to the notebooks to which user1 has access.
In the next article in this series, we'll show how to use this authenticated subject to support impersonated access to data assets.
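As a quick sanity check outside the browser, you can also exercise Zeppelin's REST login endpoint directly; a minimal sketch (host, port, and credentials are illustrative):
# A successful login returns HTTP 200 and a session cookie.
curl -i -X POST http://zeppelin-host:9995/api/login \
  -d 'userName=user1' -d 'password=user1-password'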
12-15-2016
09:42 PM
12 Kudos
Hortonworks DataFlow 2.1 was recently released and includes a new feature which can be used to connect to an Azure Data Lake Store. This is a fantastic use case for HDF as the data-movement engine supporting a connected data plane architecture spanning on-premises and cloud deployments. This how-to assumes that you have created an Azure Data Lake Store account and that you have remote access to an HDInsight head node in order to retrieve some dependent JARs.
We will make use of the new Additional Classpath Resources feature for the GetHdfs and PutHdfs processors in NiFi 1.1, included within HDF 2.1. The following additional dependencies are required for ADLS connectivity:
adls2-oauth2-token-provider-1.0.jar
azure-data-lake-store-sdk-2.0.4-SNAPSHOT.jar
hadoop-azure-datalake-2.0.0-SNAPSHOT.jar
jackson-core-2.2.3.jar
okhttp-2.4.0.jar
okio-1.4.0.jar
The first three Azure-specific JARs can be found in /usr/lib/hdinsight-datalake/ on the HDInsight head node. The Jackson JAR can be found in /usr/hdp/current/hadoop-client/lib, and the last two in /usr/hdp/current/hadoop-hdfs-client/lib.
Once you've gathered these JARs, distribute them to all NiFi nodes and place them in a newly created /usr/lib/hdinsight-datalake directory, as in the sketch below.
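A minimal sketch of that distribution step (hostnames and the staging path are illustrative):
# Stage the six JARs locally in /tmp/adls-jars first, then push to each NiFi node.
for host in nifi1.example.com nifi2.example.com nifi3.example.com; do
  scp -r /tmp/adls-jars ${host}:/tmp/
  ssh ${host} 'sudo mkdir -p /usr/lib/hdinsight-datalake && sudo cp /tmp/adls-jars/*.jar /usr/lib/hdinsight-datalake/'
done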
In order to authenticate to ADLS, we'll use OAuth2. This requires the Tenant ID associated with your Azure account. The simplest way to obtain this is via the Azure CLI, using the azure account show command.
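For example, with the classic (node-based) Azure CLI (output formatting varies by CLI version):
# The Tenant ID appears in the account details.
azure account show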
You will also need to create an Azure AD service principal as well as an associated key. Navigate to Azure AD > App Registrations > Add.
Take note of the Application ID (aka the Client ID) and then generate a key via the Keys blade (please note the Client Secret value will be hidden after leaving this blade, so be sure to copy it somewhere safe and store it securely).
The service principal associated with this application will need service-level authorization to access the Azure Data Lake Store instance that we assume already exists as a prerequisite. This can be done via the IAM blade for your ADLS instance (please note you will not see the Add button in the top toolbar unless you have administrative access for your Azure subscription).
In addition, the service principal will need to have appropriate directory-level authorizations for the ADLS directories to which it should be authorized to read or write. These can be assigned via Data Explorer > Access within your ADLS instance.
At this point, you should have your Tenant ID, Client ID, and Client Secret available, and we will now be able to configure core-site.xml in order to access Azure Data Lake via the PutHdfs processor.
The important core-site values are as follows (note the variables identified with the '$' sigil below, including the one embedded in the refresh URL path).
<property>
<name>dfs.adls.oauth2.access.token.provider.type</name>
<value>ClientCredential</value>
</property>
<property>
<name>dfs.adls.oauth2.refresh.url</name>
<value>https://login.microsoftonline.com/$YOUR_TENANT_ID/oauth2/token</value>
</property>
<property>
<name>dfs.adls.oauth2.client.id</name>
<value>$YOUR_CLIENT_ID</value>
</property>
<property>
<name>dfs.adls.oauth2.credential</name>
<value>$YOUR_CLIENT_SECRET</value>
</property>
<property>
<name>fs.AbstractFileSystem.adl.impl</name>
<value>org.apache.hadoop.fs.adl.Adl</value>
</property>
<property>
<name>fs.adl.impl</name>
<value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
</property>
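Before involving NiFi, you can optionally verify the OAuth2 settings from any host with a Hadoop client; a minimal sketch, assuming the ADLS JARs and the core-site.xml above are on the client's classpath (the account name is illustrative):
export HADOOP_CLASSPATH=/usr/lib/hdinsight-datalake/*
hadoop fs -ls adl://youraccount.azuredatalakestore.net/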
We're now ready to configure the PutHdfs processor in NiFi.
For Hadoop configuration resources, point to your modified core-site.xml including the properties above and an hdfs-site.xml (no ADLS-specific changes are required).
Additional Classpath Resources should point to the /usr/lib/hdinsight-datalake to which we copied the dependencies on all NiFi nodes.
The input to this PutHdfs processor can be any FlowFile; it may be simplest to use the GenerateFlowFile processor to create the input, with some Custom Text such as:
The time is ${now()}
When you run the data flow, you should see the FlowFiles appear in the ADLS directory specified in the processor, which you can verify using the Data Explorer in the Azure Portal, or via some other means.
11-17-2016
09:27 PM
9 Kudos
SmartSense 1.3 includes Activity Explorer, which hosts prebuilt notebooks that visualize cluster utilization data related to user, queue, job duration, and job resource consumption, including an HDFS Dashboard notebook. This dashboard helps operators better understand how HDFS is being used and which users and jobs are consuming the most resources within the file system.
It's important to note that the source data for ACTIVITY.HDFS_USER_FILE_SUMMARY comes from fsimage, which does not contain file- and directory-level access information. Many operators are also interested in more fine-grained analytics regarding cluster data use, which can drive decisions such as storage tiering using HDFS heterogeneous storage. Since these data are not available in fsimage, we will use the Ranger audit data for HDFS, which the Ranger plugin writes during authorization events. The best practice is for the plugin to write these data both to Solr (for short-term use, driving performance in the Ranger UI) and to HDFS for long-term storage. Please note the principal used in the GetHDFS processor will need read access to the HDFS directory storing the Ranger audit data.
The audit data, after some formatting for readability, looks like:
We will create a NiFi dataflow to shred this JSON data into a Hive table; please see the attached template (ranger-audit-analytics.xml). We first use GetHDFS to pull the audit data file, and then split the FlowFile by line, as each line contains a JSON fragment. EvaluateJsonPath is used to pull particular attributes that are valuable for analytics. We then use ReplaceText to create the HiveQL INSERT statements that populate our Hive table, and finally we use PutHiveQL to execute these INSERT statements.
Once we've loaded these data into Hive, we're ready to use Zeppelin to explore and visualize the data. For instance, let's take a look at the most frequently accessed directories. As another example, we can see the last time a particular resource was accessed. These visualizations can be combined with the HDFS Dashboard ones for a more robust picture of HDFS-related activity on a multi-tenant cluster.
Hive Table Schema:
create external table audit
(reqUser string,
evtTime timestamp,
access string,
resource string,
action string,
cliIP string
)
ROW FORMAT DELIMITED
STORED AS ORC
LOCATION '/user/nifi/audit';
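For example, a query along the lines of the "most frequently accessed directories" visualization can also be run from the shell (the HiveServer2 URL is illustrative; Zeppelin's Hive/JDBC interpreter runs the same SQL):
beeline -u "jdbc:hive2://hiveserver:10000/default" -e "
  SELECT resource, COUNT(*) AS num_accesses
  FROM audit
  GROUP BY resource
  ORDER BY num_accesses DESC
  LIMIT 10;"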
11-09-2016
08:07 PM
@Slim Jones please try setting the Thrift Transport mode option to HTTP and let us know if that works for you.
11-04-2016
09:05 PM
My pleasure, @Kate Shaw. Yes, that's right: you would specify org.apache.ranger.plugin.conditionevaluator.RangerTimeOfDayMatcher as the evaluator value within the policyConditions array, as shown above (with evaluatorOptions empty in this case). Policy conditions are registered via the HTTP PUT call, with the full JSON document as the payload.
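A minimal sketch of that registration call (the Ranger host and credentials are illustrative):
# Fetch the current Hive service definition, add the condition to its
# policyConditions array, then PUT the full document back.
curl -u admin:admin http://ranger-host:6080/service/public/v2/api/servicedef/name/hive > hive-servicedef.json
# ... edit hive-servicedef.json to add the policy condition ...
curl -u admin:admin -X PUT -H "Content-Type: application/json" \
  -d @hive-servicedef.json http://ranger-host:6080/service/public/v2/api/servicedef/name/hive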
11-01-2016
05:43 PM
Hi @Kate Shaw, you bet. I edited the answer to include the details regarding the policy condition value; it is mapped to LOCATION_COUNTRY_CODE in this case. For Time of Day capabilities, you would register a policy condition using the RangerTimeOfDayMatcher evaluator. The time-of-day pattern value would look like, for example, "9 AM - 5 PM".
10-26-2016
11:05 PM
7 Kudos
Our scenario for this walkthrough is as follows: we have a customer table that contains fields for Zip Code, MRN, and Blood Type. Per policy, users in the analyst group cannot access MRN and Blood Type together with Zip Code within the same query, as this would deanonymize sensitive Personal Health Information.
In order to make use of Ranger functionality to achieve this, we'll need to register a new policy condition using the Ranger API. Please see my HCC post on this topic for further details. In this case, the policy condition will have the following form within the policyConditions array contained in the /servicedef/name/hive resource (please note the itemId value is specific to one's environment):
{
"itemId": 1,
"name": "resources-accessed-together",
"evaluator": "org.apache.ranger.plugin.conditionevaluator.RangerHiveResourcesAccessedTogetherCondition",
"evaluatorOptions": {},
"label": "Resources Accessed Together?",
"description": "Resources Accessed Together?"
}
The RangerHiveResourcesAccessedTogetherCondition evaluator is included with Ranger. Once this condition is registered using the Ranger API, we can make use of it within a Deny condition for a resource-based policy in Ranger. The policy will be associated with the zipcode field in our ww_customer table. We then need to associate the Blood Type and MRN fields with the resources-accessed-together policy condition we registered above as Deny conditions.
Now when joe_analyst, a user in the analyst group, attempts to access these combined fields, the query will be denied. Please note that joe_analyst can query, say, Zip Code and Blood Type together, as no patient identifier like MRN is in play.
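To illustrate from the command line (the HiveServer2 URL, credentials, and column names are assumed for this sketch):
# Expect an authorization error: zipcode together with mrn and bloodtype is denied.
beeline -u "jdbc:hive2://hiveserver:10000/default" -n joe_analyst -p 'password' \
  -e "SELECT zipcode, mrn, bloodtype FROM ww_customer LIMIT 10;"
# This should succeed, since no patient identifier (MRN) is included.
beeline -u "jdbc:hive2://hiveserver:10000/default" -n joe_analyst -p 'password' \
  -e "SELECT zipcode, bloodtype FROM ww_customer LIMIT 10;"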
10-19-2016
06:16 PM
9 Kudos
Apache Knox provides a gateway to HiveServer2 which can be used to proxy connections to Hive from BI tools like Tableau Desktop. With a secure cluster, HiveServer2 is Kerberized, which usually means the client must have a valid TGT in their ticket cache in order to authenticate to HiveServer2. The Knox gateway can simplify this in many environments by supporting client authentication to Kerberized Hive without a TGT: the client authenticates to Knox via, say, LDAP, and HiveServer2 trusts Knox to proxy connections on behalf of this user. Another benefit of using Knox is that the client doesn't need to know where HiveServer2 is hosted, so if an administrator needs to move it to another master node, this is transparent to the applications connecting via Knox.
In the environment for this walkthrough, I have Tableau Desktop 10.0.1 installed, and I'll be using HDP 2.5. The first step is to ensure you have the latest Hortonworks ODBC Driver for Apache Hive (2.1.5 at the time of this writing), which can be downloaded from https://hortonworks.com/downloads/.
I will assume that Knox has already been installed. If you don't have an LDAP server in your environment for testing, Ambari includes a Demo LDAP service that can be started from the Knox service. Further, we assume Ranger is being used for Hive authorization (so Hive impersonation is disabled, i.e., hive.server2.enable.doAs is false).
In order to configure Hive for Knox, we'll need to change the transport mode (hive.server2.transport.mode) to 'http' (it is set to 'binary' by default) and then restart Hive. Assuming Ranger is being used to authorize access to Knox, we'll need to create an appropriate policy in Ranger for the users and groups that require access.
On our local machine, we'll need access to the certificate Knox is using for TLS. In production environments, your desktop and PKI administrators may already have taken steps to ensure that your system has the appropriate certificate installed. By default, Knox uses a self-signed certificate which is not trusted by our system. We can extract this certificate from the Knox server (assuming we have appropriate access) and then specify its location when connecting from Tableau Desktop. Please note this certificate does not contain sensitive data such as private key material. We can execute the following commands to extract the certificate on the Knox server host.
knoxserver=$(hostname -f)
openssl s_client -connect ${knoxserver}:8443 <<<'' | openssl x509 -out /tmp/knox.crt
We then need to copy the certificate to our local environment, using scp or some other utility, and take note of its location. You may want to test connectivity from beeline first to ease later troubleshooting, to assure any issues identified are specific to Tableau (although please note beeline will use JDBC); see the sketch at the end of this section.
We are now ready to connect to Hive from Tableau Desktop. We will use the Hortonworks Hadoop Hive native connector. We will specify our Knox server hostname in the Server field, the port over which we are connecting to Knox using TLS, an Authentication method of HTTPS, and the username and password of an LDAP user that has access (via the Ranger policies for Knox and for Hive that have been configured). Please see the Advanced users-ldif content within the Knox configuration for appropriate users in the Demo LDAP (such as guest, sam, and tom). We need to specify the HTTP path of gateway/default/hive, select the Require SSL checkbox, and click the "No custom configuration . . ." orange link to specify the path at which we've saved the Knox certificate (or at which our Desktop Administrator has provided a root certificate that was used to sign Knox's certificate), as seen in the screenshot below.
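Here is the beeline check mentioned above; a minimal sketch (hostname, passwords, and paths are illustrative, and the httpPath assumes the default Knox topology):
# Import the Knox certificate into a Java truststore for beeline, then
# connect through Knox over the HTTP transport.
keytool -import -noprompt -alias knox -file /tmp/knox.crt \
  -keystore /tmp/knox-truststore.jks -storepass changeit
beeline -u "jdbc:hive2://knoxserver.example.com:8443/;ssl=true;sslTrustStore=/tmp/knox-truststore.jks;trustStorePassword=changeit;transportMode=http;httpPath=gateway/default/hive" \
  -n guest -p guest-password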
10-13-2016
04:24 AM
Thanks @Muthukumar S. Can you please provide further details? How does role-based authentication work with an on-premises source outside of AWS?
10-08-2016
08:11 PM
4 Kudos
In HDF 2.0, administrators can secure access to individual NiFi components in order to support multi-tenant authorization. This provides organizations the ability to create least privilege policies for distinct groups of users.
For example, let's imagine we have a NiFi Team and a Hadoop Team at our company: the Hadoop Team can only access dataflows they've created, whereas the NiFi Team can access all dataflows. NiFi 1.0 in HDF 2.0 can use different authorizers, such as file-based policies (managed within NiFi) and Ranger-based policies (managed within Ranger), as well as custom, pluggable authorizers.
In this example, we'll use Ranger. For more detail on configuring Ranger as the authorizer for NiFi, please see this article.
To separate the different teams' dataflows, we'll create separate process groups for each team. In NiFi, access policies are inheritable, supporting simpler policy management with the flexibility of overriding access at the component level. This means that all processors, as well as any nested process groups, within the Hadoop Team's root process group will be accessible by the Hadoop Team automatically.
Let's see an example of the canvas when nifiadmin, a member of the NiFi team, is logged in.
On the other hand, when hadoopadmin, a member of the Hadoop Team, is logged in, we'll see a different representation, given the different level of access.
When hadoopadmin drills down into the NiFi Team's process group (the title is blank without read access), notice that this user cannot make any changes (the toolbar items are grayed out).
Let's take a look at how this was configured in Ranger. The nifiadmin user has full access to NiFi, and so has read and write access to all resources.
Since the hadoopadmin user has more restrictive access, we'll configure separate policies in Ranger for this user. Firstly, hadoopadmin will need read and write access to the /flow resource in order to access the UI and modify any dataflows.
Secondly, this user needs a policy for the root Hadoop Team process group. In order to configure this, we need to capture the globally unique identifier, or GUID, associated with this process group, which is visible and can be copied from the NiFi UI. The Ranger policy will provide read and write access to this process group within the /process-groups resource. Notice that hadoopadmin can modify the dataflow within the Hadoop Team process group (the toolbar items are not grayed out, and new processors can be dragged and dropped onto the canvas).
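As a rough sketch, the same process-group policy could also be created via the Ranger REST API (the service name, GUID, host, and credentials are illustrative; the nifi-resource key assumes Ranger's stock NiFi service definition):
# Grant hadoopadmin read and write access to the Hadoop Team process group.
curl -u admin:admin -X POST -H "Content-Type: application/json" \
  http://ranger-host:6080/service/public/v2/api/policy -d '{
    "service": "nifi_service",
    "name": "hadoop-team-process-group",
    "resources": {
      "nifi-resource": { "values": ["/process-groups/<GUID>"], "isExcludes": false, "isRecursive": false }
    },
    "policyItems": [{
      "users": ["hadoopadmin"],
      "accesses": [ { "type": "READ", "isAllowed": true }, { "type": "WRITE", "isAllowed": true } ]
    }]
  }'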