Member since
12-20-2016
6
Posts
6
Kudos Received
1
Solution
10-27-2017
10:24 PM
Thanks @Kuldeep Kulkarni, that is a good catch! I have updated the document accordingly.
09-11-2017
06:36 PM
You don't need to install Livy; the remote Spark context library required for doAs support comes bundled with Spark Thrift Server 1.6 in HDP. The default timeout while waiting for the Spark session to come up is 90 seconds, but you can increase it by setting 'server.connect.timeout' to a higher value in the Thrift Server config. This helps in particular when the cluster is very busy and a Spark application cannot be launched within 90 seconds.
09-07-2017
08:31 PM
Currently this applies only to Apache Spark 1.6 in HDP.
05-06-2017
10:17 AM
4 Kudos
Spark Thrift Server is a service that allows JDBC and ODBC clients to run Spark SQL queries on Apache Spark.
By default, Spark Thrift Server runs queries under the identity of the operating system account running the Spark Thrift Server. In a multi-user environment, queries often need to run under the identity of the end user who originated the query; this capability is called "user impersonation".
This article describes the work done in HDP 2.6 to support user impersonation for the Spark Thrift Server. The feature is supported in HDP 2.5.x and later versions, for Apache Spark 1 versions 1.6.3 and later.
When user impersonation is enabled, Spark Thrift Server runs queries as the submitting user. By running queries under the user account associated with the submitter, the Thrift Server can enforce user level permissions and access control lists. Associated data cached in Spark is visible only to queries from the submitting user.
User impersonation enables granular access control to Spark SQL at the level of files or tables. For finer-grained access control, such as row- or column-level security, see this article.
The user impersonation feature is controlled with a property called doAs. When doAs is set to true, Spark Thrift Server launches an on-demand Spark application to handle user queries. These queries are shared only with connections from the same user. Spark Thrift Server forwards incoming queries to the appropriate Spark application for execution, making the Spark Thrift Server extremely lightweight: it merely acts as a proxy to forward requests and responses. When all user connections for a Spark application are closed at the Spark Thrift Server, the corresponding Spark application also terminates.
Pre-requisites
If storage-based authorization is to be enabled, please follow the instructions in the Hive documentation.
Configuring and Enabling User Impersonation
To enable user impersonation for the Spark Thrift Server on an Ambari-managed cluster, complete the following steps:
Enable doAs support. Navigate to the “Advanced spark-hive-site-override” section and set
hive.server2.enable.doAs=true
Add DataNucleus jars to the Spark Thrift Server classpath. Navigate to the “Custom spark-thrift-sparkconf” section and add:
spark.jars=/usr/hdp/current/spark-thriftserver/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-rdbms-3.2.9.jar
(Optional) Disable the Spark YARN application for the Spark Thrift Server master. Navigate to the “Advanced spark-thrift-sparkconf” section and set
spark.master=local
This prevents launching an unused spark-client HiveThriftServer2 application master.
Restart the Spark Thrift Server.
To enable user impersonation for the Spark Thrift Server on a cluster not managed by Ambari, complete the following steps:
Enable doAs support. Add the following setting to the /usr/hdp/current/spark-thriftserver/conf/hive-site.xml file:
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
Add DataNucleus jars to the Spark Thrift Server classpath. Add the following setting to the /usr/hdp/current/spark-thriftserver/conf/spark-thrift-sparkconf.conf file:
spark.jars=/usr/hdp/current/spark-thriftserver/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-rdbms-3.2.9.jar
(Optional) Disable the Spark YARN application for the Spark Thrift Server master. Add the following setting to the /usr/hdp/current/spark-thriftserver/conf/spark-thrift-sparkconf.conf file:
spark.master=local
This prevents launching an unused spark-client HiveThriftServer2 application master.
Restart the Spark Thrift Server.
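On a cluster not managed by Ambari, the hive-site.xml change above could also be scripted. The following is only a minimal sketch in Python, not part of HDP: the helper name set_hive_property is hypothetical, and it works on the XML text rather than on the file so that it can be tried safely before touching a real configuration.

```python
import xml.etree.ElementTree as ET


def set_hive_property(xml_text: str, name: str, value: str) -> str:
    """Return hive-site.xml text with the named <property> set to value.

    Hypothetical helper for illustration; it is not shipped with HDP.
    """
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # Property already present: update its value in place.
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Property absent: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    original = "<configuration></configuration>"
    print(set_hive_property(original, "hive.server2.enable.doAs", "true"))
```

Note that ElementTree drops XML comments when parsing this way, so back up the original file before writing the result to /usr/hdp/current/spark-thriftserver/conf/hive-site.xml, and restart the Spark Thrift Server afterwards.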
Impersonation in Action
Permission and ACL Enforcement
When doAs is enabled, permissions and ACL restrictions are applied on behalf of the submitting user. In the following example, “foo_db” database has a table “drivers”, which only user “foo” can access:
A Beeline session running as user “foo” can access the data, read the drivers table, and create a new table based on the table:
The Spark queries run in a YARN application as user “foo”:
Hence all user permissions and ACLs are enforced when accessing tables, data, or other resources. In addition, all generated output is owned by user “foo”.
For the table created in the preceding Beeline session, the owner is user “foo”:
The per-user Spark Application Master ("AM") also gives us the ability to cache data in memory without other users accessing it. The cached data and state are restricted to the Spark AM running the query. The data and state are not in Spark Thrift Server, hence they are not visible to other users.
Spark Thrift Server as Proxy
Spark Thrift Server does not execute the actual user queries, but forwards them to the appropriate user-specific Spark AM.
This improves the scalability and fault tolerance of Spark Thrift Server.
When doAs is enabled for the Spark Thrift Server, the Thrift Server is responsible for the following features and capabilities:
Authorizing incoming user connections as per SASL rules.
Managing Spark applications launched on behalf of users:
Launch a Spark application if no appropriate application exists for the incoming request.
Terminate Spark AM when all relevant user connections are closed at Spark Thrift Server.
Acting as a proxy and forwarding requests/responses to the appropriate user’s Spark AM.
Ensuring that users' long-running Spark SQL sessions are supported, by keeping the Kerberos state valid.
Spark Thrift Server, and the Spark AM launched on behalf of a user, can be long-running applications in secure Kerberized clusters.
We do not require the submitter’s principal/keytab for a long-running user Spark AM.
Note that Spark Thrift Server continues to require the hive principal and keytab.
Enhancements for Connection URL Support
The connection URL format for Hive is documented here for reference. In doAs mode, we have enhanced Spark Thrift Server to support:
Default database.
Hive variables (hivevar).
Default Database in Connection URL
Specifying the connection URL as “jdbc:hive2://$HOST:$PORT/my_db” effectively results in an implicit “use my_db” when a user connects.
For an example, see the preceding Beeline session. The !connect command specified the connection URL for “foo_db”.
Hive Variables Support
Hive variables can be used to parameterize queries.
To set a Hive variable, use the set hivevar command:
set hivevar:key=value
You can also set a Hive variable as part of the connection URL (similar to Hive connection URL format). In the following Beeline example, plan=miles is appended to the connection URL, and is referenced in the query as ${hivevar:plan}.
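As a rough illustration of how the variable travels from the URL into the query, here is a minimal Python sketch. It assumes the standard Hive JDBC URL layout, in which Hive variables follow a '#' separator; the function names, the wage_plan column, and the query text are hypothetical and are not taken from the Beeline session.

```python
import re


def hivevar_url(host: str, port: int, database: str, hivevars: dict) -> str:
    """Build a connection URL carrying Hive variables after '#'.

    Assumes the standard Hive JDBC layout:
    jdbc:hive2://host:port/db;session_vars?hive_conf#hive_vars
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if hivevars:
        url += "#" + ";".join(f"{k}={v}" for k, v in hivevars.items())
    return url


def substitute_hivevars(query: str, hivevars: dict) -> str:
    """Mimic ${hivevar:name} substitution.

    In this sketch, names that are not defined are left untouched.
    """
    return re.sub(
        r"\$\{hivevar:(\w+)\}",
        lambda m: str(hivevars.get(m.group(1), m.group(0))),
        query,
    )


if __name__ == "__main__":
    variables = {"plan": "miles"}
    print(hivevar_url("sandbox.hortonworks.com", 10015, "foo_db", variables))
    # Hypothetical query; the column name is an assumption.
    print(substitute_hivevars(
        "SELECT * FROM drivers WHERE wage_plan = '${hivevar:plan}'", variables))
```

The substitution function only models what the server does with the variable; in practice the query is sent as written and the server resolves ${hivevar:plan}.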
Advanced connection management
By default, all connections for a user are forwarded to the same user Spark AM, to execute queries. In some cases, it is necessary to exercise finer-grained control.
Named connections
In doAs mode, we support user-named connections, identified by a user-specified connectionId that is passed as a Hive conf parameter in the connection URL.
Named connections are useful when there is a need to override Spark configuration, for example to use a different YARN queue or to specify different memory or cores for Spark executors. Named connections are scoped to a user.
For a user, an explicitly specified connectionId can be used to control which Spark AM executes the queries issued. If unspecified, a default implicit connectionId is associated with the Spark AM.
If Spark Thrift Server is unable to find a Spark AM for the given (user, connectionId) combination, it launches a new Spark AM. If already available, the user connection is associated with the existing Spark AM.
To name a connection explicitly, use the Hive conf parameter “spark.sql.thriftServer.connectionId”, as detailed in the example session below.
Every Spark AM managed by Spark Thrift Server is associated with a user and a connectionId. Connection IDs are not globally unique; they are specific to the user.
Named connections allow users to specify their own Spark AM connections. They do not allow a user to access the Spark AM associated with another user.
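The lookup rules described in this section can be modeled in a few lines. This is a minimal sketch of the described behavior, not the Spark Thrift Server implementation: the class names and the "default" value used for the implicit connectionId are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SparkAM:
    """Stand-in for a per-user Spark application (hypothetical)."""
    user: str
    connection_id: str
    open_connections: int = 0


@dataclass
class AMRegistry:
    """Maps (user, connectionId) to the Spark AM that serves it."""
    default_id: str = "default"  # assumed name for the implicit connectionId
    ams: dict = field(default_factory=dict)

    def connect(self, user, connection_id=None):
        """Reuse the AM for (user, connectionId), or launch a new one."""
        key = (user, connection_id or self.default_id)
        am = self.ams.get(key)
        if am is None:
            # No AM for this (user, connectionId) pair: launch one.
            am = SparkAM(*key)
            self.ams[key] = am
        am.open_connections += 1
        return am

    def disconnect(self, am):
        """Close one connection; terminate the AM when none remain."""
        am.open_connections -= 1
        if am.open_connections == 0:
            del self.ams[(am.user, am.connection_id)]


if __name__ == "__main__":
    registry = AMRegistry()
    first = registry.connect("foo", "conn1")
    second = registry.connect("foo", "conn1")
    other = registry.connect("bar", "conn1")
    print(first is second)  # same user and connectionId share one AM
    print(first is other)   # another user never shares it
```

Because the user is part of the key, two users who pick the same connectionId still get separate Spark AMs, which matches the scoping described above.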
Data sharing and Named connections
Each connectionId for a user identifies a different Spark AM.
For a user, cached data is shared and available only within a single AM, not across Spark AMs.
Different user connections on the same Spark AM can leverage previously cached data. Each user connection has its own Hive session (which maintains the current database, Hive variables, and so on), but shares the underlying cached data, executors, and Spark application.
To illustrate, here is a session for the first connection from user “foo” to named connection “conn1”:
As expected, after caching the ‘drivers’ table, the query runs an order of magnitude faster.
A second connection to the same connectionId from user “foo” is able to leverage the cached table from the other active Beeline session and significantly speed up query execution:
Overriding Spark Configuration Settings
If Spark Thrift Server is unable to find an existing Spark AM for a user connection, it will launch a new Spark AM to service user queries. This applies to both named and unnamed connections.
When a new Spark AM is to be launched, you can override current Spark configuration settings by specifying them in the connection URL. Specify Spark configuration settings as hiveconf variables with the ‘sparkconf’ prefix prepended.
The following connection URL includes a spark.executor.memory setting of 4 GB:
jdbc:hive2://sandbox.hortonworks.com:10015/foo_db;principal=hive/_HOST@REALM.COM?spark.sql.thriftServer.connectionId=my_conn;sparkconf.spark.executor.memory=4g
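A URL of this shape can be assembled programmatically. The sketch below is a hypothetical Python helper (thrift_url is not an HDP or Hive API) that places the connectionId and the sparkconf-prefixed settings in the Hive conf section after '?', following the layout of the URL above.

```python
def thrift_url(host, port, database, principal=None,
               connection_id=None, spark_conf=None):
    """Assemble a Spark Thrift Server connection URL (hypothetical helper).

    connection_id and spark_conf entries go into the Hive conf section
    after '?'; each Spark setting gets the 'sparkconf.' prefix.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if principal:
        url += f";principal={principal}"
    conf = []
    if connection_id:
        conf.append(f"spark.sql.thriftServer.connectionId={connection_id}")
    for key, value in (spark_conf or {}).items():
        conf.append(f"sparkconf.{key}={value}")
    if conf:
        url += "?" + ";".join(conf)
    return url


if __name__ == "__main__":
    print(thrift_url(
        "sandbox.hortonworks.com", 10015, "foo_db",
        principal="hive/_HOST@REALM.COM",
        connection_id="my_conn",
        spark_conf={"spark.executor.memory": "4g"},
    ))
```

Keep in mind that, as described above, the overrides take effect only when a new Spark AM is launched for the connection; connecting to an existing AM reuses its current settings.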
The environment tab of the Spark application shows the appropriate value:
03-16-2017
08:13 PM
2 Kudos
I could run 'Runner' without errors in local mode, so the code itself is probably not the issue. Can you paste the exception stack (and possibly the options) that causes this to surface? Also, I am not sure why you are doing the runJob: it will essentially be a no-op in this case, since the data is not cached. Regards, Mridul
12-20-2016
05:55 PM
org.apache.spark.Logging was a private API in 2.0. It was always marked as @Private, but was exposed until 1.6; this was tightened in 2.0 to remove its visibility. The bottom line is that you can't depend on it, and you have to modify your code to remove the dependency on the trait. Regards, Mridul