About mmuralidharan

mmuralidharan · ‎10-27-2017

Thanks @Kuldeep Kulkarni, that is a good catch! I have updated the document accordingly.

mmuralidharan · ‎09-11-2017

You don't need to install Livy; the remote Spark context library required for doAs support comes bundled with Spark Thrift Server 1.6 in HDP. The default timeout while waiting for the Spark session to come up is 90 seconds, but you can increase it by setting 'server.connect.timeout' to a higher value in the Thrift Server config, particularly if the cluster is very busy and a Spark application cannot be launched within 90 seconds.

mmuralidharan · ‎09-07-2017

Currently this applies only to Apache Spark 1.6 in HDP.

mmuralidharan · ‎05-06-2017

Spark Thrift Server is a service that allows JDBC and ODBC clients to run Spark SQL queries on Apache Spark. By default, Spark Thrift Server runs queries under the identity of the operating system account running the Spark Thrift Server. In a multi-user environment, queries often need to run under the identity of the end user who originated the query; this capability is called "user impersonation".

This article describes the work done in HDP 2.6 to support user impersonation for the Spark Thrift Server. The feature is supported in HDP 2.5.x and later versions, for Apache Spark 1 versions 1.6.3 and later.

When user impersonation is enabled, Spark Thrift Server runs queries as the submitting user. By running queries under the user account associated with the submitter, the Thrift Server can enforce user level permissions and access control lists. Associated data cached in Spark is visible only to queries from the submitting user. User impersonation enables granular access control to Spark SQL at the level of files or tables. For finer-grained access control, such as row- or column-level security, see this article.

The user impersonation feature is controlled with a property called doAs. When doAs is set to true, Spark Thrift Server launches an on-demand Spark application to handle user queries. These queries are shared only with connections from the same user. Spark Thrift Server forwards incoming queries to the appropriate Spark application for execution, making the Spark Thrift Server extremely lightweight: it merely acts as a proxy to forward requests and responses. When all user connections for a Spark application are closed at the Spark Thrift Server, the corresponding Spark application also terminates.

Pre-requisites

If storage based authorization is to be enabled, please follow instructions from Hive documentation.
Configuring and Enabling User Impersonation

To enable user impersonation for the Spark Thrift Server on an Ambari-managed cluster, complete the following steps:

1. Enable doAs support. Navigate to the “Advanced spark-hive-site-override” section and set:
   hive.server2.enable.doAs=true
2. Add DataNucleus jars to the Spark Thrift Server classpath. Navigate to the “Custom spark-thrift-sparkconf” section and add:
   spark.jars=/usr/hdp/current/spark-thriftserver/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-rdbms-3.2.9.jar
3. (Optional) Disable Spark yarn application for Spark Thrift Server master. Navigate to the “Advanced spark-thrift-sparkconf” section and set:
   spark.master=local
   This prevents launching an unused spark-client HiveThriftServer2 application master.
4. Restart the Spark Thrift Server.

To enable user impersonation for the Spark Thrift Server on a cluster not managed by Ambari, complete the following steps:

1. Enable doAs support. Add the following setting to the /usr/hdp/current/spark-thriftserver/conf/hive-site.xml file:
   <property>
     <name>hive.server2.enable.doAs</name>
     <value>true</value>
   </property>
2. Add DataNucleus jars to the Spark Thrift Server classpath. Add the following setting to the /usr/hdp/current/spark-thriftserver/conf/spark-thrift-sparkconf.conf file:
   spark.jars=/usr/hdp/current/spark-thriftserver/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-rdbms-3.2.9.jar
3. (Optional) Disable Spark yarn application for Spark Thrift Server master. Add the following setting to the /usr/hdp/current/spark-thriftserver/conf/spark-thrift-sparkconf.conf file:
   spark.master=local
   This prevents launching an unused spark-client HiveThriftServer2 application master.
4. Restart the Spark Thrift Server.
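On a cluster not managed by Ambari, it can be handy to confirm that the doAs property actually landed in hive-site.xml before restarting. The following is a minimal sketch, not part of HDP: the helper names get_hive_property and is_doas_enabled are hypothetical, and the sample XML assumes the usual <configuration> wrapper around <property> entries.

```python
import xml.etree.ElementTree as ET


def get_hive_property(hive_site_xml, name):
    """Return the value of the named property in hive-site.xml text, or None."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.iter("property"):
        # Each <property> is expected to hold one <name> and one <value> child.
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def is_doas_enabled(hive_site_xml):
    """True only if hive.server2.enable.doAs is present and set to true."""
    value = get_hive_property(hive_site_xml, "hive.server2.enable.doAs")
    return value is not None and value.lower() == "true"


# Sample content; the <configuration> root element is an assumption.
SAMPLE = """<configuration>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(is_doas_enabled(SAMPLE))
```

To check a real installation, read /usr/hdp/current/spark-thriftserver/conf/hive-site.xml into a string and pass it to is_doas_enabled.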
Impersonation in Action

Permission and ACL Enforcement

When doAs is enabled, permissions and ACL restrictions are applied on behalf of the submitting user. In this example, the “foo_db” database has a table “drivers”, which only user “foo” can access. A Beeline session running as user “foo” can access the data, read the drivers table, and create a new table based on it. The Spark queries run in a YARN application as user “foo”, so all user permissions and ACLs are enforced while accessing tables, data, or other resources. In addition, all output generated belongs to user “foo”: for the table created in the Beeline session, the owner is user “foo”.

The per-user Spark Application Master ("AM") also gives us the ability to cache data in memory without other users accessing it. The cached data and state are restricted to the Spark AM running the query. The data and state are not held in Spark Thrift Server, so they are not visible to other users.

Spark Thrift Server as Proxy

Spark Thrift Server does not execute the actual user queries, but forwards them to the appropriate user-specific Spark AM. This improves the scalability and fault tolerance of Spark Thrift Server.

When doAs is enabled for the Spark Thrift Server, the Thrift Server is responsible for the following features and capabilities:

- Authorizing incoming user connections as per SASL rules.
- Managing Spark applications launched on behalf of users:
  - Launching a Spark application if no appropriate application exists for the incoming request.
  - Terminating the Spark AM when all relevant user connections are closed at Spark Thrift Server.
- Acting as a proxy and forwarding requests/responses to the appropriate user’s Spark AM.
- Ensuring that users' long-running Spark SQL sessions are supported, by keeping the Kerberos state valid. Spark Thrift Server and the Spark AM launched on behalf of a user can be long-running applications in secure Kerberized clusters.
We do not require the submitter’s principal/keytab for a long-running user Spark AM. Note that Spark Thrift Server continues to require the hive principal and keytab.

Enhancements for Connection URL Support

The connection URL format for Hive is documented here for reference. In doAs mode, we have enhanced Spark Thrift Server to support a default database and Hive variables (hivevar).

Default Database in Connection URL

Specifying the connection URL as “jdbc:hive2://$HOST:$PORT/my_db” effectively results in an implicit “use my_db” when a user connects. For example, in the Beeline session described earlier, the !connect command specified the connection URL for “foo_db”.

Hive Variables Support

Hive variables can be used to parameterize queries. To set a Hive variable, use the set hivevar command:

   set hivevar:key=value

You can also set a Hive variable as part of the connection URL (similar to the Hive connection URL format). For example, in Beeline, plan=miles can be appended to the connection URL and then referenced in the query as ${hivevar:plan}.

Advanced Connection Management

By default, all connections for a user are forwarded to the same user Spark AM to execute queries. In some cases, it is necessary to exercise finer-grained control.

Named Connections

In doAs mode, we support user-named connections, identified by a user-specified connectionId, which is a Hive conf parameter in the connection URL. Named connections are useful when there is a need to override Spark configuration, for example to override the YARN queue or to specify different memory/cores for Spark executors.

Named connections are scoped to a user. For a user, an explicitly specified connectionId can be used to control which Spark AM executes the queries issued. If unspecified, a default implicit connectionId is associated with the Spark AM. If Spark Thrift Server is unable to find a Spark AM for the given (user, connectionId) combination, it launches a new Spark AM.
If one is already available, the user connection is associated with the existing Spark AM. To explicitly name a connection, use the Hive conf parameter “spark.sql.thriftServer.connectionId”.

Every Spark AM managed by Spark Thrift Server is associated with a user and a connectionId. Connection IDs are not globally unique; they are specific to the user. Named connections allow users to specify their own Spark AM connections. They do not allow a user to access the Spark AM associated with another user.

Data Sharing and Named Connections

Each connectionId for a user identifies a different Spark AM. For a user, cached data is shared and available only within a single AM, not across Spark AMs. Different user connections on the same Spark AM can leverage previously cached data. Each user connection has its own Hive session (which maintains the current database, Hive variables, and so on), but shares the underlying cached data, executors, and Spark application.

To illustrate, consider a first connection from user “foo” to named connection “conn1”. As expected, after caching the ‘drivers’ table, the query runs an order of magnitude faster. A second connection to the same connectionId from user “foo” is able to leverage the cached table from the other active Beeline session and significantly speed up query execution.

Overriding Spark Configuration Settings

If Spark Thrift Server is unable to find an existing Spark AM for a user connection, it launches a new Spark AM to service user queries. This applies to both named and unnamed connections. When a new Spark AM is to be launched, you can override current Spark configuration settings by specifying them in the connection URL.
Specify Spark configuration settings as hiveconf variables prefixed with ‘sparkconf’. The following connection URL includes a spark.executor.memory setting of 4 GB:

   jdbc:hive2://sandbox.hortonworks.com:10015/foo_db;principal=hive/_HOST@REALM.COM?spark.sql.thriftServer.connectionId=my_conn;sparkconf.spark.executor.memory=4g

The environment tab of the Spark application then shows the overridden value.
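The pieces of the connection URL discussed in this article (default database, connectionId, sparkconf overrides, and Hive variables) can be assembled programmatically. The sketch below is illustrative only: build_connection_url is a hypothetical helper, not an HDP or Hive API. It assumes the general Hive JDBC layout of session variables after ';', Hive conf entries after '?', and Hive variables after '#'; the article itself only says that hivevars are appended to the URL.

```python
def build_connection_url(host, port, database="", session_vars=None,
                         connection_id=None, spark_conf=None, hive_vars=None):
    """Assemble a Spark Thrift Server JDBC URL (hypothetical helper)."""
    # The database in the path acts as the implicit "use <db>" on connect.
    url = f"jdbc:hive2://{host}:{port}/{database}"

    # Session variables such as the Kerberos principal follow ';'.
    for key, value in (session_vars or {}).items():
        url += f";{key}={value}"

    # Hive conf section: the named connection plus sparkconf.* overrides.
    conf = {}
    if connection_id:
        conf["spark.sql.thriftServer.connectionId"] = connection_id
    for key, value in (spark_conf or {}).items():
        conf[f"sparkconf.{key}"] = value
    if conf:
        url += "?" + ";".join(f"{k}={v}" for k, v in conf.items())

    # Hive variables, referenced in queries as ${hivevar:name}.
    # Placing them after '#' is an assumption based on the Hive URL format.
    if hive_vars:
        url += "#" + ";".join(f"{k}={v}" for k, v in hive_vars.items())
    return url


if __name__ == "__main__":
    print(build_connection_url(
        "sandbox.hortonworks.com", 10015, "foo_db",
        session_vars={"principal": "hive/_HOST@REALM.COM"},
        connection_id="my_conn",
        spark_conf={"spark.executor.memory": "4g"},
    ))
```

With these arguments the helper reproduces the example URL from this article. Values are inserted verbatim, so any characters that need escaping in a URL must be handled by the caller.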

mmuralidharan · ‎03-16-2017

I could run 'Runner' without errors in local mode, so the code itself is probably not the issue. Can you paste the exception stack (and possibly options) that causes this to surface? Also, I am not sure why you are doing the runJob; it will essentially be a no-op in this case since the data is not cached. Regards, Mridul

mmuralidharan · ‎12-20-2016

org.apache.spark.Logging was a private API in 2.0. It was always marked as @Private, but was exposed until 1.6; this was tightened in 2.0 to remove its visibility. The bottom line is that you can't depend on it and have to modify your code to remove the dependency on the trait. Regards, Mridul

Last Visited	‎10-27-2017 11:29 PM

Member Since	‎12-20-2016 05:32 PM
Posts	6
Kudos received	6

Cloudera Community

Re: FileAlreadyExistsException when calling saveAs...

Re: User impersonation in Apache Spark 1.6 Thrift ...

Re: User impersonation in Apache Spark 1.6 Thrift ...

Re: User impersonation in Apache Spark 1.6 Thrift ...

User impersonation in Apache Spark 1.6 Thrift Serv...

Re: FileAlreadyExistsException when calling saveAs...

Re: Spark Testing base 1.6.1_0.3.3 for spark versi...