1. Goal

Security is one of the fundamental features for enterprise adoption. In particular, row/column-level access control is important for SQL users. However, when a cluster is used as a data warehouse accessed by various user groups through different engines, such as Apache Spark™ 1.6/2.1 and Apache Hive, it is difficult to guarantee access control consistently. In this article, we show how to use Apache Ranger™ to manage access control policies for Spark SQL. This is done by integrating Apache Ranger and Apache Spark via `spark-llap`. We also show how to provide finer-grained access control (row/column-level filtering and column masking) in Apache Spark.

2. Key Features

  • Shared Policies: The data in a cluster can be shared securely and consistently controlled by the shared access rules between Apache Spark and Apache Hive.
  • Audits: All security activities can be monitored and searched in a single place, i.e., Apache Ranger.
  • Resource management: Each user can use a different queue while accessing the securely shared Hive data.

3. Environment

  • HDP 2.6.1.0 with Spark2, Hive and Ranger on a Kerberized cluster
  • SPARK-LLAP: 1.1.3-2.1

4. Assumption

4.1. HDFS Permission

For all use cases, make sure that the permission of the Hive warehouse directory is 700, so that normal users cannot access the secured tables directly. In addition, make sure that `hive.warehouse.subdir.inherit.perms=true`; with this, newly created tables inherit the 700 permission by default.

$ hadoop fs -ls /apps/hive
Found 1 items
drwx------ - hive hdfs 0 2017-07-10 17:04 /apps/hive/warehouse
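This check can also be scripted. The sketch below is illustrative only; `is_owner_only` is a hypothetical helper in plain Python that parses the permission column of `hadoop fs -ls` output and confirms that group and others have no access.

```python
def is_owner_only(perm):
    """Return True if an ls-style permission string (e.g. 'drwx------')
    grants no access to group or others."""
    # perm[0] is the file type; owner bits are perm[1:4],
    # group bits perm[4:7], other bits perm[7:10]
    return perm[4:10] == "------"

line = "drwx------   - hive hdfs          0 2017-07-10 17:04 /apps/hive/warehouse"
perm = line.split()[0]
print(is_owner_only(perm))  # True: only the hive user can enter the warehouse
```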

4.2. Interactive Query

Hive Interactive Query must be enabled.


4.3. Ranger Hive Plugin

The Hive Ranger plugin must be enabled.


4.4. User Principals

In this article, we will use two user principals, `billing` and `datascience`. The `billing` principal can access all rows and columns, while the `datascience` principal can access only filtered and masked data. You can use the `hive` and `spark` principals instead.

5. Configurations

5.1. Find the existing configuration of your cluster

First of all, find the following five values from Hive configuration via Ambari for your cluster.

  • HiveServer2 Interactive Host name
    • You need a plain host name, not the ZooKeeper-based JDBC URL.
  • Spark Thrift Server Host
    • This value will be used when running the example code.
  • The value for hive.llap.daemon.service.hosts


  • The value for hive.zookeeper.quorum


  • The value for hive.server2.authentication.kerberos.principal


5.2. Setup HDFS

Set the following parameter.

  • `Custom core-site` → `hadoop.proxyuser.hive.hosts`
    • Setting this to `*` allows all hosts to submit Spark jobs with SPARK-LLAP.
hadoop.proxyuser.hive.hosts=*

5.3. Setup Hive

Set the following two parameters.

  • `Custom hive-interactive-site`
    • These take the same values as `hive.llap.daemon.keytab.file` and `hive.llap.daemon.service.principal` in the same `Custom hive-interactive-site`.
hive.llap.task.keytab.file=/etc/security/keytabs/hive.service.keytab
hive.llap.task.principal=hive/_HOST@EXAMPLE.COM
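The `_HOST` token in these principals is a standard Hadoop convention: at runtime it is replaced with the node's fully qualified domain name. The substitution can be sketched in a few lines of plain Python (illustrative only; `expand_host` is a made-up helper name):

```python
def expand_host(principal, fqdn):
    """Mimic Hadoop's Kerberos principal expansion: the literal _HOST
    token is replaced by the node's fully qualified domain name."""
    return principal.replace("_HOST", fqdn)

print(expand_host("hive/_HOST@EXAMPLE.COM", "node1.example.com"))
# hive/node1.example.com@EXAMPLE.COM
```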

5.4. Setup Spark2

For Spark2 shells (spark-shell, pyspark, sparkR, spark-sql) and applications, setup the following configurations via Ambari and restart required services.

  • `Custom spark2-default`
spark.hadoop.hive.llap.daemon.service.hosts=the value of hive.llap.daemon.service.hosts
spark.hadoop.hive.zookeeper.quorum=The value of hive.zookeeper.quorum
spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://YourHiveServer2HostName:10500/
spark.sql.hive.hiveserver2.jdbc.url.principal=The value of hive.server2.authentication.kerberos.principal
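Assembled from the values found in section 5.1, the four settings fit together as below. This sketch uses made-up placeholder values (`hs2.example.com`, `@llap0`, the ZooKeeper hosts); it just prints the lines you would paste into `Custom spark2-default`:

```python
# Hypothetical values discovered in section 5.1; replace with your cluster's.
llap_hosts = "@llap0"
zk_quorum = "zk1.example.com:2181,zk2.example.com:2181"
hs2_host = "hs2.example.com"          # HiveServer2 Interactive host name
kerberos_principal = "hive/_HOST@EXAMPLE.COM"

conf = {
    "spark.hadoop.hive.llap.daemon.service.hosts": llap_hosts,
    "spark.hadoop.hive.zookeeper.quorum": zk_quorum,
    # Port 10500 is the HiveServer2 Interactive (LLAP) port.
    "spark.sql.hive.hiveserver2.jdbc.url": f"jdbc:hive2://{hs2_host}:10500/",
    "spark.sql.hive.hiveserver2.jdbc.url.principal": kerberos_principal,
}
for key, value in conf.items():
    print(f"{key}={value}")
```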

5.5. Setup Spark2 Thrift Server

For Spark2 Thrift Server, setup the following configurations via Ambari and restart required services.

  • `Advanced spark2-env` → `spark_thrift_cmd_opts`
--packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true

Note that this is one line.

  • `Custom spark2-thrift-sparkconf`
    • Note that `spark.sql.hive.hiveserver2.jdbc.url` additionally has `;hive.server2.proxy.user=${user}` for impersonation.
spark.hadoop.hive.llap.daemon.service.hosts=the value of hive.llap.daemon.service.hosts
spark.hadoop.hive.zookeeper.quorum=The value of hive.zookeeper.quorum
spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://YourHiveServer2HostName:10500/;hive.server2.proxy.user=${user}
spark.sql.hive.hiveserver2.jdbc.url.principal=The value of hive.server2.authentication.kerberos.principal

6. Prepare database

Since normal users cannot access Hive databases and tables at this point due to the HDFS permissions, use the Beeline CLI with the `hive` principal to prepare a database and a table for users who will use Spark-LLAP. Note that the following is an example schema setup for the demo scenario in this article. For audit-only scenarios, you do not need to create new databases and tables; instead, you can make all existing databases and tables accessible to all users.

$ kdestroy

$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/`hostname -f`@EXAMPLE.COM

$ beeline -u "jdbc:hive2://YourHiveServerHost:10500/default;principal=hive/_HOST@EXAMPLE.COM" -e "CREATE DATABASE db_spark; USE db_spark; CREATE TABLE t_spark(name STRING, gender STRING); INSERT INTO t_spark VALUES ('Barack Obama', 'M'), ('Michelle Obama', 'F'), ('Hillary Clinton', 'F'), ('Donald Trump', 'M');"

7. Security Policies

To use fine-grained access control, you need to set up Apache Ranger™ policies, which govern Spark and Hive together seamlessly from a single control center.

7.1. Ranger Admin UI

Open the `Ranger Admin UI`. The default login is `admin/admin`. After login, the `Access Manager` page shows nine service managers; in this example cluster, HDFS, Hive, and YARN policies exist. Since Spark shares its policies with Hive, visit `Hive` among the service managers.


On the Hive Service Manager page, there are three tabs corresponding to three types of policies: `Access`, `Masking`, and `Row Level Filter`.


7.2. Example Policies

Let’s create some policies that restrict which rows and columns each user can access.

For example:

  • Both the `billing` and `datascience` principals can access the `db_spark` database. The other databases are not accessible by default due to the HDFS permissions.
  • The `billing` principal can see all rows and columns of the `t_spark` table in the `db_spark` database.
  • The `datascience` principal can see only the first 4 characters of the `name` field of the `t_spark` table in the `db_spark` database; the remaining part of the `name` field is masked.
  • The `datascience` principal can see only the rows matching the `gender='M'` predicate.
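Ranger's `partial mask: 'show first 4'` keeps the first four characters and masks the rest; combined with the `gender='M'` row filter, the `datascience` principal's view can be simulated in plain Python. This is an illustrative sketch, assuming (based on the sample outputs later in this article) that non-alphanumeric characters such as spaces are left unmasked:

```python
rows = [("Barack Obama", "M"), ("Michelle Obama", "F"),
        ("Hillary Clinton", "F"), ("Donald Trump", "M")]

def mask_show_first_4(value):
    # Keep the first 4 characters; replace the remaining alphanumeric
    # characters with 'x'. Spaces are preserved, matching the sample
    # output shown later in this article.
    head, tail = value[:4], value[4:]
    return head + "".join("x" if ch.isalnum() else ch for ch in tail)

# What the datascience principal sees: rows filtered by gender='M',
# with the name column partially masked.
visible = [(mask_show_first_4(name), gender) for name, gender in rows
           if gender == "M"]
print(visible)  # [('Baraxx xxxxx', 'M'), ('Donaxx xxxxx', 'M')]
```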

7.2.1. Access policy in db_spark database

Name | Table | Column | User | Permissions
spark_access | t_spark | * | billing | Select
spark_access | t_spark | * | datascience | Select


7.2.2. Masking policy in db_spark database

Name | Table | Column | User | Access Types | Select Masking Option
spark_mask | t_spark | name | datascience | Select | partial mask: 'show first 4'


7.2.3. Row Level Filter policy in db_spark database

Name | Table | Access Types | Row Level Filter
spark_filter | t_spark | Select | gender='M'


7.2.4. HDFS policy for spark

In the HDFS Ranger plugin, add a rule `spark_tmp` to allow all access on `/tmp`.

8. Target Scenarios

Case 1: Secure Spark Thrift Server

A user can access the Spark Thrift Server via Beeline or Apache Zeppelin™. First, based on the Kerberos principal, the user can see only the accessible data.

$ kdestroy

$ kinit billing/billing@EXAMPLE.COM

$ beeline -u "jdbc:hive2://YourSparkThriftServer:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM" -e 'select * from db_spark.t_spark'

+------------------+---------+--+
|       name       | gender  |
+------------------+---------+--+
| Barack Obama     | M       |
| Michelle Obama   | F       |
| Hillary Clinton  | F       |
| Donald Trump     | M       |
+------------------+---------+--+

$ kdestroy

$ kinit datascience/datascience@EXAMPLE.COM

$ beeline -u "jdbc:hive2://YourSparkThriftServer:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM" -e 'select * from db_spark.t_spark'
+---------------+---------+--+
|     name      | gender  |
+---------------+---------+--+
| Baraxx xxxxx  | M       |
| Donaxx xxxxx  | M       |
+---------------+---------+--+

Second, in the case of Zeppelin, a proxy user name is used. You can watch this YouTube demo to see how Zeppelin works. The following example illustrates the use of a proxy user name with Beeline; Zeppelin does the same thing via JDBC under the hood.

$ kdestroy

$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/`hostname -f`@EXAMPLE.COM

$ beeline -u "jdbc:hive2://YourSparkThriftServerHost:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM;hive.server2.proxy.user=billing" -e "select * from db_spark.t_spark"
+------------------+---------+--+
|       name       | gender  |
+------------------+---------+--+
| Barack Obama     | M       |
| Michelle Obama   | F       |
| Hillary Clinton  | F       |
| Donald Trump     | M       |
+------------------+---------+--+

$ beeline -u "jdbc:hive2://YourSparkThriftServerHost:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM;hive.server2.proxy.user=datascience" -e "select * from db_spark.t_spark"
+---------------+---------+--+
|     name      | gender  |
+---------------+---------+--+
| Baraxx xxxxx  | M       |
| Donaxx xxxxx  | M       |
+---------------+---------+--+
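The impersonation URL that Zeppelin (or any JDBC client) sends can be sketched as follows. `proxy_jdbc_url` is a hypothetical helper and the host name is a placeholder; the point is the `hive.server2.proxy.user` session parameter appended after the Kerberos principal:

```python
def proxy_jdbc_url(host, port, db, principal, proxy_user):
    """Build a HiveServer2-style JDBC URL that asks the Spark Thrift
    Server to run queries as proxy_user (illustrative helper only)."""
    return (f"jdbc:hive2://{host}:{port}/{db};"
            f"principal={principal};hive.server2.proxy.user={proxy_user}")

print(proxy_jdbc_url("sts.example.com", 10016, "db_spark",
                     "hive/_HOST@EXAMPLE.COM", "billing"))
# jdbc:hive2://sts.example.com:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM;hive.server2.proxy.user=billing
```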

Case 2: Shells

A user can run `spark-shell` or `pyspark` as follows. Note that the user can access their own data sources in addition to the secure data source provided by `spark-llap`. For the next example, log in as the user `spark`.

  • spark-shell
$ kdestroy

$ kinit billing/billing@EXAMPLE.COM

$ SPARK_MAJOR_VERSION=2 spark-shell --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
scala> sql("select * from db_spark.t_spark").show
+---------------+------+
| name|gender|
+---------------+------+
| Barack Obama| M|
| Michelle Obama| F|
|Hillary Clinton| F|
| Donald Trump| M|
+---------------+------+

$ kdestroy

$ kinit datascience/datascience@EXAMPLE.COM

$ SPARK_MAJOR_VERSION=2 spark-shell --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
scala> sql("select * from db_spark.t_spark").show
+------------+------+
| name|gender|
+------------+------+
|Baraxx xxxxx| M|
|Donaxx xxxxx| M|
+------------+------+
  • pyspark
$ kdestroy

$ kinit billing/billing@EXAMPLE.COM

$ SPARK_MAJOR_VERSION=2 pyspark --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true

>>> sql("select * from db_spark.t_spark").show()
+---------------+------+
| name|gender|
+---------------+------+
| Barack Obama| M|
| Michelle Obama| F|
|Hillary Clinton| F|
| Donald Trump| M|
+---------------+------+

$ kdestroy

$ kinit datascience/datascience@EXAMPLE.COM

$ SPARK_MAJOR_VERSION=2 pyspark --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true

>>> sql("select * from db_spark.t_spark").show()
+------------+------+
| name|gender|
+------------+------+
|Baraxx xxxxx| M|
|Donaxx xxxxx| M|
+------------+------+
  • sparkR
$ kdestroy

$ kinit billing/billing@EXAMPLE.COM

$ SPARK_MAJOR_VERSION=2 sparkR --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
> head(sql("select * from db_spark.t_spark"))
name gender
1 Barack Obama M
2 Michelle Obama F
3 Hillary Clinton F
4 Donald Trump M

$ kdestroy

$ kinit datascience/datascience@EXAMPLE.COM

$ SPARK_MAJOR_VERSION=2 sparkR --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
> head(sql("select * from db_spark.t_spark"))
name gender
1 Baraxx xxxxx M
2 Donaxx xxxxx M
  • spark-sql
$ kdestroy

$ kinit billing/billing@EXAMPLE.COM

$ SPARK_MAJOR_VERSION=2 spark-sql --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
spark-sql> select * from db_spark.t_spark;
Barack Obama	M
Michelle Obama	F
Hillary Clinton	F
Donald Trump	M

$ kdestroy

$ kinit datascience/datascience@EXAMPLE.COM

$ SPARK_MAJOR_VERSION=2 spark-sql --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
spark-sql> select * from db_spark.t_spark;
Baraxx xxxxx	M
Donaxx xxxxx	M

Case 3: Applications

A user can submit a Spark job as follows. As in the `spark-shell` scenario, the user can access their own data sources in addition to the secure data source provided by `spark-llap`.

from pyspark.sql import SparkSession

# Create a Hive-enabled session. With spark.sql.hive.llap=true,
# queries are routed through spark-llap and the Ranger policies apply.
spark = SparkSession \
    .builder \
    .appName("Spark LLAP SQL Python") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("show databases").show()
spark.sql("select * from db_spark.t_spark").show()
spark.stop()

Launch the app in YARN client mode.

SPARK_MAJOR_VERSION=2 spark-submit --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --master yarn --deploy-mode client --conf spark.sql.hive.llap=true spark_llap_sql.py

YARN cluster mode will be supported in the next HDP release. You can also download the example as spark_llap_sql.py.

Appendix

Support Matrix

Please refer to https://github.com/hortonworks-spark/spark-llap/wiki/7.-Support-Matrix

Known Issues

Warning logs on CancellationException

If you see the following warning messages in Spark shells, you can turn them off via `sc.setLogLevel` or `conf/log4j.properties`.

scala> sql("select * from db_spark.t_spark").show
...
17/03/09 22:06:26 WARN TaskSetManager: Stage 5 contains a task of very large size (248 KB). The maximum recommended task size is 100 KB.
17/03/09 22:06:27 WARN LlapProtocolClientProxy: RequestManager shutdown with error
java.util.concurrent.CancellationException
...
scala> sc.setLogLevel("ERROR")
scala> sql("select * from db_spark.t_spark").show
...
Comments

Hi @Mai Nakagawa,

You are using a mismatched jar file, as the first exception message shows: the LLAP or Hive classes are not found.

This article is about HDP 2.6.1 using Spark 2.1.1.

Since HDP 2.6.3, `spark-llap` for Spark 2.2 is built in. Please use it.

$ ls -al /usr/hdp/2.6.3.0-235/spark_llap/spark-llap-assembly-1.0.0.2.6.3.0-235.jar
-rw-r--r-- 1 root root 61306448 Oct 30 02:39 /usr/hdp/2.6.3.0-235/spark_llap/spark-llap-assembly-1.0.0.2.6.3.0-235.jar

Thank you @Dongjoon Hyun! Confirmed that it works in HDP 2.6.3 after replacing the jar file with spark-llap-assembly-1.0.0.2.6.3.0-235.jar.


Thank you for confirming.


Hello,

My environment is HDP 3.0, Spark 2.3.1 (Scala 2.11), and Hive 3.0, with Kerberos enabled. I followed the steps above, connected to the Spark Thrift Server, and executed the SQL `explain select * from tb1`. The physical plan shows `HiveTableScan` / `HiveTableRelation` with `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` instead of `LlapRelation`, so it seems that LLAP does not work.

P.S. I use the package spark-llap_2-11-1.0.2.1-assembly.jar.