Created on 05-04-2017 08:42 PM
Security is one of the fundamental requirements for enterprise adoption. For SQL users in particular, row- and column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups through different engines such as Apache Spark™ 1.6/2.1 and Apache Hive, it is difficult to guarantee access control in a consistent way. In this article, we show how to use Apache Ranger™ to manage access control policies for Spark SQL. This is achieved by integrating Apache Ranger and Apache Spark via `spark-llap`. We also show how to provide finer-grained access control (row/column-level filtering and column masking) to Apache Spark.
For all use cases, make sure that the permission of the Hive warehouse directory is 700, which means normal users cannot access the secured tables directly through HDFS. In addition, make sure that `hive.warehouse.subdir.inherit.perms=true`, so that newly created tables inherit the permission 700 by default.
$ hadoop fs -ls /apps/hive
Found 1 items
drwx------   - hive hdfs          0 2017-07-10 17:04 /apps/hive/warehouse
Hive Interactive Query (LLAP) needs to be enabled.
The Hive Ranger plugin needs to be enabled.
In this article, we will use two user principals, `billing` and `datascience`. The `billing` principal can access all rows and columns, while the `datascience` principal can access only filtered and masked data. You can use the `hive` and `spark` principals instead.
First of all, find the following five values in the Hive configuration of your cluster via Ambari.
Set the following parameter.
*
Set the following two parameters.
hive.llap.task.keytab.file=/etc/security/keytabs/hive.service.keytab
hive.llap.task.principal=hive/_HOST@EXAMPLE.COM
For Spark2 shells (spark-shell, pyspark, sparkR, spark-sql) and applications, set up the following configurations via Ambari and restart the required services.
spark.hadoop.hive.llap.daemon.service.hosts=the value of hive.llap.daemon.service.hosts
spark.hadoop.hive.zookeeper.quorum=the value of hive.zookeeper.quorum
spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://YourHiveServer2HostName:10500/
spark.sql.hive.hiveserver2.jdbc.url.principal=the value of hive.server2.authentication.kerberos.principal
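If you only want to experiment from a single Spark application without changing the cluster-wide configuration via Ambari, it should also be possible to supply the same values when the session is created. The following is a minimal PySpark sketch; the LLAP application name, ZooKeeper quorum, and host names are placeholders, and the spark-llap assembly still has to be on the classpath (for example via the `--packages` option shown below).

from pyspark.sql import SparkSession

# A minimal sketch: pass the same spark-llap settings per application
# instead of setting them globally via Ambari. All values below are placeholders.
spark = SparkSession.builder \
    .appName("spark-llap-test") \
    .config("spark.sql.hive.llap", "true") \
    .config("spark.hadoop.hive.llap.daemon.service.hosts", "@llap0") \
    .config("spark.hadoop.hive.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181") \
    .config("spark.sql.hive.hiveserver2.jdbc.url",
            "jdbc:hive2://YourHiveServer2HostName:10500/") \
    .config("spark.sql.hive.hiveserver2.jdbc.url.principal",
            "hive/_HOST@EXAMPLE.COM") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("select * from db_spark.t_spark").show()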
For the Spark2 Thrift Server, set up the following configurations via Ambari and restart the required services.
--packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
Note that this is one line.
spark.hadoop.hive.llap.daemon.service.hosts=the value of hive.llap.daemon.service.hosts
spark.hadoop.hive.zookeeper.quorum=the value of hive.zookeeper.quorum
spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://YourHiveServer2HostName:10500/;hive.server2.proxy.user=${user}
spark.sql.hive.hiveserver2.jdbc.url.principal=the value of hive.server2.authentication.kerberos.principal
Because HDFS permissions prevent normal users from accessing Hive databases and tables directly, use the Beeline CLI with the `hive` principal to prepare a database and a table for the users who will use Spark-LLAP. Note that the following is an example schema setup for the demo scenario in this article. For audit-only scenarios, you do not need to create new databases and tables; instead, you can make all existing databases and tables accessible to all users.
$ kdestroy
$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/`hostname -f`@EXAMPLE.COM
$ beeline -u "jdbc:hive2://YourHiveServerHost:10500/default;principal=hive/_HOST@EXAMPLE.COM" -e "CREATE DATABASE db_spark; USE db_spark; CREATE TABLE t_spark(name STRING, gender STRING); INSERT INTO t_spark VALUES ('Barack Obama', 'M'), ('Michelle Obama', 'F'), ('Hillary Clinton', 'F'), ('Donald Trump', 'M');"
To use fine-grained access control, you need to set up Apache Ranger™ policies, which govern Spark and Hive together seamlessly from a single control center.
Open the `Ranger Admin UI`. The default login is `admin/admin`. After logging in, the `Access Manager` page shows nine service managers. The following screenshot shows that HDFS, Hive, and YARN policies exist. Since Spark shares the same policies with Hive, visit `Hive` among the service managers.
On the Hive Service Manager page, there are three tabs corresponding to three types of policies: `Access`, `Masking`, and `Row Level Filter`.
Let’s create some policies that control which rows and columns a user can access.
For example:
Name | Table | Column | Select User | Permissions |
spark_access | t_spark | * | billing | Select |
spark_access | t_spark | * | datascience | Select |
Name | Table | Column | Select User | Access Types | Select Masking Option |
spark_mask | t_spark | name | datascience | Select | partial mask:'show first 4' |
Name | Table | Access Types | Row Level Filter |
spark_filter | t_spark | Select | gender='M' |
In the HDFS Ranger plugin, add a policy `spark_tmp` to allow all access to `/tmp`.
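If you prefer to script the policy setup instead of clicking through the Ranger Admin UI, Ranger also exposes a public REST API. The following Python sketch creates the `spark_access` policy from the table above; the Ranger address, the admin credentials, and the Hive service name are placeholders you must adapt, and the exact policy JSON may vary with your Ranger version.

import requests

RANGER_URL = "http://your-ranger-host:6080"  # placeholder: Ranger Admin address
AUTH = ("admin", "admin")                    # default login mentioned above
SERVICE = "YourCluster_hive"                 # placeholder: name of the Hive service in Ranger

# Equivalent of the `spark_access` policy in the table above.
policy = {
    "service": SERVICE,
    "name": "spark_access",
    "isEnabled": True,
    "resources": {
        "database": {"values": ["db_spark"]},
        "table": {"values": ["t_spark"]},
        "column": {"values": ["*"]},
    },
    "policyItems": [
        {
            "users": ["billing", "datascience"],
            "accesses": [{"type": "select", "isAllowed": True}],
        }
    ],
}

resp = requests.post(RANGER_URL + "/service/public/v2/api/policy", json=policy, auth=AUTH)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))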
A user can access the Spark Thrift Server via Beeline or Apache Zeppelin™. First, based on the Kerberos principal, a user can see only the data they are allowed to access.
$ kdestroy
$ kinit billing/billing@EXAMPLE.COM
$ beeline -u "jdbc:hive2://YourSparkThriftServer:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM" -e 'select * from db_spark.t_spark'
+------------------+---------+--+
|       name       | gender  |
+------------------+---------+--+
| Barack Obama     | M       |
| Michelle Obama   | F       |
| Hillary Clinton  | F       |
| Donald Trump     | M       |
+------------------+---------+--+
$ kdestroy
$ kinit datascience/datascience@EXAMPLE.COM
$ beeline -u "jdbc:hive2://YourSparkThriftServer:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM" -e 'select * from db_spark.t_spark'
+---------------+---------+--+
|     name      | gender  |
+---------------+---------+--+
| Baraxx xxxxx  | M       |
| Donaxx xxxxx  | M       |
+---------------+---------+--+
Second, in the case of Zeppelin, a proxy user name is used. You can watch this YouTube demo to see how Zeppelin works. The following example illustrates the use of a proxy user name with beeline; Zeppelin does the same thing via JDBC under the hood.
$ kdestroy
$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/`hostname -f`@EXAMPLE.COM
$ beeline -u "jdbc:hive2://YourSparkThriftServerHost:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM;hive.server2.proxy.user=billing" -e "select * from db_spark.t_spark"
+------------------+---------+--+
|       name       | gender  |
+------------------+---------+--+
| Barack Obama     | M       |
| Michelle Obama   | F       |
| Hillary Clinton  | F       |
| Donald Trump     | M       |
+------------------+---------+--+
$ beeline -u "jdbc:hive2://YourSparkThriftServerHost:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM;hive.server2.proxy.user=datascience" -e "select * from db_spark.t_spark"
+---------------+---------+--+
|     name      | gender  |
+---------------+---------+--+
| Baraxx xxxxx  | M       |
| Donaxx xxxxx  | M       |
+---------------+---------+--+
A user can run `spark-shell` or `pyspark` as follows. Please note that the user can access their own data sources in addition to the secured data sources provided by `spark-llap`. For the next example, log in as the user `spark`.
$ kdestroy
$ kinit billing/billing@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 spark-shell --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true

scala> sql("select * from db_spark.t_spark").show
+---------------+------+
|           name|gender|
+---------------+------+
|   Barack Obama|     M|
| Michelle Obama|     F|
|Hillary Clinton|     F|
|   Donald Trump|     M|
+---------------+------+

$ kdestroy
$ kinit datascience/datascience@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 spark-shell --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true

scala> sql("select * from db_spark.t_spark").show
+------------+------+
|        name|gender|
+------------+------+
|Baraxx xxxxx|     M|
|Donaxx xxxxx|     M|
+------------+------+
$ kdestroy
$ kinit billing/billing@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 pyspark --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true

>>> sql("select * from db_spark.t_spark").show()
+---------------+------+
|           name|gender|
+---------------+------+
|   Barack Obama|     M|
| Michelle Obama|     F|
|Hillary Clinton|     F|
|   Donald Trump|     M|
+---------------+------+

$ kdestroy
$ kinit datascience/datascience@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 pyspark --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true

>>> sql("select * from db_spark.t_spark").show()
+------------+------+
|        name|gender|
+------------+------+
|Baraxx xxxxx|     M|
|Donaxx xxxxx|     M|
+------------+------+
$ kdestroy
$ kinit billing/billing@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 sparkR --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true

> head(sql("select * from db_spark.t_spark"))
             name gender
1    Barack Obama      M
2  Michelle Obama      F
3 Hillary Clinton      F
4    Donald Trump      M

$ kdestroy
$ kinit datascience/datascience@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 sparkR --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true

> head(sql("select * from db_spark.t_spark"))
          name gender
1 Baraxx xxxxx      M
2 Donaxx xxxxx      M
$ kdestroy
$ kinit billing/billing@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 spark-sql --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true

spark-sql> select * from db_spark.t_spark;
Barack Obama	M
Michelle Obama	F
Hillary Clinton	F
Donald Trump	M

$ kdestroy
$ kinit datascience/datascience@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 spark-sql --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true

spark-sql> select * from db_spark.t_spark;
Baraxx xxxxx	M
Donaxx xxxxx	M
A user can submit a Spark job as follows. As in the `spark-shell` scenario, the user can access their own data sources in addition to the secured data sources provided by `spark-llap`.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark LLAP SQL Python") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("show databases").show()
spark.sql("select * from db_spark.t_spark").show()
spark.stop()
Launch the application in YARN client mode.
SPARK_MAJOR_VERSION=2 spark-submit --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --master yarn --deploy-mode client --conf spark.sql.hive.llap=true spark_llap_sql.py
YARN cluster mode will be supported in the next HDP release. You can also download the example at spark_llap_sql.py.
Please refer to https://github.com/hortonworks-spark/spark-llap/wiki/7.-Support-Matrix
Known Issues
If you see warning messages like the following in Spark shells, you can turn them off via `sc.setLogLevel` or `conf/log4j.properties`.
scala> sql("select * from db_spark.t_spark").show
...
17/03/09 22:06:26 WARN TaskSetManager: Stage 5 contains a task of very large size (248 KB). The maximum recommended task size is 100 KB.
17/03/09 22:06:27 WARN LlapProtocolClientProxy: RequestManager shutdown with error java.util.concurrent.CancellationException
...
scala> sc.setLogLevel("ERROR")
scala> sql("select * from db_spark.t_spark").show
...
Created on 06-10-2017 08:10 AM
Created on 06-23-2017 07:17 PM
Sorry for the late response. Is your cluster connected to the internet to download the jar file? What error did you see?
Created on 08-23-2017 12:48 AM
The same problem here, @Dongjoon Hyun: the cluster is not connected to the Internet, and browsing http://repo.hortonworks.com/content/groups/public/com/hortonworks/spark/spark-llap-assembly_2.11/1.1... returns no jars, only a POM for Maven.
Created on 08-23-2017 01:09 AM
That's the assembly jar location for Spark `--packages`. If you want a standalone jar for a cluster without an internet connection, please look at this:
http://repo.hortonworks.com/content/groups/public/com/hortonworks/spark/spark-llap_2.11/1.1.3-2.1/
Created on 08-24-2017 02:11 AM
Thank you for your reply. The problem is resolved; the cause was an error while downloading the jar file.
In addition, I have another problem: after turning on Hive LLAP, Spark Thrift Server access to Hive slows down. I tried modifying some of the Hive LLAP related parameters, but there was no significant effect. I need your professional advice and help. Can I get your email address for more help? Hoping for your quick answer, thanks!
Created on 08-24-2017 06:45 PM
Regarding your emails, I already answered your questions personally, and the Hortonworks support team can help you with further professional advice and help.
Created on 08-26-2017 01:31 AM
Thanks, I received your reply. I will follow your blog articles and GitHub code about LLAP.
Created on 09-19-2017 07:26 AM - edited 08-17-2019 12:59 PM
Spark Thrift Server JDBC error:
Created on 09-19-2017 06:29 PM
Hi, @chenhao chenhao
As you can see in the image, the error comes from the Hive side: "Invalid function `get_splits`". You seem to be using an old version of Hive. Please use the correct version of Hive in HDP.
Created on 01-16-2018 04:44 AM
Hello,
I followed the instructions with HDP 2.6.3.0; however, Spark2 Thrift Server stops right after starting, with the following error in /var/log/spark2/spark-hive-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-<HOSTNAME>.out:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with LLAP support because LLAP or Hive classes are not found.
	at org.apache.spark.sql.SparkSession$.isLLAPEnabled(SparkSession.scala:1104)
	at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$externalCatalogClassName(SharedState.scala:174)
	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:95)
	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:53)
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:81)
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Looks like this error indicates that Spark2 Thrift Server fails to load the org.apache.hadoop.hive.conf.HiveConf or org.apache.spark.sql.hive.llap.LlapSessionStateBuilder classes. I found that com.hortonworks.spark_spark-llap_2.11-1.1.3-2.1.jar, which Spark2 Thrift Server is using, does not contain org.apache.hadoop.hive.conf.HiveConf but only shadehive.org.apache.hadoop.hive.conf.HiveConf.
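In case it helps, here is roughly how I checked the jar contents (a jar is just a zip archive); the path below is simply where the file sits on my node:

import zipfile

# Placeholder path: adjust to wherever the spark-llap jar is on your node.
jar_path = "com.hortonworks.spark_spark-llap_2.11-1.1.3-2.1.jar"

with zipfile.ZipFile(jar_path) as jar:
    # List every HiveConf class packaged inside the jar.
    hits = [name for name in jar.namelist() if name.endswith("hive/conf/HiveConf.class")]

# Only the shaded copy (shadehive/org/apache/hadoop/hive/conf/HiveConf.class) shows up.
print(hits)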
Can I ask if it's a bug? Can I also ask if there is a workaround?
Thank you in advance,
Mai Nakagawa