Member since 12-27-2016
Posts: 73
Kudos Received: 34
Solutions: 5
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 17220 | 03-23-2018 09:21 PM
 | 1006 | 02-05-2018 07:08 PM
 | 4879 | 01-15-2018 07:21 PM
 | 849 | 12-01-2017 06:35 PM
 | 2696 | 03-09-2017 06:21 PM
01-16-2018
04:55 PM
As of now, Apache JIRA shows `Maintenance in progress`, so I cannot give you the direct link. The umbrella ORC JIRA is https://issues.apache.org/jira/browse/SPARK-20901.
01-16-2018
04:54 PM
If you can wait for it, Apache Spark 2.3 will be released with Apache ORC 1.4.1. There are many ORC patches in Hive, and Apache Spark cannot sync them promptly. So, in Apache Spark, we decided to use the latest ORC 1.4.1 library instead of upgrading the Hive 1.2.1 library. From Apache Spark 2.3, Hive ORC tables are converted into ORC data source tables by default and are read with the ORC 1.4.1 library. Not only your issue but also vectorization on ORC is supported. Anyway, again, HDP 2.6.3+ already ships with ORC 1.4.1 and vectorization, too.
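For reference, a minimal sketch of what this looks like once Spark 2.3 is available (the configuration keys `spark.sql.orc.impl` and `spark.sql.orc.enableVectorizedReader` and the table name are assumptions to verify against the 2.3 documentation):
// Spark 2.3 sketch: pick the new ORC 1.4.1-based reader and its vectorized path.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
// With convertMetastoreOrc, Hive ORC tables are read as ORC data source tables.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.sql("SELECT COUNT(*) FROM my_hive_orc_table").show()  // hypothetical table name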
01-15-2018
07:21 PM
2 Kudos
Hi, @Rajiv Chodisetti . It's related to HIVE-13232 (fixed in Hive 1.3.0, 2.0.1, 2.1.0), but all Apache Spark releases still use the Hive 1.2.1 library. Could you try HDP 2.6.3+ (2.6.4 is the latest one)? Spark 2.2 in HDP has that fixed Hive library.
01-12-2018
06:33 PM
Unfortunately, it's not supported in HDP 2.5.5. BTW, I'm wondering whether you specifically need Spark 2.1. If you want to download and install it yourself, the latest release is Apache Spark 2.2.1. In addition, Apache Spark 2.3.0 will be released very soon.
01-10-2018
06:17 PM
Hi, @Jerrell Schivers . Unfortunately, yes. It's expected due to the lack of vectorization support. The upcoming Apache Spark 2.3 supports it (https://issues.apache.org/jira/browse/SPARK-16060). However, you can already taste it in HDP 2.6.3 with Spark 2.2. Please refer to the following document. https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
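To get a quick taste of the difference on HDP 2.6.3, here is a minimal sketch using the `spark.sql.orc.enabled` flag described in that article (the path is a hypothetical placeholder):
// HDP 2.6.3 / Spark 2.2: compare the vectorized ORC reader with the old one.
sql("SET spark.sql.orc.enabled=true")
spark.time(spark.read.format("orc").load("/tmp/orc_data").count)   // new, vectorized reader
sql("SET spark.sql.orc.enabled=false")
spark.time(spark.read.format("orc").load("/tmp/orc_data").count)   // old reader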
01-02-2018
08:56 PM
Let's ping the maintainer of SHC. @wyang, could you help @Eric Hanson?
01-02-2018
08:37 PM
Hi, @Eric Hanson . SHC seems to work for both Spark 1.6.3 and Spark 2.2. Could you share your specific problem with SHC here?
12-05-2017
12:05 AM
I see. Yes, Ranger and Parquet do. I believe you can find a way to meet your requirements!
12-04-2017
06:39 PM
I'm wondering about the use case, because both `spark.sql.groupByOrdinal` and `spark.sql.orderByOrdinal` are true by default in Spark.
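For context, a minimal sketch of what those flags control (the table `t` and its columns are hypothetical):
// With spark.sql.groupByOrdinal=true and spark.sql.orderByOrdinal=true (the defaults),
// the integers below refer to positions in the SELECT list: 1 = gender, 2 = the count.
sql("SELECT gender, COUNT(*) FROM t GROUP BY 1 ORDER BY 2 DESC").show()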
12-04-2017
06:32 PM
Which Spark version do you use, and could you post a short example of your SQL here?
12-04-2017
05:12 PM
1 Kudo
In addition to that, STS (Spark Thrift Server) has supported Spark SQL syntax since v2.0.0. If you want to use Spark SQL syntax with SQL:2003 support, it's a good choice. Also, you can use Spark-specific syntax like `CACHE TABLE`, too.
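For example, a minimal sketch of that Spark-specific syntax, shown from a spark-shell (the table name `t1` is hypothetical); the same statements can be sent to STS over JDBC:
sql("CACHE TABLE t1")                  // Spark-specific: pin the table in memory
sql("SELECT COUNT(*) FROM t1").show()  // served from the cached data
sql("UNCACHE TABLE t1")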
12-01-2017
06:35 PM
Sorry, @Subramaniam Ramasubramanian. You cannot connect to your Spark-shell via JDBC.
11-20-2017
10:11 AM
5 Kudos
1. Introduction
1.1 HDP 2.6.3 provides Apache Spark 2.2 with Apache ORC 1.4
Since Apache Spark 1.4.1, Spark has supported ORC as one of its FileFormats. This article introduces how to use another, faster ORC file format with Apache Spark 2.2 in HDP 2.6.3. First, in order to show how to choose a FileFormat, Section 1.2 shows an example that writes and reads with ORCFileFormat. Section 2 shows a brief performance comparison, and Section 3 explains more use cases and ORC configurations. Section 4 summarizes the ORC-related Apache Spark fixes included in HDP 2.6.3.
1.2 Usage Example: Write and Read with ORCFileFormat
%spark2.spark
// Save 5 rows into an ORC file.
spark.range(5).write.format("orc").mode("overwrite").save("/tmp/orc")
// Read a DataFrame from the ORC file with the existing ORCFileFormat.
spark.read.format("orc").load("/tmp/orc").count
// Read a DataFrame from the ORC file with a new ORCFileFormat.
spark.read.format("org.apache.spark.sql.execution.datasources.orc").load("/tmp/orc").count
res4: Long = 5
res7: Long = 5
2. Performance Comparison
Here, I’ll show you a small and quick performance comparison to show the difference. For a TPC-DS 10TB performance comparison, please refer to the presentation at DataWorks Summit in the Reference section.
2.1 Prepare test data (about 100 million rows)
%spark2.spark
val df = spark.range(200000000).sample(true, 0.5)
df.write.format("orc").mode("overwrite").save("/tmp/orc_100m")
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
2.2 See the difference in 10 seconds
%spark2.spark
// New ORC file format
spark.time(spark.read.format("org.apache.spark.sql.execution.datasources.orc").load("/tmp/orc_100m").count)
// Old ORC file format
spark.time(spark.read.format("orc").load("/tmp/orc_100m").count)
Time taken: 345 ms
res10: Long = 100000182
Time taken: 3518 ms
res12: Long = 100000182
3. How does it work?
3.1 Vectorization
The new ORC file format in HDP 2.6.3, org.apache.spark.sql.execution.datasources.orc, is faster than the old ORC file format. The performance difference comes from vectorization. Apache Spark has ColumnarBatch and Apache ORC has RowBatch separately. By combining these two vectorization techniques, we achieve the performance gain shown above. Previously, Apache Spark took advantage of its ColumnarBatch format only with Apache Parquet. In addition, the Apache Spark community has been putting effort into SPARK-20901, Feature parity for ORC with Parquet. Recently, with the new Apache ORC 1.4.1 (released October 16th), ORC support in Spark has become more stable and faster.
3.2 Do you want to use the new ORC file format by default? Here is `spark.sql.orc.enabled` for that.
%spark2.spark
sql("SET spark.sql.orc.enabled=true")
spark.time(spark.read.format("orc").load("/tmp/orc_100m").count)
sql("SET spark.sql.orc.enabled=false")
spark.time(spark.read.format("orc").load("/tmp/orc_100m").count)
res13: org.apache.spark.sql.DataFrame = [key: string, value: string]
Time taken: 273 ms
res14: Long = 100000182
res16: org.apache.spark.sql.DataFrame = [key: string, value: string]
Time taken: 4083 ms
res17: Long = 100000182
3.3 Does it work with SQL, too? Yes, it does!
%spark2.spark
df.write.format("orc").mode("overwrite").saveAsTable("t1")
df.write.format("orc").mode("overwrite").saveAsTable("t2")
sql("SET spark.sql.orc.enabled=true")
spark.time(sql("SELECT COUNT(*) FROM t1").collect)
sql("SET spark.sql.orc.enabled=false")
spark.time(sql("SELECT COUNT(*) FROM t2").collect)
res21: org.apache.spark.sql.DataFrame = [key: string, value: string]
Time taken: 404 ms
res22: Array[org.apache.spark.sql.Row] = Array([100000182])
res24: org.apache.spark.sql.DataFrame = [key: string, value: string]
Time taken: 4333 ms
res25: Array[org.apache.spark.sql.Row] = Array([100000182])
3.4 How can I create a table using only the new ORC file format?
%spark2.spark
sql("DROP TABLE IF EXISTS o1")
sql("CREATE TABLE o1 USING `org.apache.spark.sql.execution.datasources.orc` AS SELECT * FROM t1")
sql("SET spark.sql.orc.enabled=false")
spark.time(sql("SELECT COUNT(*) FROM o1").collect)
res26: org.apache.spark.sql.DataFrame = []
res27: org.apache.spark.sql.DataFrame = []
res28: org.apache.spark.sql.DataFrame = [key: string, value: string]
Time taken: 213 ms
res29: Array[org.apache.spark.sql.Row] = Array([100000182])
3.5 Do you want to read existing Hive tables created with `STORED AS ORC`? Here is `spark.sql.hive.convertMetastoreOrc`.
%spark2.spark
sql("DROP TABLE IF EXISTS h1")
sql("CREATE TABLE h1 STORED AS ORC AS SELECT * FROM t1")
sql("SET spark.sql.hive.convertMetastoreOrc=true")
sql("SET spark.sql.orc.enabled=true")
spark.time(sql("SELECT COUNT(*) FROM h1").collect)
res30: org.apache.spark.sql.DataFrame = []
res31: org.apache.spark.sql.DataFrame = []
res33: org.apache.spark.sql.DataFrame = [key: string, value: string]
res34: org.apache.spark.sql.DataFrame = [key: string, value: string]
Time taken: 227 ms
res35: Array[org.apache.spark.sql.Row] = Array([100000182])
3.6 ORC Configuration
To utilize the new ORC file format, there are a few more ORC configurations you should turn on. The following is a summary of the recommended ORC configurations in HDP 2.6.3 and above (a combined example follows the list).
spark.sql.orc.enabled=true enables the new ORC format to read/write data source tables and files.
spark.sql.hive.convertMetastoreOrc=true enables the new ORC format to read/write Hive tables.
spark.sql.orc.filterPushdown=true enables filter pushdown for the ORC format.
spark.sql.orc.char.enabled=true enables the new ORC format to use CHAR types when reading Hive tables. By default, STRING types are used for performance reasons.
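For convenience, a minimal sketch of turning all of them on from a Spark session (mirroring the reset snippet in the Appendix):
%spark2.spark
// Recommended ORC settings for HDP 2.6.3+ (technical preview).
sql("SET spark.sql.orc.enabled=true")
sql("SET spark.sql.hive.convertMetastoreOrc=true")
sql("SET spark.sql.orc.filterPushdown=true")
sql("SET spark.sql.orc.char.enabled=true")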
4. More features and limitations
4.1 Fixed Apache Spark issues
SPARK-14387 Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
SPARK-16060 Vectorized Orc Reader
SPARK-16628 OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files
SPARK-18355 Spark SQL fails to read data from a ORC hive table that has a new column added to it
SPARK-19809 NullPointerException on empty ORC file
SPARK-20682 Support a new faster ORC data source based on Apache ORC
SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core
SPARK-21422 Depend on Apache ORC 1.4.0
SPARK-21477 Mark LocalTableScanExec’s input data transient
SPARK-21791 ORC should support column names with dot
SPARK-21787 Support for pushing down filters for date types in ORC
SPARK-21831 Remove spark.sql.hive.convertMetastoreOrc config in HiveCompatibilitySuite
SPARK-21912 ORC/Parquet table should not create invalid column names
SPARK-21929 Support ALTER TABLE table_name ADD COLUMNS(..) for ORC data source
SPARK-22146 FileNotFoundException while reading ORC files containing special characters
SPARK-22158 convertMetastore should not ignore table property
SPARK-22300 Update ORC to 1.4.1
4.2 Limitations
Schema evolution and schema merging are not officially supported yet (SPARK-11412). Apache Spark's vectorization can be used with schemas consisting of primitive types; for more complex schemas, Spark falls back to the non-vectorized reader. Old ORC files may contain incorrect information inside TIMESTAMP columns, and filter pushdown will be ignored for those old ORC files.
5. Conclusion
HDP 2.6.3 provides a powerful combination of Apache Spark 2.2 and Apache ORC 1.4.1 as a technical preview. In the Apache Spark community, SPARK-20901, Feature parity for ORC with Parquet, is still an ongoing effort. We are looking forward to seeing more improvements in Apache Spark 2.3.
Reference
ZEPPELIN NOTEBOOK for this article.
PERFORMANCE UPDATES: WHEN APACHE ORC MET APACHE SPARK, DataWorks Summit 2017 Sydney, Sep. 20-21
Appendix - How to reset to default options
%spark2.spark
sql("SET spark.sql.hive.convertMetastoreOrc=false")
sql("SET spark.sql.orc.enabled=false")
sql("SET spark.sql.orc.filterPushdown=false")
sql("SET spark.sql.orc.char.enabled=false")
res36: org.apache.spark.sql.DataFrame = [key: string, value: string]
res37: org.apache.spark.sql.DataFrame = [key: string, value: string]
res38: org.apache.spark.sql.DataFrame = [key: string, value: string]
res39: org.apache.spark.sql.DataFrame = [key: string, value: string]
11-09-2017
04:42 PM
Please create a Hive table on top of those Parquet files. If Hive can access them securely with Ranger, Spark can too, via SPARK-LLAP.
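A minimal sketch of such a table definition (the column list, table name, and location are hypothetical placeholders); run it on the Hive side so that Ranger policies govern the table:
// HiveQL/Spark SQL DDL exposing existing Parquet files as an external table.
sql("""CREATE EXTERNAL TABLE my_parquet_table (id BIGINT, name STRING)
       STORED AS PARQUET
       LOCATION '/data/my_parquet_dir'""")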
11-08-2017
06:39 PM
1 Kudo
Could you try SPARK-LLAP? It uses Hive LLAP and Ranger inside Spark. See: Row/Column-level Security in SQL for Apache Spark
11-01-2017
07:16 PM
Unfortunately, that's Spark 2.1.x behavior; you need to use Hive. BTW, which `ALTER TABLE` statement do you need? In HDP 2.6.3, Spark 2.2 supports `ALTER TABLE ADD COLUMNS` via the following two issues. - https://issues.apache.org/jira/browse/SPARK-19261 - https://issues.apache.org/jira/browse/SPARK-21929
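A minimal sketch of the supported statement on HDP 2.6.3 / Spark 2.2 (table and column names are hypothetical):
// ALTER TABLE ... ADD COLUMNS for data source tables (SPARK-19261), including ORC (SPARK-21929).
sql("ALTER TABLE my_table ADD COLUMNS (new_col STRING)")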
10-05-2017
06:29 PM
1 Kudo
That is a valid warning. The old Hive ORC writer doesn't save the correct schema into ORC files; it writes dummy column names like `_col1`, and you are reading such an old ORC file. If you generate a new ORC file with Hive 2, you will not see that warning.
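You can check this yourself; a minimal sketch (the path is a hypothetical placeholder):
// Reading a file written by the old Hive ORC writer shows dummy physical column
// names such as _col0, _col1 instead of the real Hive column names.
spark.read.format("orc").load("/apps/hive/warehouse/old_orc_table").printSchema()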
09-19-2017
06:29 PM
Hi, @chenhao chenhao . As you can see in the image, the error comes from the Hive side: "Invalid function `get_splits`". You seem to be using an old version of Hive. Please use the correct version of Hive in HDP.
09-15-2017
07:24 PM
If you want to write a single file, could you try repartitioning `new_df` before registering the temp table? new_df.repartition(1).registerTempTable("new_df")... Depending on your situation, you may choose a different number of partitions instead of 1.
09-05-2017
07:04 PM
Hi, @Saurabh Did you do `CREATE FUNCTION`?
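For reference, a minimal sketch of registering a permanent function (the class name and jar path are hypothetical placeholders):
// Registers a permanent UDF in the metastore so later Spark/Hive sessions can see it.
sql("CREATE FUNCTION my_udf AS 'com.example.MyUDF' USING JAR '/path/to/my_udf.jar'")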
08-24-2017
06:45 PM
Regarding your emails, I already answered your questions personally, and the Hortonworks support team can help you with further professional advice and assistance.
08-23-2017
01:09 AM
That's an assembly jar location for Spark `--packages`. If you want a standalone jar for an environment without an internet connection, please look at this: http://repo.hortonworks.com/content/groups/public/com/hortonworks/spark/spark-llap_2.11/1.1.3-2.1/
07-12-2017
07:39 PM
How long does it take? I'm wondering if you could give us the numbers, for example, for your query and a simplified version of it like the following. val q1 = finalDF.groupBy($"Dseq", $"FmNum", $"yrs", $"mnt", $"FromDnsty").agg(...) // Your query
val q2 = finalDF.groupBy($"Dseq", $"FmNum", $"yrs", $"mnt", $"FromDnsty").agg(count($"Dseq"), avg($"Emp"), sum("Ss")) // A simplified version of your query
06-29-2017
06:29 PM
Hi, could you print the partitions like this? A Snappy-compressed Parquet file is splittable by range. Usually, Spark splits a large Snappy-compressed Parquet file into multiple partitions, each bounded by spark.sql.files.maxPartitionBytes. spark.read.parquet("/output/xxx.snappy.parquet").rdd.partitions.foreach(print)
06-23-2017
07:17 PM
Sorry for the late response. Is your cluster connected to the internet so it can download the jar file? What error did you see?
06-19-2017
06:34 PM
In general, `Custom spark2-default`. And `Custom spark2-thrift-sparkconf` if you want to set it up for the Spark Thrift Server.
05-04-2017
08:42 PM
12 Kudos
1. Goal
Security is one of the fundamental features for enterprise adoption. Specifically, for SQL users, row/column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups in different ways, such as Apache Spark™ 1.6/2.1 and Apache Hive, it is difficult to guarantee access control in a consistent way. In this article, we show how to use Apache Ranger™ to manage access control policies for Spark SQL. This enhancement is done by integrating Apache Ranger and Apache Spark via `spark-llap`. We also show how to provide finer-grained access control (row/column-level filtering and column masking) to Apache Spark.
2. Key Features
Shared Policies: The data in a cluster can be shared securely and consistently, controlled by the shared access rules between Apache Spark and Apache Hive.
Audits: All security activities can be monitored and searched in a single place, i.e., Apache Ranger.
Resource Management: Each user can use different queues while accessing the securely shared Hive data.
3. Environment
HDP 2.6.1.0 with Spark2, Hive, and Ranger on a Kerberized cluster
SPARK-LLAP: 1.1.3-2.1
4. Assumptions
4.1. HDFS Permission
For all use cases, make sure that the permission of the Hive warehouse is 700, which means normal users are unable to access the secured tables directly. In addition, make sure that `hive.warehouse.subdir.inherit.perms=true`. With this, newly created tables will inherit the 700 permission by default.
$ hadoop fs -ls /apps/hive
Found 1 items
drwx------ - hive hdfs 0 2017-07-10 17:04 /apps/hive/warehouse
4.2. Interactive Query
Hive Interactive Query needs to be enabled.
4.3. Ranger Hive Plugin
The Hive Ranger plugin needs to be enabled.
4.4. User Principals
In this article, we will use two user principals, `billing` and `datascience`. The `billing` principal can access all rows and columns, while the `datascience` principal can access only filtered and masked data. You can use the `hive` and `spark` principals instead.
5. Configurations
5.1. Find the existing configuration of your cluster
First of all, find the following five values from Hive configuration via Ambari for your cluster.
HiveServer2 Interactive Host name
You need a host name instead of the ZooKeeper-based JDBC URL.
Spark Thrift Server Host
This value will be used during running example code.
The value for hive.llap.daemon.service.hosts
The value for hive.zookeeper.quorum
The value for hive.server2.authentication.kerberos.principal
5.2. Setup HDFS
Set the following parameter.
`Custom core-site` → `hadoop.proxyuser.hive.hosts=*`
Setting it to `*` allows all hosts to submit Spark jobs with SPARK-LLAP.
5.3. Setup Hive
Set the following two parameters.
`Custom hive-interactive-site`: these take the same values as `hive.llap.daemon.keytab.file` and `hive.llap.daemon.service.principal` in the same `Custom hive-interactive-site`.
hive.llap.task.keytab.file=/etc/security/keytabs/hive.service.keytab
hive.llap.task.principal=hive/_HOST@EXAMPLE.COM
5.4. Setup Spark2
For Spark2 shells (spark-shell, pyspark, sparkR, spark-sql) and applications, set up the following configurations via Ambari and restart the required services.
`Custom spark2-default`
spark.hadoop.hive.llap.daemon.service.hosts=the value of hive.llap.daemon.service.hosts
spark.hadoop.hive.zookeeper.quorum=The value of hive.zookeeper.quorum
spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://YourHiveServer2HostName:10500/
spark.sql.hive.hiveserver2.jdbc.url.principal=The value of hive.server2.authentication.kerberos.principal
5.5. Setup Spark2 Thrift Server
For the Spark2 Thrift Server, set up the following configurations via Ambari and restart the required services.
`Advanced spark2-env` → `spark_thrift_cmd_opts`
--packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
Note that this is one line.
`Custom spark2-thrift-sparkconf` Note that `spark.sql.hive.hiveserver2.jdbc.url` additionally has `;hive.server2.proxy.user=${user}` for impersonation.
spark.hadoop.hive.llap.daemon.service.hosts=the value of hive.llap.daemon.service.hosts
spark.hadoop.hive.zookeeper.quorum=The value of hive.zookeeper.quorum
spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://YourHiveServer2HostName:10500/;hive.server2.proxy.user=${user}
spark.sql.hive.hiveserver2.jdbc.url.principal=The value of hive.server2.authentication.kerberos.principal
6. Prepare database
Since normal users are unable to access Hive databases and tables so far due to the HDFS permissions, use the Beeline CLI with the `hive` principal to prepare a database and a table for the users who will use Spark-LLAP. Please note that the following is an example schema setup for the demo scenario in this article. For audit-only scenarios, you don't need to create new databases and tables; instead, you can make all existing databases and tables accessible to all users.
$ kdestroy
$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/`hostname -f`@EXAMPLE.COM
$ beeline -u "jdbc:hive2://YourHiveServerHost:10500/default;principal=hive/_HOST@EXAMPLE.COM" -e "CREATE DATABASE db_spark; USE db_spark; CREATE TABLE t_spark(name STRING, gender STRING); INSERT INTO t_spark VALUES ('Barack Obama', 'M'), ('Michelle Obama', 'F'), ('Hillary Clinton', 'F'), ('Donald Trump', 'M');"
7. Security Policies
To use fine-grained access control, you need to set up Apache Ranger™ policies, which govern Spark and Hive together seamlessly from a single control center.
7.1. Ranger Admin UI
Open the `Ranger Admin UI`. The default login information is `admin/admin`. After login, the `Access Manager` page shows nine service managers. The following screenshot shows that HDFS / Hive / YARN policies exist. Since Spark shares the same policies with Hive, visit `Hive` among the service managers.
On the Hive Service Manager page, there are three tabs corresponding to three types of policies: `Access`, `Masking`, and `Row Level Filter`.
7.2. Example Policies
Let’s make some policies for a user to access some rows and columns.
For examples,
Both the `billing` and `datascience` principals can access the `db_spark` database. Other databases are not allowed by default due to the HDFS permissions.
The `billing` principal can see all rows and columns of the `t_spark` table in the `db_spark` database.
The `datascience` principal can see only the first 4 characters of the `name` field of the `t_spark` table in the `db_spark` database; the rest of the `name` field is masked.
The `datascience` principal can see only the rows matching the `gender='M'` predicate.
7.2.1. Access policy in db_spark database
Name | Table | Column | Select User | Permissions
---|---|---|---|---
spark_access | t_spark | * | billing | Select
spark_access | t_spark | * | datascience | Select
7.2.2. Masking policy in db_spark database
Name | Table | Column | Select User | Access Types | Select Masking Option
---|---|---|---|---|---
spark_mask | t_spark | name | datascience | Select | partial mask: 'show first 4'
7.2.3. Row Level Filter policy in db_spark database
Name | Table | Access Types | Row Level Filter
---|---|---|---
spark_filter | t_spark | Select | gender='M'
7.2.4. HDFS policy for spark
In the HDFS Ranger plugin, add a rule `spark_tmp` to allow all access to `/tmp`.
8. Target Scenarios
Case 1: Secure Spark Thrift Server
A user can access the Spark Thrift Server via beeline or Apache Zeppelin™. First, based on the Kerberos principal, the user can see only the data they are allowed to access.
$ kdestroy
$ kinit billing/billing@EXAMPLE.COM
$ beeline -u "jdbc:hive2://YourSparkThriftServer:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM" -e 'select * from db_spark.t_spark'
+------------------+---------+--+
| name | gender |
+------------------+---------+--+
| Barack Obama | M |
| Michelle Obama | F |
| Hillary Clinton | F |
| Donald Trump | M |
+------------------+---------+--+
$ kdestroy
$ kinit datascience/datascience@EXAMPLE.COM
$ beeline -u "jdbc:hive2://YourSparkThriftServer:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM" -e 'select * from db_spark.t_spark'
+---------------+---------+--+
| name | gender |
+---------------+---------+--+
| Baraxx xxxxx | M |
| Donaxx xxxxx | M |
+---------------+---------+--+
Second, in the case of Zeppelin, a proxy user name is used. You can watch this YouTube demo to see how Zeppelin works. The following example illustrates the usage of a proxy user name with beeline; Zeppelin does the same thing via JDBC under the hood.
$ kdestroy
$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/`hostname -f`@EXAMPLE.COM
$ beeline -u "jdbc:hive2://YourSparkThriftServerHost:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM;hive.server2.proxy.user=billing" -e "select * from db_spark.t_spark"
+------------------+---------+--+
| name | gender |
+------------------+---------+--+
| Barack Obama | M |
| Michelle Obama | F |
| Hillary Clinton | F |
| Donald Trump | M |
+------------------+---------+--+
$ beeline -u "jdbc:hive2://YourSparkThriftServerHost:10016/db_spark;principal=hive/_HOST@EXAMPLE.COM;hive.server2.proxy.user=datascience" -e "select * from db_spark.t_spark"
+---------------+---------+--+
| name | gender |
+---------------+---------+--+
| Baraxx xxxxx | M |
| Donaxx xxxxx | M |
+---------------+---------+--+
Case 2: Shells
A user can run `spark-shell` or `pyspark` like the following. Please note that the user can access their own data sources in addition to the secure data source provided by `spark-llap`. For the next example, log in as the user `spark`.
spark-shell $ kdestroy
$ kinit billing/billing@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 spark-shell --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
scala> sql("select * from db_spark.t_spark").show
+---------------+------+
| name|gender|
+---------------+------+
| Barack Obama| M|
| Michelle Obama| F|
|Hillary Clinton| F|
| Donald Trump| M|
+---------------+------+
$ kdestroy
$ kinit datascience/datascience@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 spark-shell --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
scala> sql("select * from db_spark.t_spark").show
+------------+------+
| name|gender|
+------------+------+
|Baraxx xxxxx| M|
|Donaxx xxxxx| M|
+------------+------+
pyspark $ kdestroy
$ kinit billing/billing@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 pyspark --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
>>> sql("select * from db_spark.t_spark").show()
+---------------+------+
| name|gender|
+---------------+------+
| Barack Obama| M|
| Michelle Obama| F|
|Hillary Clinton| F|
| Donald Trump| M|
+---------------+------+
$ kdestroy
$ kinit datascience/datascience@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 pyspark --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
>>> sql("select * from db_spark.t_spark").show()
+------------+------+
| name|gender|
+------------+------+
|Baraxx xxxxx| M|
|Donaxx xxxxx| M|
+------------+------+
sparkR $ kdestroy
$ kinit billing/billing@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 sparkR --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
> head(sql("select * from db_spark.t_spark"))
name gender
1 Barack Obama M
2 Michelle Obama F
3 Hillary Clinton F
4 Donald Trump M
$ kdestroy
$ kinit datascience/datascience@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 sparkR --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
> head(sql("select * from db_spark.t_spark"))
name gender
1 Baraxx xxxxx M
2 Donaxx xxxxx M
spark-sql $ kdestroy
$ kinit billing/billing@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 spark-sql --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
spark-sql> select * from db_spark.t_spark;
Barack Obama	M
Michelle Obama	F
Hillary Clinton	F
Donald Trump	M
$ kdestroy
$ kinit datascience/datascience@EXAMPLE.COM
$ SPARK_MAJOR_VERSION=2 spark-sql --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --conf spark.sql.hive.llap=true
spark-sql> select * from db_spark.t_spark;
Baraxx xxxxx	M
Donaxx xxxxx	M
Case 3: Applications
A user can submit a Spark job like the following. As in the `spark-shell` scenario, the user can access their own data sources in addition to the secure data source provided by `spark-llap`.
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Spark LLAP SQL Python") \
.enableHiveSupport() \
.getOrCreate()
spark.sql("show databases").show()
spark.sql("select * from db_spark.t_spark").show()
spark.stop()
Launch the app in YARN client mode.
SPARK_MAJOR_VERSION=2 spark-submit --packages com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1 --repositories http://repo.hortonworks.com/content/groups/public --master yarn --deploy-mode client --conf spark.sql.hive.llap=true spark_llap_sql.py
YARN cluster mode will be supported in the next HDP release. You can download the example at spark_llap_sql.py, too.
Appendix
Support Matrix
Please refer to https://github.com/hortonworks-spark/spark-llap/wiki/7.-Support-Matrix
Known Issues
Warning logs on CancellationException
If you see warning messages like the following in Spark shells, you can turn them off via `sc.setLogLevel` or `conf/log4j.properties`.
scala> sql("select * from db_spark.t_spark").show
...
17/03/09 22:06:26 WARN TaskSetManager: Stage 5 contains a task of very large size (248 KB). The maximum recommended task size is 100 KB.
17/03/09 22:06:27 WARN LlapProtocolClientProxy: RequestManager shutdown with error
java.util.concurrent.CancellationException
...
scala> sc.setLogLevel("ERROR")
scala> sql("select * from db_spark.t_spark").show
...
03-17-2017
06:41 PM
1 Kudo
In Ambari, visit `Spark` -> `Configs` -> `Custom spark-hive-site-override` and add the following for Spark:
hive.mapred.supports.subdirectories=true
Then spark-shell works like the following.
scala> sql("set hive.mapred.supports.subdirectories").show(false)
+-----------------------------------+-----+
|key |value|
+-----------------------------------+-----+
|hive.mapred.supports.subdirectories|true |
+-----------------------------------+-----+
03-17-2017
06:30 PM
1 Kudo
Could you try the following? spark-sql --hiveconf hive.mapred.supports.subdirectories=true