Spark SQL performance compared to Hive on ORC table

New Contributor

Hello,

I have a simple query against an ORC table that selects a relatively small number of rows from a 10-billion-row table. The query has this form:

select * from <table> where <col>=<value>

On Hive with Tez it runs in a few seconds, but with Spark SQL it takes about 5 minutes. From everything I can see, Spark appears to be scanning the entire table. I've even set spark.sql.orc.filterPushdown=true, but it doesn't help.

Is it reasonable to expect that Spark SQL's performance should be close to that of Hive's?

I'm running HDP 2.6.0.3 using Spark 2.1.0.

Thanks,

Jerrell

2 REPLIES

Expert Contributor

Hi, @Jerrell Schivers.

Unfortunately, the slowness you're seeing is expected: Spark 2.1 lacks a vectorized ORC reader. The upcoming Apache Spark 2.3 adds one (https://issues.apache.org/jira/browse/SPARK-16060).

However, you can try it out in HDP 2.6.3 with Spark 2.2. Please refer to the following document.

https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
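
For reference, here is a rough sketch of how the relevant settings might be passed on the command line once you are on Apache Spark 2.3. The property names are the upstream Spark 2.3 ones (spark.sql.orc.impl, spark.sql.orc.enableVectorizedReader, spark.sql.hive.convertMetastoreOrc) and are an assumption for HDP builds: HDP 2.6.3 with Spark 2.2 may use different names, so check the article above. The helper to_conf_args is hypothetical and only renders the flags; it does not talk to Spark.

```python
# Assumed settings for the native ORC reader in upstream Apache Spark 2.3.
# HDP 2.6.3 / Spark 2.2 may use different property names (see the linked article).
NATIVE_ORC_SETTINGS = {
    "spark.sql.orc.impl": "native",                  # use the new ORC reader
    "spark.sql.orc.enableVectorizedReader": "true",  # vectorized reads
    "spark.sql.orc.filterPushdown": "true",          # push predicates down to ORC
    "spark.sql.hive.convertMetastoreOrc": "true",    # apply the reader to Hive metastore ORC tables
}


def to_conf_args(settings):
    """Render settings as a flat list of --conf key=value arguments (hypothetical helper)."""
    args = []
    for key, value in sorted(settings.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print a command line that could be pasted into a terminal.
    print(" ".join(["spark-shell"] + to_conf_args(NATIVE_ORC_SETTINGS)))
```

After starting a session this way, running the query with EXPLAIN is one way to check whether the filter on <col> is actually being pushed into the ORC scan.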

New Contributor

Thanks @Dongjoon Hyun for your reply. We'll hopefully be upgrading to HDP 2.6.3 in the near future and will be able to take advantage of the new speed improvements.