Spark SQL performance compared to Hive on an ORC table

New Contributor


I have a simple query of an ORC table which selects a relatively small number of rows from a 10 billion row table. The query is of this form:

select * from <table> where <col>=<value>

On Hive using Tez it runs in a few seconds. However, using Spark SQL it takes about 5 minutes. Based on everything I see it sure seems like Spark is sweeping through the entire table. I've even set spark.sql.orc.filterPushdown=true, but it doesn't help.
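
One way to check whether the filter actually reaches the ORC reader is to look at the physical plan. The statements below are only a sketch: `events` and `user_id` are hypothetical stand-ins for the real table and column, and the exact plan text varies between Spark versions.

```sql
-- Hypothetical table and column names, used only for illustration.
-- Ask Spark SQL to push predicates down to the ORC reader.
SET spark.sql.orc.filterPushdown=true;

-- Read Hive metastore ORC tables through Spark's own ORC data source
-- rather than the Hive SerDe path (this is off by default in Spark 2.1).
SET spark.sql.hive.convertMetastoreOrc=true;

-- Print the physical plan without running the query.
EXPLAIN SELECT * FROM events WHERE user_id = 42;
```

If the plan shows a `HiveTableScan` node, the table is being read through the Hive SerDe path, where the pushdown setting may have no effect. A `FileScan orc` node with a non-empty `PushedFilters` list suggests the predicate is being handed to the ORC reader.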

Is it reasonable to expect that Spark SQL's performance should be close to that of Hive's?

I'm running HDP using Spark 2.1.0.




Expert Contributor

Hi, @Jerrell Schivers .

Unfortunately, the slower performance is expected: the ORC reader in Spark 2.1 lacks vectorization support. The upcoming Apache Spark 2.3 adds it.

However, you can try it in HDP 2.6.3 with Spark 2.2. Please refer to the following document.
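
For orientation, these are the session settings that control the vectorized ORC reader in Apache Spark 2.3. This is a sketch under the assumption of a Spark 2.3 session; the HDP 2.6.3 build of Spark 2.2 may use different property names, so the HDP document remains the authority for that release.

```sql
-- Assumes Apache Spark 2.3; property names may differ in HDP 2.6.3 / Spark 2.2.
-- Use the native ORC implementation instead of the Hive-based one.
SET spark.sql.orc.impl=native;

-- Enable vectorized (batch-at-a-time) decoding in the native ORC reader.
SET spark.sql.orc.enableVectorizedReader=true;

-- Route Hive metastore ORC tables through Spark's ORC data source.
SET spark.sql.hive.convertMetastoreOrc=true;

-- Push filter predicates down to the ORC reader.
SET spark.sql.orc.filterPushdown=true;
```

The same properties can also be passed at launch with `--conf`, which keeps them out of individual queries.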

New Contributor

Thanks @Dongjoon Hyun for your reply. We'll hopefully be upgrading to HDP 2.6.3 in the near future and will be able to take advantage of the new speed improvements.