Spark SQL performance compared to Hive on ORC table

New Contributor

Hello,

I have a simple query against an ORC table that selects a relatively small number of rows from a 10-billion-row table. The query is of this form:

select * from <table> where <col>=<value>

On Hive using Tez, it runs in a few seconds. However, using Spark SQL, it takes about 5 minutes. From everything I can see, Spark appears to be scanning the entire table. I've even set spark.sql.orc.filterPushdown=true, but it doesn't help.
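
One way to check whether the predicate actually reaches the ORC reader is to look at the physical plan for a `PushedFilters` entry. The following is only a rough sketch: `pushed_filters` and `explain_query` are hypothetical helper names, `my_table` and `my_col` are placeholders, and the `spark` argument is assumed to be an existing SparkSession with Hive support.

```python
import re


def pushed_filters(plan_text):
    """Return the contents of every 'PushedFilters: [...]' entry in a plan string.

    An empty list means the plan text contains no PushedFilters entry at all.
    """
    return re.findall(r"PushedFilters: \[(.*?)\]", plan_text)


def explain_query(spark, query):
    """Return the physical plan of `query` as text.

    `spark` is assumed to be a live SparkSession; EXPLAIN returns a
    one-row DataFrame whose single column holds the plan string.
    """
    return spark.sql("EXPLAIN " + query).collect()[0][0]


def check_query(spark, query):
    """Print the plan and whatever filters were pushed to the data source."""
    plan = explain_query(spark, query)
    print(plan)
    print("pushed filters:", pushed_filters(plan))
```

From a `pyspark` shell this could be called as `check_query(spark, "SELECT * FROM my_table WHERE my_col = 42")`. If nothing is reported as pushed, the filter is most likely being applied only after the rows have been read.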

Is it reasonable to expect that Spark SQL's performance should be close to that of Hive's?

I'm running HDP 2.6.0.3 using Spark 2.1.0.

Thanks,

Jerrell

2 REPLIES

Expert Contributor

Hi, @Jerrell Schivers .

Unfortunately, the slowdown is expected: Spark 2.1's ORC reader lacks vectorization support. The upcoming Apache Spark 2.3 adds it (https://issues.apache.org/jira/browse/SPARK-16060).

However, you can already try it in HDP 2.6.3 with Spark 2.2. Please refer to the following document.

https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
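
For reference, here is a minimal sketch of how the ORC-related settings might be collected and applied once you are on a build that ships the new reader. The property names are the ones documented for Apache Spark 2.3; whether the HDP 2.6.3 build of Spark 2.2 uses exactly the same names is an assumption, so please check them against the article above. `to_submit_args` and `apply_settings` are hypothetical helpers.

```python
# Property names as documented for Apache Spark 2.3. Treat them as an
# assumption for the HDP 2.6.3 / Spark 2.2 build and verify before use.
ORC_SETTINGS = {
    "spark.sql.orc.impl": "native",                  # use the new ORC reader
    "spark.sql.orc.enableVectorizedReader": "true",  # vectorized reads
    "spark.sql.orc.filterPushdown": "true",          # push predicates to ORC
    "spark.sql.hive.convertMetastoreOrc": "true",    # apply to Hive ORC tables
}


def to_submit_args(settings):
    """Turn a settings dict into a flat list of spark-submit '--conf' arguments."""
    args = []
    for key in sorted(settings):
        args.extend(["--conf", "{}={}".format(key, settings[key])])
    return args


def apply_settings(spark, settings):
    """Set each property on an existing SparkSession passed in as `spark`."""
    for key, value in settings.items():
        spark.conf.set(key, value)
```

`" ".join(to_submit_args(ORC_SETTINGS))` gives a string that can be pasted after `spark-submit`, while `apply_settings(spark, ORC_SETTINGS)` does the same for a session that is already running.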

New Contributor

Thanks @Dongjoon Hyun for your reply. We'll hopefully be upgrading to HDP 2.6.3 in the near future and will be able to take advantage of the new speed improvements.
