Created 05-16-2016 01:59 PM
Hi, I am using Spark Sql(ver 1.5.2) to read data from hive tables. It took 21 min to get the results where as the same query in hive cli took only 25 sec. Below are my HDP cluster details and spark query execution details.Please help me on this
Cluster Details:
HDP 2.3
30 Nodes(each node has 251 GB RAM)
10TB Memory
2610 Vcores
Spark SQL execution --> spark-sql --master yarn-client --num-executors 40 --executor-cores 5 --executor-memory 4G
Created 05-16-2016 02:01 PM
Also, please share the best practices on how to use spark sql configurations properties
Created 05-16-2016 02:35 PM
One thing is that Hive on Tez is in general significantly faster than Spark SQL. It has had the basic efficiencies that Spark has just added ( Tungsten ... ) for a long time.
However I see some problems in your configuration too:
Hive/Tez can take the whole cluster. Which has 30*250GB RAM = 7500GB. You only give 40*4 = 160GB to Spark. 40 executors on 30 nodes do not make any sense at all in any case. Do 30 or 60 or 90 or 120. The best practice was to make executors large but not too large. 16-32GB is a good size apparently. Less results in more inter process overhead more results in GC problems.
Do you know how many map/reduce tasks you get when the query is run in Tez? You should give Spark at least the same resources to be fair. Also play around with the cores Spark uses them to decide parallellity.
Also Spark may have some problems with Partitioning/Predicate Pushdown features Hive/Tez supports. ( Not sure about the current state of support for these or if you use them )
Finally There is the question of optimization. A wrong join type and your query can be 100x slower or faster. The joy of comparing database performance. Hard to give general tips here.
Created 05-16-2016 02:46 PM
We had a comparative run between hive on tez and spark sql and have run into multiple outliers on sparksql that took a long time. Are you seeing these issues with a single query or have run into this on multiple runs?
If you are using ORC, you can set spark.sql.orc.filterPushdown to true. You can also try increasing executor memory. But you need to look at logs to see where its taking this time and if there are any GC issues.
Created 05-16-2016 02:54 PM
Make sure you have tungsten enabled. Take a look at the history after the run.
sparkConf.set("spark.cores.max", "24") sparkConf.set("spark.serializer", classOf[KryoSerializer].getName) sparkConf.set("spark.sql.tungsten.enabled", "true") sparkConf.set("spark.eventLog.enabled", "true") sparkConf.set("spark.app.id", "YourApp") sparkConf.set("spark.io.compression.codec", "snappy") sparkConf.set("spark.rdd.compress", "true") sparkConf.set("spark.streaming.backpressure.enabled", "true") sparkConf.set("spark.sql.parquet.compression.codec", "snappy") sparkConf.set("spark.sql.parquet.mergeSchema", "true") sparkConf.set("spark.sql.parquet.binaryAsString", "true") val sc = new SparkContext(sparkConf) sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
What does the query look like?
What else is running in YARN?
What kind of tables? Joins?
If you look at the Spark History UI and you can look at what exactly spark was doing. Make sure eventlog is enabled.
Take a look at the Spark logs, might be an error in there.
In the latest version of HDP you get Spark 1.6 which has some performance enhancements.
Created 07-12-2017 07:26 PM
Have you tried Caching the tables (subset) before executing the queries?
Keep in mind, when doing caching on a DataFrame it is Lazy caching which means it will only cache what rows are used in the next processing event. So if you do a query on that DataFrame and only scan 100 rows, those will only be cached, not the entire table. If you do CACHE TABLE MyTableName in SQL though, it is defaulted to be eager caching and will cache the entire table. You can choose LAZY caching in SQL like so:
CACHE LAZY TABLE Sales_Data_1998