Hello everyone, the scenario is the following.
I have a Spark SQL program that performs an ETL process on several Hive tables. These tables were imported from a Teradata database using Sqoop as raw text with Snappy compression (unfortunately, the Avro format does not work with the Teradata connector). The Spark SQL process takes about 1 hour and 15 minutes to complete.
To improve performance, I decided to convert the tables to a more efficient format such as Parquet before running the Spark SQL process. According to the documentation and online discussions, this should give a significant boost compared with raw text (even when compressed with Snappy, which is not splittable). So I converted all the Hive tables to Parquet format with Snappy compression and launched the Spark SQL process on these tables with the same settings (num-executors, driver-memory, executor-memory). The process finished in 1 hour and 20 minutes.
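For reference, a conversion of this kind could be sketched in Spark itself roughly as follows. This is only an illustration under assumptions: the helper name convertToParquet and the _parquet table suffix are hypothetical, and it assumes a Spark 1.x HiveContext like the sqc used in the program.

```scala
import org.apache.spark.sql.{SQLContext, SaveMode}

object ParquetConversionSketch {
  // Hypothetical helper: copies one Hive table into a Parquet-backed table.
  // `sqc` is assumed to be a HiveContext so that Hive tables are visible.
  def convertToParquet(sqc: SQLContext, db: String, table: String): Unit = {
    // Ask Spark to compress the Parquet files it writes with Snappy.
    sqc.setConf("spark.sql.parquet.compression.codec", "snappy")

    sqc.read.table(s"$db.$table")      // read the raw-text Hive table
      .write
      .format("parquet")               // store the copy as Parquet
      .mode(SaveMode.Overwrite)        // replace any previous copy
      .saveAsTable(s"$db.${table}_parquet") // hypothetical target name
  }
}
```

It would be called once per table, for example ParquetConversionSketch.convertToParquet(sqc, swamp_db, "DBU_vcliff").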
This was very surprising to me. I didn't expect the 30x boost I had read about in some discussions, but I was certainly expecting some improvement.
These are the types of operations performed in the SparkSQL program:
var vcliff = sqc.read.table(s"$swamp_db.DBU_vcliff")
var vtktdoc = sqc.read.table(s"$swamp_db.DBU_vtktdoc")
var vasccrmtkt = sqc.read.table(s"$swamp_db.DBU_vasccrmtkt")
val numPartitions = 7 * 16
var ORI_TktVCRAgency = sqc.sql(
s"""
| SELECT tic.CODCLI
| FROM $swamp_db.DBU_vlocpos vloc
| LEFT JOIN $swamp_db.DBU_vcomorghiemktgeo vcom ON vloc.codtypthr = vcom.codtypthr
|     AND vloc.codthr = vcom.codthr
| LEFT JOIN TicketDocCrm tic ON tic.codvdt7 = vloc.codthr
| LEFT JOIN vcliff vc ON vc.codcli = tic.codcli
| LEFT JOIN $swamp_db.DBU_vclieml vcli ON vc.codcli = vcli.codcli
""".stripMargin)
Can anyone give me a hint as to why I am not getting any performance improvement after switching from raw text to Parquet?