From raw Text format to Parquet: no performance boost

Hello everyone, here is the scenario.

I have a Spark SQL program that performs an ETL process on several Hive tables. These tables were imported from a Teradata database using Sqoop as raw text with Snappy compression (unfortunately the Avro format does not work with the Teradata connector). The Spark SQL process takes about 1 hour and 15 minutes to complete.

To improve performance, I decided to convert the tables to a more efficient format, Parquet, before running the Spark SQL process. According to the documentation and online discussions, this should bring a significant boost compared with raw text (even compressed with Snappy, which is not splittable). I therefore converted all the Hive tables to Parquet with Snappy compression and launched the Spark SQL process on them with the same settings (num-executors, driver-memory, executor-memory). The process finished in 1 hour and 20 minutes.
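
For context, a conversion of this kind could be sketched in Spark roughly as follows. This is an illustrative sketch only, not the exact job that was run: the `_parquet` suffix, the `parquetName` helper and passing the database name as the first argument are all assumptions, and it assumes a Spark 1.x `HiveContext` launched through spark-submit. It sets Spark's own `spark.sql.parquet.compression.codec` property to choose the codec.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

object ParquetConversionSketch {
  // Hypothetical naming rule for the Parquet copy of a table.
  def parquetName(table: String): String = s"${table}_parquet"

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc  = new SparkContext(new SparkConf().setAppName("parquet-conversion-sketch"))
    val sqc = new HiveContext(sc)

    // Codec Spark uses when it writes Parquet files.
    sqc.setConf("spark.sql.parquet.compression.codec", "snappy")

    // Assumption: the database name is passed as the first argument.
    sqc.sql(s"USE ${args(0)}")

    // Table names taken from the post; extend the list as needed.
    val tables = Seq("DBU_vcliff", "DBU_vtktdoc", "DBU_vasccrmtkt")

    tables.foreach { t =>
      sqc.read.table(t)            // read the text-backed Hive table
        .write
        .format("parquet")         // write the copy as Parquet
        .mode(SaveMode.Overwrite)  // replace any previous copy
        .saveAsTable(parquetName(t))
    }

    sc.stop()
  }
}
```

With a layout like this, the ETL job would then read the `_parquet` copies instead of the original tables.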

This was very surprising to me. I didn't expect the 30x boost I read about in some discussions, but I was certainly expecting some improvement.

These are the types of operations performed in the Spark SQL program:

sqc.sql("SET hive.exec.compress.output=true")
sqc.sql("SET parquet.compression=SNAPPY")


var vcliff = sqc.read.table(s"$swamp_db.DBU_vcliff")
var vtktdoc = sqc.read.table(s"$swamp_db.DBU_vtktdoc")
var vasccrmtkt = sqc.read.table(s"$swamp_db.DBU_vasccrmtkt")

val numPartitions = 7 * 16

// register the DataFrames as temp tables (this only names them, it does not cache them)
vcliff.registerTempTable("vcliff")
vtktdoc.registerTempTable("vtktdoc")
vasccrmtkt.registerTempTable("vasccrmtkt")


var ORI_TktVCRAgency = sqc.sql(
s"""
| SELECT tic.CODCLI,
| tic.CODARLPFX,
| tic.CODTKTNUM,
| tic.DATDOCISS,
| vloc.CODTHR,
| vloc.NAMCMPNAMTHR,
| vloc.CODAGNCTY,
| vloc.NAMCIT,
| vloc.NAMCOU,
| vloc.CODCOU,
| vloc.CODTYPTHR,
| vloc.CODZIP,
| vcom.CODCOMORGLEVDPC,
| vcom.DESCOMORGLEVDPC,
| vcom.CODCOMORGLEVRMX,
| vcom.DESCOMORGLEVRMX,
| vcom.CODCOMORGLEVSALUNT,
| vcom.CODPSECOMORGCTYLEVSALUNT,
| vcom.DESCOMORGLEVSALUNT,
| vcom.CODCOMORGLEVRPR,
| vcom.CODPSECOMORGCTYLEVRPR,
| vcom.DESCOMORGLEVRPR,
| vcom.CODCOMORGLEVCTYCNL,
| vcom.CODPSECOMORGCTYLEVCTYCNL,
| vcom.DESCOMORGLEVCTYCNL,
| vcom.CODCOMORGLEVUNT,
| vcom.CODPSECOMORGCTYLEVUNT,
| vcom.DESCOMORGLEVUNT,
| vcli.DESCNL
| FROM $swamp_db.DBU_vlocpos vloc
| LEFT JOIN $swamp_db.DBU_vcomorghiemktgeo vcom ON vloc.codtypthr = vcom.codtypthr
| AND vloc.codthr = vcom.codthr
| LEFT JOIN TicketDocCrm tic ON tic.codvdt7 = vloc.codthr
| LEFT JOIN vcliff vc ON vc.codcli = tic.codcli
| LEFT JOIN $swamp_db.DBU_vclieml vcli ON vc.codcli = vcli.codcli
""".stripMargin)

ORI_TktVCRAgency.registerTempTable("ORI_TktVCRAgency")


[...]

Can anyone give me a hint as to why I am not getting any performance improvement after switching from raw text to Parquet?
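
One way to narrow this down might be to time a plain scan of each table in both formats and compare that with the total run time. If the scans account for only a small share of the 75 minutes, the joins and shuffles are the more likely bottleneck, and the storage format alone would not change much. The code below is only a sketch: the `timed` helper and the idea of passing fully qualified table names as arguments are hypothetical, and it assumes a Spark 1.x `HiveContext` launched through spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ScanTimingSketch {
  // Runs a block and returns its result together with the elapsed milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start  = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc  = new SparkContext(new SparkConf().setAppName("scan-timing-sketch"))
    val sqc = new HiveContext(sc)

    // Assumption: args holds table names, e.g. the text table and its Parquet copy.
    args.foreach { table =>
      val df = sqc.read.table(table)
      // Counting through the RDD is meant to force whole rows to be read,
      // rather than letting the count be answered from a pruned scan.
      val (rows, ms) = timed(df.rdd.count())
      println(s"$table: $rows rows read in $ms ms")
    }

    sc.stop()
  }
}
```

Since this reads every column, it measures a worst case for Parquet; the real queries select a subset of columns and may scan less.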