
Spark SQL Drop vs Select

Contributor

I want to pick and choose only a subset of the columns of a DataFrame / table, given some Array of columns.

E.g., given a table with columns [a,b,c,d,e], I want to keep [a,c,e].

I see that I could either drop all columns that are not in my Array, or select the columns in my Array.
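In code, the two options look roughly like this (a minimal sketch; the DataFrame and column names here are just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3, 4, 5)], ["a", "b", "c", "d", "e"])
keep = ["a", "c", "e"]

# Option 1: select the columns to keep
via_select = df.select(*keep)

# Option 2: drop every column not in the keep list
via_drop = df.drop(*[c for c in df.columns if c not in keep])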

My question is: which one is preferred efficiency-wise? Or does it even matter?

Thank you.

4 REPLIES

Super Collaborator

Spark uses an optimizer under the hood, so there will be no performance difference.

Master Guru

You can try both and look at the DAGs in the Spark History UI.

Most likely the performance will be the same.

What does your SQL look like? There are some optimizations that may apply.
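For example, you can also print the plans directly instead of going through the History UI (a quick sketch, assuming a DataFrame df with columns a, b, c, d, e):

df.select("a", "c", "e").explain()
df.drop("b", "d").explain()
# If Catalyst prunes the columns the same way, both print an identical physical plan.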


I don't think the performance will be the same.

Consider the two statements below. Let's say I have a DataFrame with columns a,b,c,d,e and I want to select only a,b,c.

1) df.select("a","b","c"): this selects only the required columns and never touches the others.

2) df.select("a","b","c","d","e").drop("d","e"): this first selects all the columns from the DataFrame and then drops the unwanted ones.

I think the second statement will incur some performance penalty.

If I am wrong, please clarify.
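One way to check this is to compare the physical plans of both statements (a sketch, assuming the same DataFrame df):

df.select("a", "b", "c").explain()
df.select("a", "b", "c", "d", "e").drop("d", "e").explain()
# If the plans are identical, the optimizer has collapsed the extra projection.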


New Contributor

The Spark Catalyst optimizer is smart. Only if it is not optimizing well do you have to think about it; otherwise it handles the optimization for you.

Below is one example:

fr = spark.createDataFrame([{'a': 1}, {'b': 2}])  # schema is inferred as columns a and b

fr.select('a','b').drop('a').explain(True)  # prints the parsed, analyzed, optimized, and physical plans

The parsed logical plan for the above query is below:

== Parsed Logical Plan ==
Project [b#69L]
+- Project [a#68L, b#69L]
   +- LogicalRDD [a#68L, b#69L], false

And the physical plan is below:

== Physical Plan ==
*(1) Project [b#69L]
+- *(1) Scan ExistingRDD[a#68L,b#69L]

Spark optimizes the query from two projections down to a single projection, which is the same as the physical plan of fr.select('b').
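To double-check, you can print the plan of the direct selection (continuing the example above; the column IDs will vary per run):

fr.select('b').explain()
# Expect the same single Project over the RDD scan,
# i.e. select(...).drop(...) and the direct select compile to the same physical plan.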