Spark SQL Drop vs Select

Explorer

I want to pick and choose only a subset of the columns of a DataFrame / table, given some Array of Columns.

E.g., given a table with columns [a, b, c, d, e], I want to keep [a, c, e].

I see that I could either drop all columns that are not in my Array, or select the columns in my Array.

My question is: which one is preferred efficiency-wise? Or does it even matter?
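For concreteness, a minimal PySpark sketch of the two options I have in mind (df and the keep list are just placeholders):

keep = ["a", "c", "e"]

# Option 1: select only the columns to keep.
subset = df.select(*keep)

# Option 2: drop every column that is not in the keep list.
to_drop = [c for c in df.columns if c not in keep]
subset = df.drop(*to_drop)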

Thank you.


Expert Contributor

Spark uses an optimizer under the hood, so there will be no performance difference.

Super Guru

You can try both and look at the DAGs in the Spark History UI.

Most likely the performance will be the same.

What's your SQL? There are some optimizations that depend on the specific query.
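A quick way to try both without the History UI is to print the plans with explain(); a sketch, assuming the hypothetical df and keep list from the question:

to_drop = [c for c in df.columns if c not in keep]

# After Catalyst optimization both should yield the same physical plan.
df.select(*keep).explain(True)
df.drop(*to_drop).explain(True)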

I don't think the performance will be the same.

Consider the two statements below. Say I have a DataFrame with columns a, b, c, d, e and I want to select only a, b, c.

1) df.select("a", "b", "c") — this selects only the required columns and never touches the other columns.

2) df.select("a", "b", "c", "d", "e").drop("d", "e") — this first selects all the columns from the DataFrame and then drops the unwanted ones.

I think the second statement will have some performance penalty.

If I am wrong, please clarify.


New Contributor

The Spark Catalyst optimizer is smart. You only have to think about this if it is not optimizing well, and here it is able to optimize the query.

Below is one example:

fr = spark.createDataFrame([{'a': 1}, {'b': 2}])

fr.select('a', 'b').drop('a').explain(True)  # print the parsed and physical plans

The parsed logical plan for the above query is:

== Parsed Logical Plan ==
Project [b#69L]
+- Project [a#68L, b#69L]
   +- LogicalRDD [a#68L, b#69L], false

 

And the physical plan is:

== Physical Plan ==
*(1) Project [b#69L]
+- *(1) Scan ExistingRDD[a#68L,b#69L]

Spark optimizes the query from two projections down to a single projection, which is the same as the physical plan of fr.select('b').
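As a quick sanity check (a sketch using the same fr as above), you can print the plan of the direct select and compare:

fr.select('b').explain()  # expect the same single Project [b#...L] over the ExistingRDD scan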