Created 06-30-2016 06:46 PM
I want to pick out only a subset of the columns of a DataFrame/table, given some Array of columns.
E.g., given a table with columns [a,b,c,d,e], I want to keep [a,c,e].
I see that I could either drop all columns that are not in my Array, or select the columns that are in it.
My question is: which one is preferred efficiency-wise? Or does it even matter?
Thank you.
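For reference, the two approaches being contrasted can be sketched like this (a minimal sketch; `keep`, `all_cols`, and the column names are illustrative, and the commented-out lines assume a PySpark DataFrame `df`):

```python
# Columns we want to keep, and the table's full column list.
keep = ["a", "c", "e"]
all_cols = ["a", "b", "c", "d", "e"]

# Option 1: select exactly the columns to keep.
#   df.select(*keep)

# Option 2: drop everything that is not in the keep list.
to_drop = [c for c in all_cols if c not in keep]
#   df.drop(*to_drop)

print(to_drop)
```

Either way you end up with the same set of columns; the question is whether Spark executes them differently.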
Created 06-30-2016 09:36 PM
Spark uses an optimizer under the hood, so there will be no performance difference.
Created 06-30-2016 09:50 PM
You can try both and look at the DAGs in the Spark History UI.
Most likely performance will be the same.
What does your SQL look like? There are some optimizations that depend on the exact query.
Created 11-09-2017 08:49 AM
I don't think the performance will be the same.
Consider the two statements below. Say I have a DataFrame with columns a, b, c, d, e and I want to select only a, b, c:
1) df.select("a", "b", "c") — this selects only the required columns and never touches the others.
2) df.select("a", "b", "c", "d", "e").drop("d", "e") — this first selects all the columns from the DataFrame and then drops the unwanted ones.
I think the second statement will have some performance penalty.
If I am wrong, please clarify.
Created 01-18-2020 02:50 AM
Spark's Catalyst optimiser is smart. In cases like this it is able to optimise the query on its own; you only need to think about it yourself if the plan shows it has not.
Below is one example:
fr = spark.createDataFrame([{'a':1},{'b':2}])
fr.select('a','b').drop('a')
The parsed logical plan for the query above is:
== Parsed Logical Plan ==
Project [b#69L]
+- Project [a#68L, b#69L]
+- LogicalRDD [a#68L, b#69L], false
And the physical plan is:
== Physical Plan ==
*(1) Project [b#69L]
+- *(1) Scan ExistingRDD[a#68L,b#69L]
Spark is optimising the query from two projections down to a single projection, which is the same as the physical plan of fr.select('b').