Created 06-30-2016 06:46 PM
I want to pick out only a subset of the columns of a DataFrame/table, given some Array of columns.
E.g., given a table with columns [a,b,c,d,e], I want to keep [a,c,e].
I see that I could either drop all columns that are not in my Array, or select the columns that are in it.
My question is: which one is preferred efficiency-wise? Or does it even matter?
Thank you.
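For reference, the two approaches being contrasted can be sketched like this (a minimal sketch; `keep`, `all_cols`, and the column names are illustrative, and the commented-out lines assume a PySpark DataFrame `df`):

```python
# Columns we want to keep, and the table's full column list.
keep = ["a", "c", "e"]
all_cols = ["a", "b", "c", "d", "e"]

# Option 1: select exactly the columns to keep.
#   df.select(*keep)

# Option 2: drop everything that is not in the keep list.
to_drop = [c for c in all_cols if c not in keep]
#   df.drop(*to_drop)

print(to_drop)
```

Either way you end up with the same set of columns; the question is whether Spark executes them differently.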
Created 06-30-2016 09:36 PM
Spark uses an optimizer under the hood, so there will be no performance difference.
Created 06-30-2016 09:50 PM
You can try both and look at the DAGs in the Spark History UI.
Most likely performance will be the same.
What does your SQL look like? There are some optimizations that depend on the exact query.
Created 11-09-2017 08:49 AM
I don't think the performance will be the same.
Consider the two statements below. Say I have a DataFrame with columns a, b, c, d, e and I want to select only a, b, c:
1) df.select("a", "b", "c") — this selects only the required columns and never touches the others.
2) df.select("a", "b", "c", "d", "e").drop("d", "e") — this first selects all the columns from the DataFrame and then drops the unwanted ones.
I think the second statement will have some performance penalty.
If I am wrong, please clarify.
Created 01-18-2020 02:50 AM
Spark's Catalyst optimiser is smart. In cases like this it is able to optimise the query on its own; you only need to think about it yourself if the plan shows it has not.
Below is one example:
fr = spark.createDataFrame([{'a':1},{'b':2}])
fr.select('a','b').drop('a')
The parsed logical plan for the query above is:
== Parsed Logical Plan ==
Project [b#69L]
+- Project [a#68L, b#69L]
+- LogicalRDD [a#68L, b#69L], false
And the physical plan is:
== Physical Plan ==
*(1) Project [b#69L]
+- *(1) Scan ExistingRDD[a#68L,b#69L]
Spark is optimising the query from two projections down to a single projection, which is the same as the physical plan of fr.select('b').