Spark SQL Drop vs Select

Explorer

I want to pick and choose only a subset of the columns of a dataframe / table given some Array of Columns

E.g., given a table with columns [a,b,c,d,e] and I want to keep [a,c,e].

I see that I could either drop all columns that are not in my Array, or select the columns in my Array.

My question is which one is preferred efficiency-wise? Or does it even matter?

Thank you.
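For concreteness, a minimal sketch in plain Python (column names taken from the question; df is a hypothetical DataFrame) showing that the drop formulation needs the complement of the keep-list:

```python
# Columns of the hypothetical table, and the subset we want to keep.
all_cols = ["a", "b", "c", "d", "e"]
keep = ["a", "c", "e"]

# The drop() formulation needs everything NOT in the keep-list.
to_drop = [c for c in all_cols if c not in keep]
print(to_drop)  # ['b', 'd']

# The two formulations would then be:
#   df.select(*keep)
#   df.drop(*to_drop)
```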

4 REPLIES

Re: Spark SQL Drop vs Select

Expert Contributor

Spark uses an optimizer (Catalyst) under the hood, so there should be no performance difference.


Re: Spark SQL Drop vs Select

Super Guru

You can try both and look at the DAGs in the Spark History Server UI.

Most likely the performance will be the same.

What does your SQL look like? There are some optimizations that may apply.


Re: Spark SQL Drop vs Select

I don't think the performance will be the same.

Consider the two statements below. Say I have a DataFrame with columns a, b, c, d, e and I want to select only a, b, c:

1) df.select("a", "b", "c") — this selects only the required columns and never touches the others.

2) df.select("a", "b", "c", "d", "e").drop("d", "e") — this first selects all the columns from the DataFrame and then drops the unwanted ones.

I think the second statement will have some performance overhead.

If I am wrong, please clarify.


Re: Spark SQL Drop vs Select

New Contributor

The Spark Catalyst optimizer is smart. If it were failing to optimize this, you would have to think about it; but it is able to optimize it.

Below is one example:

fr = spark.createDataFrame([{'a': 1}, {'b': 2}])

fr.select('a', 'b').drop('a')

The parsed logical plan for the above query is:

== Parsed Logical Plan ==
Project [b#69L]
+- Project [a#68L, b#69L]
   +- LogicalRDD [a#68L, b#69L], false

 

And the physical plan is:

== Physical Plan ==
*(1) Project [b#69L]
+- *(1) Scan ExistingRDD[a#68L,b#69L]

Spark is optimizing the query from two projections down to a single projection, which is the same as the physical plan of fr.select('b').
