Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How to replace blank rows in pyspark Dataframe?

avatar
Super Collaborator

I am using Spark 1.6.2 and I have a data frame like this:

|Extension|

|gif |

|gif |

|gif |

|gif |

|html |

|gif |

|html |

|html |

| |

| |

|gif |

As you can see, there are some blank rows. They are not null because when I ran isNull() on the data frame, it showed false for all records. Then I thought of replacing those blank values to something like 'None' using regexp_replace. It does not affect the data frame column values. Running the following command right now:

%pyspark

from pyspark.sql.functions import *

extension_df3 = extension_df1.select(regexp_replace('Extension','\\s','None').alias('Extension'))

extension_df3.show(100,truncate=False)

I am replacing on the basis of white space which I guess is wrong. Can somebody please guide me how to do it?

1 ACCEPTED SOLUTION

avatar
Super Collaborator

It worked, I changed regexp_replace to replace function. Used the following command:

%pyspark

from pyspark.sql.functions import *

extension_df4 = extension_df1.replace('','None','Extension').alias('Extension')

extension_df4.show(100,truncate=False)

It gives me the following output:

|Extension|

|gif |

|gif |

|gif |

|gif |

|html |

|gif |

|html |

|html |

|None |

|None |

|gif | |

View solution in original post

3 REPLIES 3

avatar
Super Collaborator

More information, when I am doing:

%pyspark

from pyspark.sql.functions import *

extension_df3 = extension_df1.select(regexp_replace('Extension','','None').alias('Extension'))

extension_df3.show(100,truncate=False)

It changes the data frame in the state which I do not want:

Extension |

|NonegNoneiNonefNone |

|NonegNoneiNonefNone |

|NonegNoneiNonefNone |

|NonegNoneiNonefNone |

|NonehNonetNonemNonelNone|

|NonegNoneiNonefNone |

|NonehNonetNonemNonelNone| |

NonehNonetNonemNonelNone|

|None |

|None |

|NonegNoneiNonefNone |

avatar
Super Collaborator

It worked, I changed regexp_replace to replace function. Used the following command:

%pyspark

from pyspark.sql.functions import *

extension_df4 = extension_df1.replace('','None','Extension').alias('Extension')

extension_df4.show(100,truncate=False)

It gives me the following output:

|Extension|

|gif |

|gif |

|gif |

|gif |

|html |

|gif |

|html |

|html |

|None |

|None |

|gif | |

avatar
Contributor

@Mushtaq Rizvi I hope what ever you're doing above is just replacing with "None" which is a string which consumes memory.

Let's I've a scenario. I wanted to replace the blank spaces like below with null values. Can you suggest something on how to do this. Because the whitespaces consume memory where as null values doesn't.


|Extension|

|gif |

| |

|gif |

| |

|html |


I wanted it like this.


|Extension|

|gif |

|null|

|gif |

|null|

|html |