
How to replace blank rows in a PySpark DataFrame?

Super Collaborator

I am using Spark 1.6.2 and I have a data frame like this:

|Extension|
|gif      |
|gif      |
|gif      |
|gif      |
|html     |
|gif      |
|html     |
|html     |
|         |
|         |
|gif      |

As you can see, there are some blank rows. They are not null: when I ran isNull() on the data frame, it returned false for every record. I then tried replacing those blank values with something like 'None' using regexp_replace, but it does not affect the column values. I am currently running the following command:

%pyspark
from pyspark.sql.functions import *

extension_df3 = extension_df1.select(regexp_replace('Extension', '\\s', 'None').alias('Extension'))
extension_df3.show(100, truncate=False)

I am replacing on the basis of whitespace, which I guess is wrong. Can somebody please guide me on how to do this?
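[Editor's note] A possible explanation, sketched with Python's re module (the substitution semantics match what regexp_replace does): if the blank cells are actually empty strings rather than spaces, '\s' never matches, which would explain why nothing changed. An anchored pattern such as '^\s*$' matches only cells that are empty or all-whitespace.

```python
import re

# If the blank cells are empty strings, '\s' has nothing to match,
# so the value comes back unchanged -- "does not affect" the column.
print(re.sub(r'\s', 'None', ''))        # ''

# An anchored pattern matches only cells that are empty or all-whitespace.
print(re.sub(r'^\s*$', 'None', ''))     # 'None'
print(re.sub(r'^\s*$', 'None', '   '))  # 'None'
print(re.sub(r'^\s*$', 'None', 'gif'))  # 'gif'
```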

1 ACCEPTED SOLUTION


Super Collaborator

It worked. I changed regexp_replace to the replace function and used the following command:

%pyspark
from pyspark.sql.functions import *

extension_df4 = extension_df1.replace('', 'None', 'Extension').alias('Extension')
extension_df4.show(100, truncate=False)

It gives me the following output:

|Extension|
|gif      |
|gif      |
|gif      |
|gif      |
|html     |
|gif      |
|html     |
|html     |
|None     |
|None     |
|gif      |
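[Editor's note] A word on why this works where regexp_replace did not: DataFrame.replace matches the whole cell value exactly, while regexp_replace substitutes occurrences of a pattern inside the value. A pure-Python sketch of the exact-match semantics, with a list standing in for the column:

```python
rows = ['gif', 'html', '', 'gif']

# replace()-style semantics: swap a value only when the entire cell
# equals the target (here the empty string); other cells pass through.
exact = ['None' if v == '' else v for v in rows]
print(exact)  # ['gif', 'html', 'None', 'gif']
```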


3 REPLIES

Super Collaborator

More information: when I run the following,

%pyspark
from pyspark.sql.functions import *

extension_df3 = extension_df1.select(regexp_replace('Extension', '', 'None').alias('Extension'))
extension_df3.show(100, truncate=False)

it changes the data frame to a state I do not want:

|Extension               |
|NonegNoneiNonefNone     |
|NonegNoneiNonefNone     |
|NonegNoneiNonefNone     |
|NonegNoneiNonefNone     |
|NonehNonetNonemNonelNone|
|NonegNoneiNonefNone     |
|NonehNonetNonemNonelNone|
|NonehNonetNonemNonelNone|
|None                    |
|None                    |
|NonegNoneiNonefNone     |
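[Editor's note] This output is standard regex behaviour: an empty pattern matches at every position in the string (before each character and at the end), so the replacement text is inserted everywhere. A quick check with Python's re module:

```python
import re

# '' matches before 'g', before 'i', before 'f', and at the end,
# so 'None' is inserted at all four positions.
print(re.sub('', 'None', 'gif'))  # 'NonegNoneiNonefNone'
```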


@Mushtaq Rizvi I hope whatever you're doing above just replaces the blanks with "None", which is a string and still consumes memory.

Let's say I have a scenario: I want to replace the blank spaces, like the ones below, with null values. Can you suggest how to do this? Whitespace consumes memory, whereas null values do not.


|Extension|
|gif      |
|         |
|gif      |
|         |
|html     |


I want it like this:


|Extension|
|gif      |
|null     |
|gif      |
|null     |
|html     |
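[Editor's note] One way to get real nulls rather than the string "None" (assuming the blanks are empty or whitespace-only strings) is to map such values to None; in PySpark this is commonly written with when(...).otherwise(...) from pyspark.sql.functions. A pure-Python sketch of the mapping, with a list standing in for the column:

```python
rows = ['gif', '', 'gif', '  ', 'html']

# Map empty or whitespace-only strings to None (a real null);
# all other values pass through unchanged.
cleaned = [None if v.strip() == '' else v for v in rows]
print(cleaned)  # ['gif', None, 'gif', None, 'html']
```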