Archives of Support Questions (Read Only)

This is an archived, read-only board kept for historical reference. Information and links may no longer be available or relevant. To ask a new question, please post a new topic on the appropriate active board.

How to replace blank rows in pyspark Dataframe?

Super Collaborator

I am using Spark 1.6.2 and I have a data frame like this:

|Extension|
|gif      |
|gif      |
|gif      |
|gif      |
|html     |
|gif      |
|html     |
|html     |
|         |
|         |
|gif      |

As you can see, there are some blank rows. They are not null, because when I ran isNull() on the data frame it returned false for every record. I then tried replacing those blank values with the string 'None' using regexp_replace, but it does not affect the column values at all. This is the command I am running right now:

%pyspark
from pyspark.sql.functions import *
extension_df3 = extension_df1.select(regexp_replace('Extension', '\\s', 'None').alias('Extension'))
extension_df3.show(100, truncate=False)

I am replacing on the basis of whitespace, which I guess is wrong. Can somebody please guide me on how to do it?
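For what it's worth, `\s` only matches actual whitespace characters, so the substitution fires only when a cell contains spaces; a truly empty string gives the pattern nothing to match, which would explain why the column looked unchanged. The same regex behavior can be sketched with Python's own re module:

```python
import re

# A cell of spaces: each whitespace character is replaced individually
print(re.sub(r'\s', 'None', '   '))   # NoneNoneNone

# A truly empty cell: '\s' finds nothing to match, so nothing changes
print(re.sub(r'\s', 'None', ''))      # prints an empty line
```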

1 ACCEPTED SOLUTION

Super Collaborator

It worked once I changed regexp_replace to the replace function. I used the following command:

%pyspark
from pyspark.sql.functions import *
extension_df4 = extension_df1.replace('', 'None', 'Extension').alias('Extension')
extension_df4.show(100, truncate=False)

It gives me the following output:

|Extension|
|gif      |
|gif      |
|gif      |
|gif      |
|html     |
|gif      |
|html     |
|html     |
|None     |
|None     |
|gif      |
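The key difference, as far as I understand it, is that DataFrame.replace matches the whole cell value rather than substrings, so only cells that are exactly the empty string get swapped for 'None'. In plain Python terms, the per-cell logic is roughly:

```python
def replace_cell(value, to_replace='', replacement='None'):
    # Whole-value match, analogous to my understanding of DataFrame.replace:
    # only cells equal to to_replace change; substrings are left alone
    return replacement if value == to_replace else value

rows = ['gif', 'html', '', '', 'gif']
print([replace_cell(v) for v in rows])  # ['gif', 'html', 'None', 'None', 'gif']
```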


3 REPLIES

Super Collaborator

More information: when I run the following,

%pyspark
from pyspark.sql.functions import *
extension_df3 = extension_df1.select(regexp_replace('Extension', '', 'None').alias('Extension'))
extension_df3.show(100, truncate=False)

it changes the data frame into a state I do not want:

|Extension                |
|NonegNoneiNonefNone      |
|NonegNoneiNonefNone      |
|NonegNoneiNonefNone      |
|NonegNoneiNonefNone      |
|NonehNonetNonemNonelNone |
|NonegNoneiNonefNone      |
|NonehNonetNonemNonelNone |
|NonehNonetNonemNonelNone |
|None                     |
|None                     |
|NonegNoneiNonefNone      |
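That garbled output makes sense: an empty pattern matches at every position between characters, so the replacement string gets inserted around every single letter. Python's own re module shows the same behavior:

```python
import re

# An empty pattern matches at every character boundary,
# so 'None' is inserted before each letter and at the end
print(re.sub('', 'None', 'gif'))  # NonegNoneiNonefNone

# An empty cell has exactly one boundary, hence the bare 'None' rows
print(re.sub('', 'None', ''))     # None
```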



@Mushtaq Rizvi Note that what you are doing above just replaces the blanks with the string "None", which still consumes memory.

Let's say I have a scenario: I want to replace the blank spaces, like below, with null values. Can you suggest how to do this? Whitespace still consumes memory, whereas null values do not.


|Extension|
|gif      |
|         |
|gif      |
|         |
|html     |


I want it to look like this:


|Extension|
|gif      |
|null     |
|gif      |
|null     |
|html     |