
How to replace blank rows in pyspark Dataframe?

Super Collaborator

I am using Spark 1.6.2 and I have a data frame like this:

|Extension|
|gif      |
|gif      |
|gif      |
|gif      |
|html     |
|gif      |
|html     |
|html     |
|         |
|         |
|gif      |

As you can see, there are some blank rows. They are not null, because when I ran isNull() on the data frame it showed false for all records. I then thought of replacing those blank values with something like 'None' using regexp_replace, but it does not affect the column values at all. I am currently running the following command:

%pyspark
from pyspark.sql.functions import *

extension_df3 = extension_df1.select(regexp_replace('Extension', '\\s', 'None').alias('Extension'))
extension_df3.show(100, truncate=False)

I am replacing on the basis of whitespace, which I guess is wrong. Can somebody please guide me on how to do this?
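
As a quick sanity check, a sketch like the following (using the extension_df1 / Extension names from above, and the standard trim and length functions) can confirm that the blank cells are empty or whitespace-only strings rather than nulls:

%pyspark
# Hedged sketch: count rows where Extension is an empty or whitespace-only
# string, to confirm the blanks are strings rather than nulls.
from pyspark.sql.functions import col, length, trim

extension_df1.filter(length(trim(col('Extension'))) == 0).count()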

1 ACCEPTED SOLUTION

Super Collaborator

It worked. I changed regexp_replace to the replace function and used the following command:

%pyspark
from pyspark.sql.functions import *

extension_df4 = extension_df1.replace('', 'None', 'Extension').alias('Extension')
extension_df4.show(100, truncate=False)

It gives me the following output:

|Extension|
|gif      |
|gif      |
|gif      |
|gif      |
|html     |
|gif      |
|html     |
|html     |
|None     |
|None     |
|gif      |
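
One caveat on this approach, as a sketch only: replace matches exact values, so it only catches cells that are truly the empty string ''. If some of the blank cells actually contain spaces, trimming the column first (same extension_df1 / Extension names as above) should let the same replace catch them:

%pyspark
# Hedged sketch: trim the column first so blanks that contain spaces are also
# reduced to '' and caught by the exact-match replace.
from pyspark.sql.functions import trim, col

extension_df4 = (extension_df1
                 .withColumn('Extension', trim(col('Extension')))
                 .replace('', 'None', 'Extension'))
extension_df4.show(100, truncate=False)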



Super Collaborator

More information: when I run the following,

%pyspark
from pyspark.sql.functions import *

extension_df3 = extension_df1.select(regexp_replace('Extension', '', 'None').alias('Extension'))
extension_df3.show(100, truncate=False)

it changes the data frame into a state which I do not want:

|Extension               |
|NonegNoneiNonefNone     |
|NonegNoneiNonefNone     |
|NonegNoneiNonefNone     |
|NonegNoneiNonefNone     |
|NonehNonetNonemNonelNone|
|NonegNoneiNonefNone     |
|NonehNonetNonemNonelNone|
|NonehNonetNonemNonelNone|
|None                    |
|None                    |
|NonegNoneiNonefNone     |
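
That behaviour is expected: an empty pattern matches at every position in the string, so 'None' is inserted between every character. If regexp_replace is preferred, a sketch that anchors the pattern so it only matches values that are entirely empty or whitespace (same DataFrame and column names as above) would be:

%pyspark
# Hedged sketch: '^\\s*$' only matches strings that are empty or all
# whitespace, so values such as 'gif' and 'html' are left untouched.
from pyspark.sql.functions import regexp_replace

extension_df3 = extension_df1.select(
    regexp_replace('Extension', '^\\s*$', 'None').alias('Extension'))
extension_df3.show(100, truncate=False)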


@Mushtaq Rizvi I hope whatever you're doing above is just replacing with "None", which is a string and therefore still consumes memory.

Let's say I have a scenario: I want to replace blank values like the ones below with null values. Can you suggest something on how to do this? The whitespace consumes memory, whereas null values don't. (A sketch of one possible way follows the example tables below.)


|Extension|
|gif      |
|         |
|gif      |
|         |
|html     |


I want it to look like this:


|Extension|
|gif      |
|null     |
|gif      |
|null     |
|html     |
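
A possible sketch for this (not part of the accepted answer): use when/otherwise so empty or whitespace-only values become real nulls. extension_df1 and Extension are the names used earlier in the thread, and extension_df5 is just a hypothetical name for the result:

%pyspark
# Hedged sketch: turn empty or whitespace-only Extension values into real
# nulls instead of the string 'None'.
from pyspark.sql.functions import col, when, trim, lit

extension_df5 = extension_df1.withColumn(
    'Extension',
    when(trim(col('Extension')) == '', lit(None)).otherwise(col('Extension')))
extension_df5.show(100, truncate=False)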