Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

How to replace blank rows in pyspark Dataframe?

Solved Go to solution
Highlighted

How to replace blank rows in pyspark Dataframe?

Super Collaborator

I am using Spark 1.6.2 and I have a data frame like this:

|Extension|

|gif |

|gif |

|gif |

|gif |

|html |

|gif |

|html |

|html |

| |

| |

|gif |

As you can see, there are some blank rows. They are not null because when I ran isNull() on the data frame, it showed false for all records. Then I thought of replacing those blank values to something like 'None' using regexp_replace. It does not affect the data frame column values. Running the following command right now:

%pyspark

from pyspark.sql.functions import *

extension_df3 = extension_df1.select(regexp_replace('Extension','\\s','None').alias('Extension'))

extension_df3.show(100,truncate=False)

I am replacing on the basis of white space which I guess is wrong. Can somebody please guide me how to do it?

1 ACCEPTED SOLUTION

Accepted Solutions

Re: How to replace blank rows in pyspark Dataframe?

Super Collaborator

It worked, I changed regexp_replace to replace function. Used the following command:

%pyspark

from pyspark.sql.functions import *

extension_df4 = extension_df1.replace('','None','Extension').alias('Extension')

extension_df4.show(100,truncate=False)

It gives me the following output:

|Extension|

|gif |

|gif |

|gif |

|gif |

|html |

|gif |

|html |

|html |

|None |

|None |

|gif | |

3 REPLIES 3

Re: How to replace blank rows in pyspark Dataframe?

Super Collaborator

More information, when I am doing:

%pyspark

from pyspark.sql.functions import *

extension_df3 = extension_df1.select(regexp_replace('Extension','','None').alias('Extension'))

extension_df3.show(100,truncate=False)

It changes the data frame in the state which I do not want:

Extension |

|NonegNoneiNonefNone |

|NonegNoneiNonefNone |

|NonegNoneiNonefNone |

|NonegNoneiNonefNone |

|NonehNonetNonemNonelNone|

|NonegNoneiNonefNone |

|NonehNonetNonemNonelNone| |

NonehNonetNonemNonelNone|

|None |

|None |

|NonegNoneiNonefNone |

Re: How to replace blank rows in pyspark Dataframe?

Super Collaborator

It worked, I changed regexp_replace to replace function. Used the following command:

%pyspark

from pyspark.sql.functions import *

extension_df4 = extension_df1.replace('','None','Extension').alias('Extension')

extension_df4.show(100,truncate=False)

It gives me the following output:

|Extension|

|gif |

|gif |

|gif |

|gif |

|html |

|gif |

|html |

|html |

|None |

|None |

|gif | |

Re: How to replace blank rows in pyspark Dataframe?

New Contributor

@Mushtaq Rizvi I hope what ever you're doing above is just replacing with "None" which is a string which consumes memory.

Let's I've a scenario. I wanted to replace the blank spaces like below with null values. Can you suggest something on how to do this. Because the whitespaces consume memory where as null values doesn't.


|Extension|

|gif |

| |

|gif |

| |

|html |


I wanted it like this.


|Extension|

|gif |

|null|

|gif |

|null|

|html |

Don't have an account?
Coming from Hortonworks? Activate your account here