Support Questions

PySpark: how to fix a script that never finishes


I have a huge dataset (1 TB) with 9 billion distinct IDs, and the script I developed never finishes.

A sample of my dataset for one ID:

    ID  day    location1  location2
    a   05/01  Rome       Paris
    a   08/01  Zurich     Amsterdam
    a   09/01  None       Rome

What I want:

    ID  day    location1  location2
    a   05/01  Rome       Paris
    a   06/01  Paris      Paris
    a   07/01  Paris      Paris
    a   08/01  Zurich     Amsterdam
    a   09/01  Amsterdam  Rome

As shown in the example, I need to add all the missing days for each user, assuming the user did not move on the days for which I have no records.
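To make the intended transformation concrete, here is a minimal pure-Python sketch of the gap-filling logic on the toy sample above (the function name `fill_missing_days`, the tuple layout, and the illustrative year 2019 are my own assumptions, not from the original post):

```python
from datetime import date, timedelta

# Toy rows mirroring the sample: (id, day, location1, location2).
# The year 2019 is arbitrary, purely for illustration.
rows = [
    ("a", date(2019, 1, 5), "Rome", "Paris"),
    ("a", date(2019, 1, 8), "Zurich", "Amsterdam"),
    ("a", date(2019, 1, 9), None, "Rome"),
]

def fill_missing_days(rows):
    """For each ID, insert a row for every missing day, carrying the
    last known location2 forward as both locations (the user did not
    move), and fill a None location1 with the previous location2."""
    out = []
    prev = None
    for r in sorted(rows, key=lambda r: (r[0], r[1])):
        if prev is not None and r[0] == prev[0]:
            d = prev[1] + timedelta(days=1)
            while d < r[1]:
                # No record on this day: user stayed where they last arrived.
                out.append((r[0], d, prev[3], prev[3]))
                d += timedelta(days=1)
        loc1 = r[2] if r[2] is not None else prev[3]
        out.append((r[0], r[1], loc1, r[3]))
        prev = out[-1]
    return out

for row in fill_missing_days(rows):
    print(row)
```

At 1 TB this per-ID loop would not scale, but the same logic maps onto PySpark building blocks (assuming Spark 2.4+): generate the missing dates per ID with `F.sequence` over consecutive days and `F.explode`, then forward-fill with `F.last(col, ignorenulls=True)` over a `Window.partitionBy("ID").orderBy("day")`.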

Does anyone have an idea of how to approach this problem efficiently?

Thanks