
PySpark: how to fix a never-ending job


I have a huge dataset (1 TB) with 9 billion distinct IDs, and the script I developed never finishes.

A sample of my dataset for one ID:

 

    ID  day    location1  location2
    a   05/01  Rome       Paris
    a   08/01  Zurich     Amsterdam
    a   09/01  None       Rome

What I want:

    ID  day    location1  location2
    a   05/01  Rome       Paris
    a   06/01  Paris      Paris
    a   07/01  Paris      Paris
    a   08/01  Zurich     Amsterdam
    a   09/01  Amsterdam  Rome

As shown in the example, I need to add every missing day for each user and assume that the user does not move on the days for which I have no record (i.e., forward-fill the last known location).
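For concreteness, here is a minimal sketch of the transformation I am after (this is an illustration, not my actual script). It assumes Spark 2.4+ for the sequence() builtin, and it uses full ISO dates in place of the 05/01-style days shown above:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("fill-missing-days").getOrCreate()

    # Toy version of the sample above, with full dates.
    df = spark.createDataFrame(
        [("a", "2018-01-05", "Rome", "Paris"),
         ("a", "2018-01-08", "Zurich", "Amsterdam"),
         ("a", "2018-01-09", None, "Rome")],
        ["ID", "day", "location1", "location2"],
    ).withColumn("day", F.to_date("day"))

    w = Window.partitionBy("ID").orderBy("day")

    filled = (
        df
        # A missing departure point is the previous record's arrival point.
        .withColumn("location1",
                    F.coalesce("location1", F.lag("location2").over(w)))
        # Expand each record into one row per day until the next record.
        .withColumn("next_day", F.lead("day").over(w))
        .withColumn("all_days",
                    F.when(F.col("next_day").isNull(), F.array("day"))
                     .otherwise(F.expr("sequence(day, date_sub(next_day, 1))")))
        .withColumn("filled_day", F.explode("all_days"))
        # On the generated gap days the user stays at the arrival location.
        .withColumn("location1",
                    F.when(F.col("filled_day") == F.col("day"), F.col("location1"))
                     .otherwise(F.col("location2")))
        .select("ID", F.col("filled_day").alias("day"), "location1", "location2")
    )

    filled.orderBy("ID", "day").show()

The window functions here only partition by ID rather than looping per user, so I would expect it to distribute, but with 9 billion IDs I am not sure this is the most efficient way to do it.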

Does anyone have an idea for approaching this problem efficiently?

Thanks
