Support Questions

PySpark Dataframe: Add an automatically incrementing field grouped by another field


Hi,

 

I have a PySpark DataFrame with a timestamp and an item identifier, as follows:

 

|datereceived       |messagetype|
|2015-01-01 00:00:29|          1|
|2015-01-01 00:01:22|          1|
|2015-01-01 00:04:19|          1|
|2015-01-01 00:10:39|          1|
|2015-01-01 00:00:59|          2|
|2015-01-01 00:03:11|          2|
|2015-01-01 00:06:33|          2|
|2015-01-01 00:00:11|          3|
|2015-01-01 00:00:59|          3|

 

I would like to add a new column containing an incrementing integer which increases in chronological order based on the datereceived field, and resets each time it encounters a new 'messagetype'... but I can't for the life of me figure out how to do it. To be fair, I am very new to PySpark - I'd have no problem doing this in MSSQL!

 

What I would like to see would be:

 

|datereceived       |messagetype|index|
|2015-01-01 00:00:29|          1|    1|
|2015-01-01 00:01:22|          1|    2|
|2015-01-01 00:04:19|          1|    3|
|2015-01-01 00:10:39|          1|    4|
|2015-01-01 00:00:59|          2|    1|
|2015-01-01 00:03:11|          2|    2|
|2015-01-01 00:06:33|          2|    3|
|2015-01-01 00:00:11|          3|    1|
|2015-01-01 00:00:59|          3|    2|

 

Could anyone help me please?

 

I really appreciate any help anyone can offer.
