pandas groupby on date to give rolling data slices

Question

I have sports data, exemplified here by a running group: each row holds a distance in km, the date of the run, and the runner's name:

import pandas as pd

df=pd.DataFrame({'name': 'Jack Jill Bob Bella Norm Nella Jack Jill Bob Bella Norm Nella Jack Jill Bob Bella Norm Nella'.split(),
                'date': '05-04-2021 05-04-2021 05-04-2021 06-04-2021 05-04-2021 06-04-2021 06-04-2021 08-04-2021 11-04-2021 08-04-2021 11-04-2021 08-04-2021 11-04-2021 11-04-2021 15-04-2022 15-04-2022 18-04-2022 19-04-2022'.split(),
                'km': [5.85, 5.18, 13.58, 14.45, 14.58, 11.14, 8.85, 10.77, 12.54, 7.09, 7.69, 11.64, 9.82, 11.20, 10.33, 11.31, 14.66, 12.56]})

df['date']=pd.to_datetime(df['date'], infer_datetime_format=True)

I would like to group by date so that each group is an expanding slice of the data (all rows up to and including that date) to aggregate on. I can do this with a loop that filters on each unique date, sums the km per name, and then adds the date back in as a separate column. The following code produces the type and format of data I'm after.

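frames = []
for d in df.date.unique():
    rolling = df[df.date <= d].groupby('name')[['km']].sum()  # totals up to and including d
    rolling['date'] = d
    frames.append(rolling)
result = pd.concat(frames)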

I would like to accomplish this using .groupby(), as my real data is much larger and what I actually want to do is more complex. I'm happy to be pointed to an existing answer that I didn't find when searching.

mozway · Accepted Answer · 2022-06-27 12:07:15Z

The expected output is unclear, but assuming you want the cumulative km for each name at each date, you could use:

out = (df
 .groupby(['name', 'date']).sum()
 .groupby(level='name').cumsum()
 .reset_index()
)

output:

     name       date     km
0   Bella 2021-06-04  14.45
1   Bella 2021-08-04  21.54
2   Bella 2022-04-15  32.85
3     Bob 2021-05-04  13.58
4     Bob 2021-11-04  26.12
5     Bob 2022-04-15  36.45
6    Jack 2021-05-04   5.85
7    Jack 2021-06-04  14.70
8    Jack 2021-11-04  24.52
9    Jill 2021-05-04   5.18
10   Jill 2021-08-04  15.95
11   Jill 2021-11-04  27.15
12  Nella 2021-06-04  11.14
13  Nella 2021-08-04  22.78
14  Nella 2022-04-19  35.34
15   Norm 2021-05-04  14.58
16   Norm 2021-11-04  22.27
17   Norm 2022-04-18  36.93

The same result can be viewed more conveniently as a 2D table using pivot (NaN marks dates on which a runner has no entry):

out2 = (df
 .groupby(['name', 'date']).sum()
 .groupby(level='name').cumsum()
 .reset_index()
 .pivot(index='date', columns='name', values='km')
)

output:

name        Bella    Bob   Jack   Jill  Nella   Norm
date                                                
2021-05-04    NaN  13.58   5.85   5.18    NaN  14.58
2021-06-04  14.45    NaN  14.70    NaN  11.14    NaN
2021-08-04  21.54    NaN    NaN  15.95  22.78    NaN
2021-11-04    NaN  26.12  24.52  27.15    NaN  22.27
2022-04-15  32.85  36.45    NaN    NaN    NaN    NaN
2022-04-18    NaN    NaN    NaN    NaN    NaN  36.93
2022-04-19    NaN    NaN    NaN    NaN  35.34    NaN
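
If the goal is a running total for every name on every date, as in the question's loop, one possible variation is to pivot first, count missing runs as 0 km, and then take the cumulative sum down the date axis. The sketch below is only an illustration: it uses a few of the question's rows for two runners, the names `daily` and `expanding` are made up for the example, and it assumes the date strings are day-month-year and parses them with an explicit format, so its dates will not match the labels printed above.

```python
import pandas as pd

# A few rows taken from the question's data (Jack and Jill only).
df = pd.DataFrame({
    'name': ['Jack', 'Jill', 'Jack', 'Jill', 'Jack'],
    'date': ['05-04-2021', '05-04-2021', '06-04-2021', '08-04-2021', '11-04-2021'],
    'km': [5.85, 5.18, 8.85, 10.77, 9.82],
})
# Assumption: the strings are day-month-year, so the format is given explicitly.
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')

# One row per date, one column per name; a date without a run counts as 0 km.
daily = df.pivot_table(index='date', columns='name', values='km',
                       aggfunc='sum', fill_value=0)

# Cumulative sum down the date axis: each row holds the totals up to that date.
expanding = daily.cumsum()
print(expanding)
```

Compared with the pivot above, a runner who has not run yet shows 0 rather than NaN, and the total carries forward on dates without a run. If a long table is preferred, `expanding.stack()` should give one row per date and name.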

2 Comments

INGl0R1AM0R1 Over a year ago

Hey moz, what is the difference between the cumsum method and the sum method?

mozway Over a year ago

cumsum is the cumulative sum; it gives a vector of the same size as the input. For example, the cumsum of [1, 2, 4] is [1, 3, 7] ([1, 1+2, 1+2+4]). The sum is a single value; here it would be 7.
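
The difference described in this comment can be checked with a minimal sketch on the same three numbers (the variable names are only for illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 4])

total = s.sum()        # a single value for the whole Series
running = s.cumsum()   # one value per input element

print(total)             # 7
print(running.tolist())  # [1, 3, 7]
```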
