I have sports data, for example a running group's distances, each associated with the date of the run and the runner's name:

import pandas as pd

df=pd.DataFrame({'name': 'Jack Jill Bob Bella Norm Nella Jack Jill Bob Bella Norm Nella Jack Jill Bob Bella Norm Nella'.split(),
                'date': '05-04-2021 05-04-2021 05-04-2021 06-04-2021 05-04-2021 06-04-2021 06-04-2021 08-04-2021 11-04-2021 08-04-2021 11-04-2021 08-04-2021 11-04-2021 11-04-2021 15-04-2022 15-04-2022 18-04-2022 19-04-2022'.split(),
                'km': [5.85, 5.18, 13.58, 14.45, 14.58, 11.14, 8.85, 10.77, 12.54, 7.09, 7.69, 11.64, 9.82, 11.20, 10.33, 11.31, 14.66, 12.56]})

df['date']=pd.to_datetime(df['date'], infer_datetime_format=True)
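
The sample dates look day-first (a value such as 15-04-2022 cannot be month-first), although that is an assumption about the data. Left to inference, 05-04-2021 can be read as 4 May while 15-04-2022 is read as 15 April, and recent pandas releases may warn or raise on such mixed input. A minimal sketch of parsing with an explicit format, using a hypothetical two-value series:

```python
import pandas as pd

# Two strings from the sample data; the first is ambiguous on its own,
# the second can only be day-month-year.
dates = pd.Series(['05-04-2021', '15-04-2022'])

# An explicit format makes every string parse the same way (assumed day-first).
parsed = pd.to_datetime(dates, format='%d-%m-%Y')
print(parsed.dt.strftime('%Y-%m-%d').tolist())  # ['2021-04-05', '2022-04-15']
```

With this parsing the chronological order of the sample dates is unchanged, so cumulative totals are unaffected; only the date labels differ.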

I would like to group by name and filter by date so that I aggregate over a rolling, expanding slice of the data. I can do this with a loop that filters on each unique date: each pass gives the summed km per runner up to that date, and I then add the date as a separate column. The following code produces the type and format of data I'm after.

for d in df.date.unique():
    # sum only km; recent pandas versions no longer drop the date column silently
    rolling = df[df.date <= d].groupby('name')[['km']].sum()
    rolling['date'] = d

I would like to accomplish this using .groupby() alone, as I have much more data and more complexity in what I actually want to do. I'm happy to be pointed to an existing answer that I haven't found after searching.

1 Answer

The expected output is unclear, but assuming you want the cumulative km for each name up to each date, you could use:

out = (df
 .groupby(['name', 'date']).sum()
 .groupby(level='name').cumsum()
 .reset_index()
)

output:

     name       date     km
0   Bella 2021-06-04  14.45
1   Bella 2021-08-04  21.54
2   Bella 2022-04-15  32.85
3     Bob 2021-05-04  13.58
4     Bob 2021-11-04  26.12
5     Bob 2022-04-15  36.45
6    Jack 2021-05-04   5.85
7    Jack 2021-06-04  14.70
8    Jack 2021-11-04  24.52
9    Jill 2021-05-04   5.18
10   Jill 2021-08-04  15.95
11   Jill 2021-11-04  27.15
12  Nella 2021-06-04  11.14
13  Nella 2021-08-04  22.78
14  Nella 2022-04-19  35.34
15   Norm 2021-05-04  14.58
16   Norm 2021-11-04  22.27
17   Norm 2022-04-18  36.93

The same result can be viewed more conveniently as a 2D table using pivot:

out2 = (df
 .groupby(['name', 'date']).sum()
 .groupby(level='name').cumsum()
 .reset_index()
 .pivot(index='date', columns='name', values='km')
)

output:

name        Bella    Bob   Jack   Jill  Nella   Norm
date                                                
2021-05-04    NaN  13.58   5.85   5.18    NaN  14.58
2021-06-04  14.45    NaN  14.70    NaN  11.14    NaN
2021-08-04  21.54    NaN    NaN  15.95  22.78    NaN
2021-11-04    NaN  26.12  24.52  27.15    NaN  22.27
2022-04-15  32.85  36.45    NaN    NaN    NaN    NaN
2022-04-18    NaN    NaN    NaN    NaN    NaN  36.93
2022-04-19    NaN    NaN    NaN    NaN  35.34    NaN
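
The pivoted table leaves NaN on dates where a runner did not run, whereas the loop in the question reports each runner's total so far on every date. One possible way to get that shape, sketched under the assumptions that a missing day counts as 0 km and that the dates are day-first (the name `totals` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ('Jack Jill Bob Bella Norm Nella ' * 3).split(),
    'date': ('05-04-2021 05-04-2021 05-04-2021 06-04-2021 05-04-2021 06-04-2021 '
             '06-04-2021 08-04-2021 11-04-2021 08-04-2021 11-04-2021 08-04-2021 '
             '11-04-2021 11-04-2021 15-04-2022 15-04-2022 18-04-2022 19-04-2022').split(),
    'km': [5.85, 5.18, 13.58, 14.45, 14.58, 11.14, 8.85, 10.77, 12.54,
           7.09, 7.69, 11.64, 9.82, 11.20, 10.33, 11.31, 14.66, 12.56]})

# Assumption: the strings are day-month-year.
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')

totals = (df
 # one row per date, one column per runner; days without a run become 0 km
 .pivot_table(index='date', columns='name', values='km',
              aggfunc='sum', fill_value=0)
 # running total down the (sorted) dates, per runner
 .cumsum()
)
print(totals)
```

Each row `totals.loc[d]` then corresponds to the loop's result for date `d`, except that runners with no runs yet appear with 0 rather than being absent.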
2 Comments

Hey moz, what is the difference between the cumsum method and the sum method?
cumsum is the cumulative sum; it gives a vector of the same size as the input. For example, the cumsum of [1, 2, 4] is [1, 3, 7] ([1, 1+2, 1+2+4]). The sum is a single value, which here would be 7.
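
The example in this comment can be checked directly; a minimal sketch using a throwaway series `s`:

```python
import pandas as pd

s = pd.Series([1, 2, 4])

# cumsum keeps one running total per input element
print(s.cumsum().tolist())  # [1, 3, 7]

# sum collapses the whole series to a single value
print(s.sum())  # 7
```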