
I have a dataframe. I would like to group by col1, order by col3, and detect changes from row to row in col2.

Here is my example:

import pandas as pd
import datetime

my_df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
                      'col2': [2, 2, 3, 5, 5, 5],
                      'col3': [datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1),
                               datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1)]})

my_df.sort_values(by=['col3'], inplace=True)
my_df_temp = my_df.groupby('col1')['col2'].apply(
    lambda x: x != x.shift(1)
).reset_index(name='col2_change')

Here is how my dataframe looks:

  col1  col2        col3
0    a     2  2023-02-01
1    a     2  2023-03-01
2    a     3  2023-04-01
3    b     5  2023-02-01
4    b     5  2023-03-01
5    b     5  2023-04-01

Here is what the result looks like:

  col1  level_1  col2_change
0    a        0         True
1    a        1        False
2    a        2         True
3    b        3         True
4    b        4        False
5    b        5        False

This is clearly incorrect. What am I doing wrong?

  • What do you expect the output to be? [T, T, F, T, T, T]? Commented Nov 12, 2024 at 21:54
  • my_df.groupby('col1')['col2'].apply(lambda x: x.shift().bfill() == x) Commented Nov 12, 2024 at 21:58
  • @Scott I guess they want [F, F, T, F, F, F] since 2 and 3 differ in group a. Commented Nov 12, 2024 at 23:30
  • You have two parts of the question down, so please focus on just the part you're having the problem with. E.g. a better title could be "Why is the first row always True when I'm trying to detect changes between rows?" See How to Ask if you want tips on writing a good title. Commented Nov 12, 2024 at 23:39
  • 2
    Please add the exact expected output. Commented Nov 13, 2024 at 7:16

2 Answers


First of all, your issue is not obvious; you should provide the expected output for clarity.

I imagine that you want to add a new column while keeping the existing columns unchanged. For that, you would need to use groupby.transform:

my_df['col2_change'] = (my_df
                        .groupby('col1')['col2']
                        .transform(lambda x: x != x.shift())
                       )

Variant with groupby.shift:

my_df['col2_change'] = (my_df
                        .groupby('col1')['col2']
                        .shift().ne(my_df['col2'])
                       )

In addition, if you don't want to map the first value of a group as True, you could perform a double shift:

my_df['col2_change2'] = (my_df
                         .groupby('col1')['col2']
                         .transform(lambda x: x.ne(x.shift(-1))
                                               .shift(fill_value=False))
                        )

NB: a double shift is preferred to bfill, which would incorrectly fill internal NaNs if there are any.
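
For illustration, here is a minimal sketch of that pitfall, using a hypothetical single-group Series with an internal NaN (not taken from the question's data):

import numpy as np

s = pd.Series([2, np.nan, 2, 3])

# bfill-based flag: the internal NaN in s.shift() is backfilled with a *later*
# value, so the row after the NaN is compared against the future
s.ne(s.shift().bfill())                    # False, True, False, True

# double-shift flag: internal NaNs are left untouched and only the first row
# is forced to False
s.ne(s.shift(-1)).shift(fill_value=False)  # False, True, True, True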

Or using duplicated and where:

my_df['col2_change2'] = (my_df
                         .groupby('col1')['col2']
                         .transform(lambda x: x != x.shift())
                         .where(my_df['col1'].duplicated(), False)
                       )

Output:

  col1  col2        col3  col2_change  col2_change2
0    a     2  2023-02-01         True         False
3    b     5  2023-02-01         True         False
1    a     2  2023-03-01        False         False
4    b     5  2023-03-01        False         False
2    a     3  2023-04-01         True          True
5    b     5  2023-04-01        False         False
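
Note that the rows above keep the col3-sorted order from your sort_values call; purely for display, you could bring the groups back together with:

my_df.sort_values(['col1', 'col3'])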



This should work. Input (I extended your example with an extra row in group a so that a value repeats after a change):

my_df = pd.DataFrame({'col1': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
                      'col2': [2, 3, 4, 4, 5, 5, 5],
                      'col3': [datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1),
                               datetime.date(2023, 5, 1),
                               datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1)]})

In the code I also used shift, but on the grouped my_df. To observe the changes per group, you can run:

# flag rows where col2 differs from the previous row within its col1 group;
# the first row of each group has no predecessor, so it is flagged True
my_df['change'] = my_df['col2'].ne(my_df.groupby('col1')['col2'].shift())

# index by col1 and col3 to view the change flags per group in date order
my_df_idx = my_df.set_index('col1')
my_df_idx.set_index('col3', append=True, inplace=True)
my_df_idx.sort_index(inplace=True)
my_df_idx

Output:

                 col2 change
col1 col3
a    2023-02-01     2   True
     2023-03-01     3   True
     2023-04-01     4   True
     2023-05-01     4  False
b    2023-02-01     5   True
     2023-03-01     5  False
     2023-04-01     5  False

