
I have a dataframe. I would like to group by col1, order by col3, and detect changes from row to row in col2.

Here is my example:

import pandas as pd
import datetime

my_df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
                      'col2': [2, 2, 3, 5, 5, 5],
                      'col3': [datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1),
                               datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1)]})

my_df.sort_values(by=['col3'], inplace=True)
my_df_temp = my_df.groupby('col1')['col2'].apply(
    lambda x: x != x.shift(1)
).reset_index(name='col2_change')

Here is how my dataframe looks:

  col1  col2        col3
0    a     2  2023-02-01
1    a     2  2023-03-01
2    a     3  2023-04-01
3    b     5  2023-02-01
4    b     5  2023-03-01
5    b     5  2023-04-01

Here is what the result looks like:

  col1  level_1  col2_change
0    a        0         True
1    a        1        False
2    a        2         True
3    b        3         True
4    b        4        False
5    b        5        False

This is clearly incorrect. What am I doing wrong?

  • What do you expect the output to be? [T, T, F, T, T, T]? Commented Nov 12, 2024 at 21:54
  • my_df.groupby('col1')['col2'].apply(lambda x: x.shift().bfill() == x) Commented Nov 12, 2024 at 21:58
  • @Scott I guess they want [F, F, T, F, F, F] since 2 and 3 differ in group a. Commented Nov 12, 2024 at 23:30
  • You have two parts of the question down, so please focus on just the part you're having the problem with. E.g. a better title could be "Why is the first row always True when I'm trying to detect changes between rows?" See How to Ask if you want tips on writing a good title. Commented Nov 12, 2024 at 23:39
  • 2
    Please add the exact expected output. Commented Nov 13, 2024 at 7:16

2 Answers


First of all, your issue is not obvious; you should provide the expected output for clarity.

I imagine that you want to add a new column while keeping the existing columns unchanged. For that, you would need to use groupby.transform:

my_df['col2_change'] = (my_df
                        .groupby('col1')['col2']
                        .transform(lambda x: x != x.shift())
                       )

Variant with groupby.shift:

my_df['col2_change'] = (my_df
                        .groupby('col1')['col2']
                        .shift().ne(my_df['col2'])
                       )

In addition, if you don't want to map the first value of a group as True, you could perform a double shift:

my_df['col2_change2'] = (my_df
                         .groupby('col1')['col2']
                         .transform(lambda x: x.ne(x.shift(-1))
                                               .shift(fill_value=False))
                        )

NB: a double shift is preferred to bfill, which would incorrectly fill internal NaNs if there are any.
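
For illustration, here is a minimal sketch of that pitfall, using a hypothetical single-group Series with an internal NaN (not taken from the question's data):

import numpy as np

s = pd.Series([2, np.nan, 2, 3])

# bfill-based flag: the internal NaN in s.shift() is backfilled with a *later*
# value, so the row after the NaN is compared against the future
s.ne(s.shift().bfill())                    # False, True, False, True

# double-shift flag: internal NaNs are left untouched and only the first row
# is forced to False
s.ne(s.shift(-1)).shift(fill_value=False)  # False, True, True, True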

Or using duplicated and where:

my_df['col2_change2'] = (my_df
                         .groupby('col1')['col2']
                         .transform(lambda x: x != x.shift())
                         .where(my_df['col1'].duplicated(), False)
                       )

Output:

  col1  col2        col3  col2_change  col2_change2
0    a     2  2023-02-01         True         False
3    b     5  2023-02-01         True         False
1    a     2  2023-03-01        False         False
4    b     5  2023-03-01        False         False
2    a     3  2023-04-01         True          True
5    b     5  2023-04-01        False         False
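
Note that the rows above keep the col3-sorted order from your sort_values call; purely for display, you could bring the groups back together with:

my_df.sort_values(['col1', 'col3'])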



This should work. Input (I extended your example with an extra row in group a so that a value repeats after a change):

my_df = pd.DataFrame({'col1': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
                      'col2': [2, 3, 4, 4, 5, 5, 5],
                      'col3': [datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1),
                               datetime.date(2023, 5, 1),
                               datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1)]})

In the code I also used shift, but on the grouped my_df. To observe the changes per group, you can run:

# flag rows where col2 differs from the previous row within its col1 group;
# the first row of each group has no predecessor, so it is flagged True
my_df['change'] = my_df['col2'].ne(my_df.groupby('col1')['col2'].shift())

# index by col1 and col3 to view the change flags per group in date order
my_df_idx = my_df.set_index('col1')
my_df_idx.set_index('col3', append=True, inplace=True)
my_df_idx.sort_index(inplace=True)
my_df_idx

Output:

                 col2 change
col1 col3
a    2023-02-01     2   True
     2023-03-01     3   True
     2023-04-01     4   True
     2023-05-01     4  False
b    2023-02-01     5   True
     2023-03-01     5  False
     2023-04-01     5  False

