How to delete row with max/min values

Question

I have a dataframe:

  one    N  th
0   A    5   1
1   Z   17   0
2   A   16   0
3   B    9   1
4   B   17   0
5   B  117   1
6  XC   35   1
7   C   85   0
8  Ce  965   1

I'm looking for a way to keep an alternating 0101 pattern in the third column (th), without a doubled 0 or 1. So I want to delete the row with the min value where there are two repeating 0s in column th, and the row with the max value where there are repeating 1s.

My dataset consists of 1,000,000 rows.

I expect to have dataframe like this:

  one    N  th
0   A    5   1
1   Z   17   0
3   B    9   1
4   B   17   0
6  XC   35   1
7   C   85   0
8  Ce  965   1

What is the fastest way to do this? I mean a vectorized way. My own attempts have not produced a result.

Can you please edit the question and add the expected output for this dataframe?
– Dogbert, Commented Jul 31, 2024 at 14:49
You forgot to post your attempt to solve even part of this problem.
– Scott Hunter, Commented Jul 31, 2024 at 14:52
If you're deleting the duplicate with minimum values, shouldn't you delete row 6 rather than 5?
– Michael Cao, Commented Jul 31, 2024 at 15:25
"i want to delete row with min of values in case if i have two repeating 0 in column th and max values if i have repeating 1". I take that to mean: for duplicates with 0 delete the row with min value for N (i.e. index 2); for duplicates with 1 delete the row with max value for N (i.e. index 5).
– ouroboros1, Commented Jul 31, 2024 at 15:46

mozway · Accepted Answer · 2024-07-31 18:12:07Z

using a custom `groupby.idxmax`

You can swap the sign of "N" where "th" is 1 (so that idxmax selects the minimum "N" for those rows instead of the maximum), then set up a custom grouper (with diff or shift + cumsum) and perform a groupby.idxmax to select the rows to keep:

out = df.loc[df['N'].mul(df['th'].map({0: 1, 1: -1}))
             .groupby(df['th'].ne(df['th'].shift()).cumsum())
             .idxmax()]

Variant with a different method to swap the sign and to compute the group:

out = df.loc[df['N'].mask(df['th'].eq(1), -df['N'])
             .groupby(df['th'].diff().ne(0).cumsum())
             .idxmax()]

Output:

  one    N  th
0   A    5   1
1   Z   17   0
3   B    9   1
4   B   17   0
6  XC   35   1
7   C   85   0
8  Ce  965   1

Intermediates:

  one    N  th  swap  group max
0   A    5   1    -5      1   X
1   Z   17   0    17      2   X
2   A   16   0    16      2    
3   B    9   1    -9      3   X
4   B   17   0    17      4   X
5   B  117   1  -117      5    
6  XC   35   1   -35      5   X
7   C   85   0    85      6   X
8  Ce  965   1  -965      7   X
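For anyone who wants to reproduce the intermediates, here is a minimal self-contained sketch that rebuilds the question's sample frame and computes each step separately. The variable names `swap`, `group` and `keep` are only illustrative (they mirror the column headers in the table above), and the `kept` column is a boolean stand-in for the X marks.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "one": ["A", "Z", "A", "B", "B", "B", "XC", "C", "Ce"],
    "N": [5, 17, 16, 9, 17, 117, 35, 85, 965],
    "th": [1, 0, 0, 1, 0, 1, 1, 0, 1],
})

# Negate N where th == 1, so that idxmax picks the smallest N in those runs.
swap = df["N"].mask(df["th"].eq(1), -df["N"])

# A new group starts whenever th differs from the previous row.
group = df["th"].ne(df["th"].shift()).cumsum()

# Index label of the row to keep in each run of identical th values.
keep = swap.groupby(group).idxmax()

out = df.loc[keep]

# Show the intermediates side by side, then the filtered frame.
print(df.assign(swap=swap, group=group, kept=df.index.isin(keep)))
print(out)
```

Because `idxmax` returns the first occurrence, ties within a run resolve to the earlier row.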

using boolean masks

The above code works for an arbitrary number of consecutive 0s or 1s. If you know that you never have more than 2 identical values in a row, you could also use boolean indexing, which should be significantly faster:

# does the value have higher precedence than the previous one?
D = df['N'].mask(df['th'].eq(1), -df['N']).diff()

# is the th different from the previous?
G = df['th'].ne(df['th'].shift(fill_value=-1))

# rule for the bottom row
m1 = D.gt(0) | G

# rule for the top row
# same rule as above but shifted up
# D is inverted
# comparison is not strict in case of equality
m2 = ( D.le(0).shift(-1, fill_value=True)
      | G.shift(-1, fill_value=True)
     )

# keep rows of interest
out = df.loc[m1&m2]

Output:

  one    N  th
0   A    5   1
1   Z   17   0
3   B    9   1
4   B   17   0
6  XC   35   1
7   C   85   0
8  Ce  965   1

Intermediates:

  one    N  th       D      G     m1     m2  m1&m2
0   A    5   1     NaN   True   True   True   True
1   Z   17   0    22.0   True   True   True   True
2   A   16   0    -1.0  False  False   True  False
3   B    9   1   -25.0   True   True   True   True
4   B   17   0    26.0   True   True   True   True
5   B  117   1  -134.0   True   True  False  False
6  XC   35   1    82.0  False   True   True   True
7   C   85   0   120.0   True   True   True   True
8  Ce  965   1 -1050.0   True   True   True   True

More complex example with equal values:

   one    N  th       D      G     m1     m2  m1&m2
0    A    5   1     NaN   True   True   True   True
1    Z   17   0    22.0   True   True   True   True
2    A   16   0    -1.0  False  False   True  False
3    B    9   1   -25.0   True   True   True   True
4    B   17   0    26.0   True   True   True   True
5    B  117   1  -134.0   True   True  False  False
6   XC   35   1    82.0  False   True   True   True
7    C   85   0   120.0   True   True   True   True
8   Ce  965   1 -1050.0   True   True   True   True
9    u  123   0  1088.0   True   True   True   True # because of D.le(0)
10   v  123   0     0.0  False  False   True  False # because of D.gt(0)

NB. in case of equality, it is possible to select the first/second row or both or none, depending on the operator used (D.le(0), D.lt(0), D.gt(0), D.ge(0)).
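Since the mask approach silently assumes that no run is longer than 2, it may be worth checking that before relying on it. The sketch below wraps the same masks in a function with such a guard. The names `max_run_length` and `keep_alternating_pairs` are made up for this example, and raising a `ValueError` is just one possible way to handle longer runs.

```python
import pandas as pd

def max_run_length(th: pd.Series) -> int:
    """Length of the longest run of identical consecutive values."""
    if th.empty:
        return 0
    run_id = th.ne(th.shift()).cumsum()
    return int(run_id.value_counts().max())

def keep_alternating_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Boolean-mask filter; only valid when no run is longer than 2."""
    if max_run_length(df["th"]) > 2:
        raise ValueError("runs longer than 2: use the groupby approach instead")
    # Signed N: negated where th == 1, so "bigger is better" in both cases.
    D = df["N"].mask(df["th"].eq(1), -df["N"]).diff()
    # True where th differs from the previous row (start of a run).
    G = df["th"].ne(df["th"].shift(fill_value=-1))
    m1 = D.gt(0) | G                      # rule for the bottom row of a pair
    m2 = (D.le(0).shift(-1, fill_value=True)
          | G.shift(-1, fill_value=True))  # rule for the top row of a pair
    return df.loc[m1 & m2]

# The 11-row example with equal values from above.
df = pd.DataFrame({
    "one": ["A", "Z", "A", "B", "B", "B", "XC", "C", "Ce", "u", "v"],
    "N": [5, 17, 16, 9, 17, 117, 35, 85, 965, 123, 123],
    "th": [1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0],
})
print(keep_alternating_pairs(df))
```

With `D.le(0)` for the top row and `D.gt(0)` for the bottom row, a tie keeps the first row of the pair (index 9 here).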

timings

Although limited to at most 2 consecutive identical "th" values, the boolean mask approach is ~4-5x faster. Timed on 1M rows:

# groupby + idxmax
96.4 ms ± 6.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# boolean masks
22.2 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
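The test data behind these timings isn't shown, so here is a rough sketch of how a comparable benchmark could be set up. Everything about the data generation is an assumption: `make_frame` is a hypothetical helper that builds random runs of length 1 or 2 (so both approaches are valid), and the numbers it prints will depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def make_frame(n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Random frame whose 'th' column alternates in runs of length 1 or 2."""
    rng = np.random.default_rng(seed)
    lengths = rng.integers(1, 3, size=n_rows)      # each run has 1 or 2 rows
    values = np.arange(n_rows) % 2                 # 0, 1, 0, 1, ...
    th = np.repeat(values, lengths)[:n_rows]
    return pd.DataFrame({"N": rng.integers(0, 1000, size=n_rows), "th": th})

def by_groupby(df: pd.DataFrame) -> pd.DataFrame:
    signed = df["N"].mask(df["th"].eq(1), -df["N"])
    group = df["th"].diff().ne(0).cumsum()
    return df.loc[signed.groupby(group).idxmax()]

def by_masks(df: pd.DataFrame) -> pd.DataFrame:
    D = df["N"].mask(df["th"].eq(1), -df["N"]).diff()
    G = df["th"].ne(df["th"].shift(fill_value=-1))
    m1 = D.gt(0) | G
    m2 = D.le(0).shift(-1, fill_value=True) | G.shift(-1, fill_value=True)
    return df.loc[m1 & m2]

# Raise n_rows to 1_000_000 to match the scale used in the timings above.
bench_df = make_frame(100_000)

# Both approaches should select the same rows on this kind of data.
assert by_groupby(bench_df).index.equals(by_masks(bench_df).index)

for fn in (by_groupby, by_masks):
    best = min(timeit.repeat(lambda: fn(bench_df), number=3, repeat=3)) / 3
    print(f"{fn.__name__}: {best * 1e3:.1f} ms per call")
```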

Soudipta Dutta · Accepted Answer · 2024-09-13 15:02:03Z

What you want is the OR of the following two conditions:

aa = (df['th'] == 0) & (df['N'] == df['max_n'])
bb = (df['th'] == 1) & (df['N'] == df['min_n'])
res = (aa | bb)

Code :

import pandas as pd
import numpy as np

data = {
    'one': ['A', 'Z', 'A', 'B', 'B', 'B', 'XC', 'C', 'Ce'],
    'N': [5, 17, 16, 9, 17, 117, 35, 85, 965],
    'th': [1, 0, 0, 1, 0, 1, 1, 0, 1]
}
df = pd.DataFrame(data)

df['group'] = df['th'].ne(df['th'].shift()).cumsum()
"""
  one    N  th  group
0   A    5   1      1
1   Z   17   0      2
2   A   16   0      2
3   B    9   1      3
4   B   17   0      4
5   B  117   1      5
6  XC   35   1      5
7   C   85   0      6
8  Ce  965   1      7
"""
df['max_n'] = df.groupby(['th','group'])['N'].transform('max')
df['min_n'] = df.groupby(['th','group'])['N'].transform('min')
"""
  one    N  th  group  max_n  min_n
0   A    5   1      1      5      5
1   Z   17   0      2     17     16
2   A   16   0      2     17     16
3   B    9   1      3      9      9
4   B   17   0      4     17     17
5   B  117   1      5    117     35
6  XC   35   1      5    117     35
7   C   85   0      6     85     85
8  Ce  965   1      7    965    965
"""

aa = (df['th'] == 0) & (df['N'] == df['max_n'])
bb = (df['th'] == 1) & (df['N'] == df['min_n'])
df['keep'] = (aa | bb)
"""
 one    N  th  group  max_n  min_n   keep
0   A    5   1      1      5      5   True
1   Z   17   0      2     17     16   True
2   A   16   0      2     17     16  False
3   B    9   1      3      9      9   True
4   B   17   0      4     17     17   True
5   B  117   1      5    117     35  False
6  XC   35   1      5    117     35   True
7   C   85   0      6     85     85   True
8  Ce  965   1      7    965    965   True
"""
filtered_df = df[df['keep']]
"""
  one    N  th  group  max_n  min_n  keep
0   A    5   1      1      5      5  True
1   Z   17   0      2     17     16  True
3   B    9   1      3      9      9  True
4   B   17   0      4     17     17  True
6  XC   35   1      5    117     35  True
7   C   85   0      6     85     85  True
8  Ce  965   1      7    965    965  True
"""

One-liner (just for fun):

res1 = (
    df.assign(group=df['th'].ne(df['th'].shift()).cumsum())
    .assign(
        max_n=lambda x: x.groupby(['th', 'group'])['N'].transform('max'),
        min_n=lambda x: x.groupby(['th', 'group'])['N'].transform('min'),
        keep=lambda x: ((x['th'] == 0) & (x['N'] == x['max_n']))
        | ((x['th'] == 1) & (x['N'] == x['min_n'])),
    )
    .query('keep')
)

print(res1)

Use query (memory efficient):

df['group'] = df['th'].ne(df['th'].shift()).cumsum()

filtered_df4 = (
    df.assign(
        max_n=df.groupby(['th', 'group'], sort=False)['N'].transform('max'),
        min_n=df.groupby(['th', 'group'], sort=False)['N'].transform('min'),
    )
    .query("(th == 1 and N == min_n) or (th == 0 and N == max_n)")
)

print(filtered_df4)
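One thing the equality test does not settle is ties: if two rows in the same run share the max (or min) N, both pass the filter and the run is not reduced to one row. A possible sketch that keeps only the first matching row per run, without adding helper columns to the frame, is below. `filter_runs` is a hypothetical name, and keeping the first row on ties is an assumption, not something the question specifies.

```python
import pandas as pd

def filter_runs(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per run of th: max N for th == 0, min N for th == 1."""
    group = df["th"].ne(df["th"].shift()).cumsum()
    by_run = df["N"].groupby(group)
    # Per-row target value: run max where th == 0, run min otherwise.
    target = by_run.transform("max").where(df["th"].eq(0), by_run.transform("min"))
    keep = df["N"].eq(target)
    # On ties, drop every matching row after the first one in the same run.
    keep &= ~group.where(keep).duplicated()
    return df[keep]

df = pd.DataFrame({
    "one": ["A", "Z", "A", "B", "B", "B", "XC", "C", "Ce"],
    "N": [5, 17, 16, 9, 17, 117, 35, 85, 965],
    "th": [1, 0, 0, 1, 0, 1, 1, 0, 1],
})
print(filter_runs(df))
```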

Michael Cao · Accepted Answer · 2024-07-31 15:31:57Z

1. Create a group for each sequence of 0's and 1's by using shift to identify the start of a new group.
2. Do a groupby transform to identify the maximum of each group.
3. Filter down so that you only accept rows with N = max.

df['group'] = (df['th'] != df['th'].shift(1)).cumsum()
df['max'] = df.groupby('group')['N'].transform('max')

df2 = df.loc[df['N'] == df['max']][['one', 'N', 'th']]

One liner version if you don't want to create the intermediate columns:

df.loc[df['N'] == df.groupby((df['th'] != df['th'].shift(1)).cumsum())['N'].transform('max')]
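Note that on the question's sample, taking the maximum of every run keeps index 5 (N = 117) rather than index 6, which differs from the expected output for the run of 1s. If the intent is "keep the max for runs of 0, keep the min for runs of 1" (the reading suggested in the comments), a small variation along these lines should work. The names `run_max`, `run_min` and `target` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "one": ["A", "Z", "A", "B", "B", "B", "XC", "C", "Ce"],
    "N": [5, 17, 16, 9, 17, 117, 35, 85, 965],
    "th": [1, 0, 0, 1, 0, 1, 1, 0, 1],
})

# Same grouping as above: a new group starts whenever th changes.
group = (df["th"] != df["th"].shift(1)).cumsum()

run_max = df.groupby(group)["N"].transform("max")
run_min = df.groupby(group)["N"].transform("min")

# Max-only filter, as in the answer above.
all_max = df.loc[df["N"] == run_max, ["one", "N", "th"]]

# Variant: compare against the run max where th == 0, the run min where th == 1.
target = run_max.where(df["th"] == 0, run_min)
df2 = df.loc[df["N"] == target, ["one", "N", "th"]]

print(all_max.index.tolist())
print(df2.index.tolist())
```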

Ben Vaughan · Accepted Answer · 2024-07-31 15:20:31Z

Wouldn't it be easy to just use a variable? In a loop, set it to the first value, then compare it to the next one. If they are the same and the value is 0, delete the row with the lowest N; if it is 1, delete the row with the highest N. If they are different, set the variable to the next value and continue the loop.
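For illustration, a loop of this kind can be done in a single pass by walking over runs of equal th values. The sketch below uses `itertools.groupby` for the run detection rather than a hand-maintained variable, so it is a variation on the idea, not a literal transcription. `alternate_single_pass` is a made-up name, and a pure Python loop like this will likely be far slower than a vectorized approach on 1M rows.

```python
from itertools import groupby

import pandas as pd

def alternate_single_pass(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per run of th: highest N for 0-runs, lowest N for 1-runs."""
    keep = []
    rows = zip(df.index, df["th"], df["N"])
    # groupby yields one (th value, rows) pair per run of consecutive equal th.
    for th, run in groupby(rows, key=lambda r: r[1]):
        pick = max if th == 0 else min
        # max/min return the first best row, so ties keep the earlier one.
        keep.append(pick(run, key=lambda r: r[2])[0])
    return df.loc[keep]

df = pd.DataFrame({
    "one": ["A", "Z", "A", "B", "B", "B", "XC", "C", "Ce"],
    "N": [5, 17, 16, 9, 17, 117, 35, 85, 965],
    "th": [1, 0, 0, 1, 0, 1, 1, 0, 1],
})
print(alternate_single_pass(df))
```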

It's better to use the dataframe to do the work. But if you really wanted to, you could create a loop. I'll add it as a solution, but it certainly isn't the best way to do this.
– michaelt

michaelt · Accepted Answer · 2024-07-31 17:57:02Z

Adding this because of one of the comments. The iterative approach shown below isn't really a technique you would want to use, as it doesn't leverage pandas. I'm adding it for completeness; compared to the other solutions, it's less succinct.

import pandas as pd

data = [
    [0, 'A', 5, 1],
    [1, 'Z', 17, 0],
    [2, 'A', 16, 0],
    [3, 'B', 9, 1],
    [4, 'B', 17, 0],
    [5, 'B', 117, 1],
    [6, 'XC', 35, 1],
    [7, 'C', 85, 0],
    [8, 'Ce', 965, 1]
]

df = pd.DataFrame(data, columns=['id', 'one', 'N', 'th'])

def ensure_alternating_th(df):
    while True:
        repeats_found = False
        idx_to_remove = []

        for idx in range(1, len(df)):
            # check for repeated values in 'th' column
            if df.at[idx, 'th'] == df.at[idx - 1, 'th']:
                repeats_found = True
                # th == 0: drop the row with the smaller 'N';
                # th == 1: drop the row with the larger 'N'
                pair = df.iloc[[idx - 1, idx]]['N']
                if df.at[idx, 'th'] == 0:
                    idx_to_remove.append(pair.idxmin())
                else:
                    idx_to_remove.append(pair.idxmax())

        if not repeats_found:
            break

        # remove identified rows and reset index
        df = df.drop(idx_to_remove).reset_index(drop=True)

    return df

df_cleaned = ensure_alternating_th(df)

"""
# Returns
   id one    N  th
0   0   A    5   1
1   1   Z   17   0
2   3   B    9   1
3   4   B   17   0
4   6  XC   35   1
5   7   C   85   0
6   8  Ce  965   1
"""
