3

I have a DataFrame:

  one    N  th
0   A    5   1
1   Z   17   0
2   A   16   0
3   B    9   1
4   B   17   0
5   B  117   1
6  XC   35   1
7   C   85   0
8  Ce  965   1

I'm looking for a way to keep a strictly alternating 0, 1, 0, 1 pattern in column th, with no consecutive repeated 0s or 1s. When th has two consecutive 0s, I want to delete the row with the smaller N; when it has two consecutive 1s, I want to delete the row with the larger N.

My dataset consists of 1,000,000 rows.

I expect to get a DataFrame like this:

  one    N  th
0   A    5   1
1   Z   17   0
3   B    9   1
4   B   17   0
6  XC   35   1
7   C   85   0
8  Ce  965   1

What is the fastest way to do this? I mean a vectorized way. My attempts so far have not produced a result.

6
  • 2
    Can you please edit the question and add the expected output for this dataframe? Commented Jul 31, 2024 at 14:49
  • 1
    You forgot to post your attempt to solve even part of this problem. Commented Jul 31, 2024 at 14:52
  • 1
    If you're deleting the duplicate with minimum values, shouldn't you delete row 6 rather than 5? Commented Jul 31, 2024 at 15:25
  • 1
    OK, got it, you take the min/max depending on 0/1 Commented Jul 31, 2024 at 15:45
  • 1
    "i want to delete row with min of values in case if i have two repeating 0 in column th and max values if i have repeating 1". I take that to mean: for duplicates with 0 delete the row with min value for N (i.e. index 2); for duplicates with 1 delete the row with max value for N (i.e. index 5). Commented Jul 31, 2024 at 15:46

5 Answers

3

using a custom groupby.idxmax

You can swap the sign if "th" is 1 (to get the max instead of min), then set up a custom grouper (with diff or shift + cumsum) and perform a groupby.idxmax to select the rows to keep:

out = df.loc[df['N'].mul(df['th'].map({0: 1, 1: -1}))
             .groupby(df['th'].ne(df['th'].shift()).cumsum())
             .idxmax()]

Variant with a different method to swap the sign and to compute the group:

out = df.loc[df['N'].mask(df['th'].eq(1), -df['N'])
             .groupby(df['th'].diff().ne(0).cumsum())
             .idxmax()]

Output:

  one    N  th
0   A    5   1
1   Z   17   0
3   B    9   1
4   B   17   0
6  XC   35   1
7   C   85   0
8  Ce  965   1

Intermediates:

  one    N  th  swap  group max
0   A    5   1    -5      1   X
1   Z   17   0    17      2   X
2   A   16   0    16      2    
3   B    9   1    -9      3   X
4   B   17   0    17      4   X
5   B  117   1  -117      5    
6  XC   35   1   -35      5   X
7   C   85   0    85      6   X
8  Ce  965   1  -965      7   X
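
For reference, the intermediate columns above can be reproduced with a short self-contained script. This is a minimal sketch that rebuilds the sample frame from the question; the variable names swap, group and keep are only labels chosen for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'one': ['A', 'Z', 'A', 'B', 'B', 'B', 'XC', 'C', 'Ce'],
    'N': [5, 17, 16, 9, 17, 117, 35, 85, 965],
    'th': [1, 0, 0, 1, 0, 1, 1, 0, 1],
})

# negate N where th == 1 so that "keep the min" becomes "keep the max"
swap = df['N'].mul(df['th'].map({0: 1, 1: -1}))
# start a new group id every time th changes
group = df['th'].ne(df['th'].shift()).cumsum()
# index label of the row to keep in each group
keep = swap.groupby(group).idxmax()

out = df.loc[keep]
print(df.assign(swap=swap, group=group, kept=df.index.isin(keep)))
print(out)
```

Because each run of identical th values forms its own group, this works for runs of any length.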

using boolean masks

The above code works for an arbitrary number of consecutive 0s or 1s. If you know that a value is never repeated more than twice in a row (runs of at most 2), you could also use boolean indexing, which should be significantly faster:

# is the (sign-swapped) value higher than the previous one?
D = df['N'].mask(df['th'].eq(1), -df['N']).diff()

# is the th different from the previous?
G = df['th'].ne(df['th'].shift(fill_value=-1))

# rule for the bottom row
m1 = D.gt(0) | G

# rule for the top row
# same rule as above but shifted up
# D is inverted
# comparison is not strict in case of equality
m2 = ( D.le(0).shift(-1, fill_value=True)
      | G.shift(-1, fill_value=True)
     )

# keep rows of interest
out = df.loc[m1&m2]

Output:

  one    N  th
0   A    5   1
1   Z   17   0
3   B    9   1
4   B   17   0
6  XC   35   1
7   C   85   0
8  Ce  965   1

Intermediates:

  one    N  th       D      G     m1     m2  m1&m2
0   A    5   1     NaN   True   True   True   True
1   Z   17   0    22.0   True   True   True   True
2   A   16   0    -1.0  False  False   True  False
3   B    9   1   -25.0   True   True   True   True
4   B   17   0    26.0   True   True   True   True
5   B  117   1  -134.0   True   True  False  False
6  XC   35   1    82.0  False   True   True   True
7   C   85   0   120.0   True   True   True   True
8  Ce  965   1 -1050.0   True   True   True   True
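
To make the "at most 2 in a row" restriction concrete, here is a small sketch that wraps the masks in a function (filter_masks is a hypothetical name) and runs it on the sample data and on a run of three 0s. The second frame is made up for the illustration.

```python
import pandas as pd

def filter_masks(df):
    """Boolean-mask filter; assumes runs of identical 'th' have length <= 2."""
    D = df['N'].mask(df['th'].eq(1), -df['N']).diff()
    G = df['th'].ne(df['th'].shift(fill_value=-1))
    m1 = D.gt(0) | G
    m2 = D.le(0).shift(-1, fill_value=True) | G.shift(-1, fill_value=True)
    return df.loc[m1 & m2]

sample = pd.DataFrame({
    'one': ['A', 'Z', 'A', 'B', 'B', 'B', 'XC', 'C', 'Ce'],
    'N': [5, 17, 16, 9, 17, 117, 35, 85, 965],
    'th': [1, 0, 0, 1, 0, 1, 1, 0, 1],
})
print(filter_masks(sample).index.tolist())  # [0, 1, 3, 4, 6, 7, 8]

# a run of three 0s breaks the assumption: two rows of the run survive
run3 = pd.DataFrame({'N': [3, 1, 2], 'th': [0, 0, 0]})
print(filter_masks(run3).index.tolist())    # [0, 2]
```

On the run of three, the masks only compare adjacent rows, so rows 0 and 2 both pass; the groupby approach would keep only row 0 here.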

More complex example with equal values:

   one    N  th       D      G     m1     m2  m1&m2
0    A    5   1     NaN   True   True   True   True
1    Z   17   0    22.0   True   True   True   True
2    A   16   0    -1.0  False  False   True  False
3    B    9   1   -25.0   True   True   True   True
4    B   17   0    26.0   True   True   True   True
5    B  117   1  -134.0   True   True  False  False
6   XC   35   1    82.0  False   True   True   True
7    C   85   0   120.0   True   True   True   True
8   Ce  965   1 -1050.0   True   True   True   True
9    u  123   0  1088.0   True   True   True   True # because of D.le(0)
10   v  123   0     0.0  False  False   True  False # because of D.gt(0)

NB. in case of equality, it is possible to select the first/second row or both or none, depending on the operator used (D.le(0), D.lt(0), D.gt(0), D.ge(0)).
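
The effect of the operator choice on ties can be checked with a small sketch. The function keep_rows and its bottom_op/top_op parameters are hypothetical helpers written for this illustration, and the three-row frame is made up so that the last two rows tie.

```python
import pandas as pd

def keep_rows(df, bottom_op='gt', top_op='le'):
    """Boolean-mask filter with configurable comparison operators."""
    D = df['N'].mask(df['th'].eq(1), -df['N']).diff()
    G = df['th'].ne(df['th'].shift(fill_value=-1))
    # rule for the bottom row of a pair
    m1 = getattr(D, bottom_op)(0) | G
    # rule for the top row of a pair (same comparison, shifted up)
    m2 = getattr(D, top_op)(0).shift(-1, fill_value=True) | G.shift(-1, fill_value=True)
    return df.loc[m1 & m2]

ties = pd.DataFrame({'one': ['t', 'u', 'v'], 'N': [5, 123, 123], 'th': [1, 0, 0]})

for bottom_op, top_op in [('gt', 'le'), ('ge', 'lt'), ('ge', 'le'), ('gt', 'lt')]:
    print(bottom_op, top_op, keep_rows(ties, bottom_op, top_op).index.tolist())
# gt le [0, 1]     -> first of the tied rows
# ge lt [0, 2]     -> second of the tied rows
# ge le [0, 1, 2]  -> both
# gt lt [0]        -> none
```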

timings

Although limited to maximum 2 consecutive "th", the boolean mask approach is ~4-5x faster. Timed on 1M rows:

# groupby + idxmax
96.4 ms ± 6.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# boolean masks
22.2 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
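
A rough way to rerun this comparison is sketched below. The synthetic data is an assumption (it is not the data used for the timings above): values alternate 0/1 and each one is repeated once or twice, so no run is longer than 2. N_BASE, with_groupby and with_masks are names chosen for the sketch, and the numbers it prints will depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N_BASE = 200_000  # hypothetical size; raise it to approach the 1M rows timed above

# alternating 0/1, each value repeated 1 or 2 times (runs never exceed length 2)
rng = np.random.default_rng(0)
th = np.repeat(np.arange(N_BASE) % 2, rng.integers(1, 3, size=N_BASE))
df = pd.DataFrame({'N': rng.integers(0, 1000, size=len(th)), 'th': th})

def with_groupby(df):
    signed = df['N'].mask(df['th'].eq(1), -df['N'])
    group = df['th'].diff().ne(0).cumsum()
    return df.loc[signed.groupby(group).idxmax()]

def with_masks(df):
    D = df['N'].mask(df['th'].eq(1), -df['N']).diff()
    G = df['th'].ne(df['th'].shift(fill_value=-1))
    m1 = D.gt(0) | G
    m2 = D.le(0).shift(-1, fill_value=True) | G.shift(-1, fill_value=True)
    return df.loc[m1 & m2]

# both approaches should select the same rows on this kind of input
assert with_groupby(df).index.equals(with_masks(df).index)

for func in (with_groupby, with_masks):
    seconds = timeit.timeit(lambda: func(df), number=5) / 5
    print(f'{func.__name__}: {seconds * 1000:.1f} ms per call')
```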

2

What you want is the OR of the following two cases (max_n and min_n are the per-group maximum and minimum of N, computed in the code below):

aa = (df['th'] == 0) & (df['N'] == df['max_n'])
bb = (df['th'] == 1) & (df['N'] == df['min_n'])
res = (aa | bb)

Code:

import pandas as pd
import numpy as np

data = {
    'one': ['A', 'Z', 'A', 'B', 'B', 'B', 'XC', 'C', 'Ce'],
    'N': [5, 17, 16, 9, 17, 117, 35, 85, 965],
    'th': [1, 0, 0, 1, 0, 1, 1, 0, 1]
}
df = pd.DataFrame(data)

df['group'] = df['th'].ne(df['th'].shift()).cumsum()
"""
  one    N  th  group
0   A    5   1      1
1   Z   17   0      2
2   A   16   0      2
3   B    9   1      3
4   B   17   0      4
5   B  117   1      5
6  XC   35   1      5
7   C   85   0      6
8  Ce  965   1      7
"""
df['max_n'] = df.groupby(['th','group'])['N'].transform('max')
df['min_n'] = df.groupby(['th','group'])['N'].transform('min')
"""
  one    N  th  group  max_n  min_n
0   A    5   1      1      5      5
1   Z   17   0      2     17     16
2   A   16   0      2     17     16
3   B    9   1      3      9      9
4   B   17   0      4     17     17
5   B  117   1      5    117     35
6  XC   35   1      5    117     35
7   C   85   0      6     85     85
8  Ce  965   1      7    965    965
"""

aa = (df['th'] == 0) & (df['N'] == df['max_n'])
bb = (df['th'] == 1) & (df['N'] == df['min_n'])
df['keep'] = (aa | bb)
"""
  one    N  th  group  max_n  min_n   keep
0   A    5   1      1      5      5   True
1   Z   17   0      2     17     16   True
2   A   16   0      2     17     16  False
3   B    9   1      3      9      9   True
4   B   17   0      4     17     17   True
5   B  117   1      5    117     35  False
6  XC   35   1      5    117     35   True
7   C   85   0      6     85     85   True
8  Ce  965   1      7    965    965   True
"""
filtered_df = df[df['keep']]
"""
  one    N  th  group  max_n  min_n  keep
0   A    5   1      1      5      5  True
1   Z   17   0      2     17     16  True
3   B    9   1      3      9      9  True
4   B   17   0      4     17     17  True
6  XC   35   1      5    117     35  True
7   C   85   0      6     85     85  True
8  Ce  965   1      7    965    965  True
"""

One-liner (just for fun):

res1 = (
    df.assign(group=df['th'].ne(df['th'].shift()).cumsum())
    .assign(
        max_n=lambda x: x.groupby(['th', 'group'])['N'].transform('max'),
        min_n=lambda x: x.groupby(['th', 'group'])['N'].transform('min'),
        keep=lambda x: ((x['th'] == 0) & (x['N'] == x['max_n']))
                       | ((x['th'] == 1) & (x['N'] == x['min_n'])),
    )
    .query('keep')
)

print(res1)

Using query (memory efficient):

df['group'] = df['th'].ne(df['th'].shift()).cumsum()


filtered_df4 = df.assign(max_n=df.groupby(['th', 'group'], sort=False)['N'].transform('max'),
                         min_n=df.groupby(['th', 'group'], sort=False)['N'].transform('min'))\
                .query("(th == 1 and N == min_n) or (th == 0 and N == max_n)")

print(filtered_df4)


1
  1. Create a group for each run of consecutive 0s or 1s, using shift to detect where a new run starts.
  2. Flip the sign of N where th is 1, so that the row to keep is always the maximum of its group (largest N for 0s, smallest N for 1s).
  3. Do a groupby transform to find that maximum, and keep only the rows that match it.

df['group'] = (df['th'] != df['th'].shift(1)).cumsum()
df['key'] = df['N'].where(df['th'] == 0, -df['N'])
df['max'] = df.groupby('group')['key'].transform('max')

df2 = df.loc[df['key'] == df['max'], ['one', 'N', 'th']]

Shorter version if you don't want to create the intermediate columns:

key = df['N'].where(df['th'] == 0, -df['N'])
df.loc[key == key.groupby((df['th'] != df['th'].shift(1)).cumsum()).transform('max')]


0

Wouldn't it be easy to just keep a variable and loop over the rows? Set it to the first row's th value, then compare it to the next row's. If they are the same and the value is 0, delete the row with the lower N; if it is 1, delete the row with the higher N. If they differ, set the variable to the next value and continue the loop.

1 Comment

It's better to use the dataframe to do the work. But if you really wanted to, you could create a loop. I'll add it as a solution, but it certainly isn't the best way to do this.
0

Adding this in response to one of the comments. The iterative approach shown below isn't really a technique you would want to use, as it doesn't leverage pandas. I'm including it for completeness; compared with the other solutions, it is less succinct.

import pandas as pd

data = [
    [0, 'A', 5, 1],
    [1, 'Z', 17, 0],
    [2, 'A', 16, 0],
    [3, 'B', 9, 1],
    [4, 'B', 17, 0],
    [5, 'B', 117, 1],
    [6, 'XC', 35, 1],
    [7, 'C', 85, 0],
    [8, 'Ce', 965, 1]
]

df = pd.DataFrame(data, columns=['id', 'one', 'N', 'th'])

def ensure_alternating_th(df):
    while True:
        repeats_found = False
        idx_to_remove = []

        for idx in range(1, len(df)):
            # check for repeated values in 'th' column
            if df.at[idx, 'th'] == df.at[idx - 1, 'th']:
                repeats_found = True
                # the two adjacent rows that share the same 'th'
                pair = df.iloc[[idx - 1, idx]]['N']
                # repeated 0s: drop the smaller 'N'; repeated 1s: drop the larger 'N'
                idx_to_remove.append(pair.idxmin() if df.at[idx, 'th'] == 0 else pair.idxmax())

        if not repeats_found:
            break

        # remove identified rows and reset index
        df = df.drop(idx_to_remove).reset_index(drop=True)

    return df

df_cleaned = ensure_alternating_th(df)

"""
# Returns
   id one    N  th
0   0   A    5   1
1   1   Z   17   0
2   3   B    9   1
3   4   B   17   0
4   6  XC   35   1
5   7   C   85   0
6   8  Ce  965   1
"""
