
I have a set of data and want to get its value_counts result:

import pandas as pd

df = pd.DataFrame(
    [5.01, 5.01, 5.08, 6.1, 5.54, 6.3, 5.56, 5.55, 6.7],
    columns=['val'])
>>> df
    val
0  5.01
1  5.01
2  5.08
3  6.10
4  5.54
5  6.30
6  5.56
7  5.55
8  6.70
>>> df.val.value_counts()
5.01    2
5.08    1
6.10    1
5.54    1
6.30    1
5.56    1
5.55    1
6.70    1
Name: val, dtype: int64

Is there a way to allow a certain tolerance when using value_counts, such as plus or minus 0.01, so that 5.54, 5.55, and 5.56 in the series are calculated as a group? The result I hope is:

[5.54,5.56,5.55] 3
[5.01] 2
[5.08] 1
[6.10] 1
...

3 Answers


Try this code. It sorts the values and walks through them, starting a new group whenever the gap to the previous value exceeds the tolerance:

tolerance = 0.01

sorted_vals = sorted(df['val'])

groups = []
current_group = [sorted_vals[0]]

for value in sorted_vals[1:]:
    # Same group while the gap to the previous value is within tolerance
    if value - current_group[-1] <= tolerance:
        current_group.append(value)
    else:
        groups.append(current_group)
        current_group = [value]

groups.append(current_group)

group_counts = pd.DataFrame({
    'Group': groups,
    'Count': [len(group) for group in groups]
})

print(group_counts)

Output:

                Group  Count
0        [5.01, 5.01]      2
1              [5.08]      1
2  [5.54, 5.55, 5.56]      3
3               [6.1]      1
4               [6.3]      1
5               [6.7]      1

2 Comments

Thank you, it works. Let's see if there is a simpler way to do this directly with pandas.
@SunJar Yeah, this doesn't use any Pandas idioms. You could rewrite it using .diff() and a bit of grouping code.

Try this: it marks a new group wherever consecutive sorted values differ by more than the tolerance, then groups on the cumulative sum of those marks:

mask = df['val'].sort_values().diff().gt(0.01)
result_df = df.groupby(mask.cumsum())['val'].agg([set, 'count'])
print(result_df)
                    set  count
val                           
0                {5.01}      2
1                {5.08}      1
2    {5.55, 5.54, 5.56}      3
3                 {6.1}      1
4                 {6.3}      1
5                 {6.7}      1

1 Comment

This is very similar to a solution I posted. (That's not a complaint, I just wanted to mention it.)

There are two ways to go about this: grouping the sorted elements (which doesn't use .value_counts()) and binning.

Grouping sorted elements

Sort the values, compare each pair (using .diff()), then assign group numbers (using .cumsum()).

Then you can .groupby() and aggregate, getting the unique elements of each group along with the group's size.

tolerance = 0.01
vals_sorted = df['val'].sort_values()
group_numbers = (
    vals_sorted
    .diff()
    .gt(tolerance)
    .cumsum()
    .rename('group_number')
)

vals_sorted.groupby(group_numbers).agg(['unique', 'size'])
                          unique  size
group_number                          
0                         [5.01]     2
1                         [5.08]     1
2             [5.54, 5.55, 5.56]     3
3                          [6.1]     1
4                          [6.3]     1
5                          [6.7]     1
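If you want the largest groups first, as in the expected output in the question, the same result can be sorted by size. A small self-contained sketch of the steps above:

```python
import pandas as pd

df = pd.DataFrame(
    [5.01, 5.01, 5.08, 6.1, 5.54, 6.3, 5.56, 5.55, 6.7],
    columns=['val'])

tolerance = 0.01
vals_sorted = df['val'].sort_values()
# Start a new group whenever the gap to the previous sorted value
# exceeds the tolerance
group_numbers = vals_sorted.diff().gt(tolerance).cumsum()

result = (
    vals_sorted.groupby(group_numbers)
    .agg(['unique', 'size'])
    .sort_values('size', ascending=False)
)
print(result)
```

The first row is then the [5.54, 5.55, 5.56] group with size 3, matching the ordering asked for in the question.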

Binning

Create equal-width bins and pass them to .value_counts(). This is a shortcut for pd.cut().

Lastly, since this is a categorical value count, zeroes are included, so filter them out.

I also sorted the result by the index so it's easier to compare it against the first solution.

import numpy as np

tolerance = 0.01
start = df['val'].min()
stop = df['val'].max()
step = 3 * tolerance
bins = np.arange(start, stop+step, step)

df['val'].value_counts(bins=bins)[lambda s: s > 0].sort_index()
val
(5.0089999999999995, 5.04]    2
(5.07, 5.1]                   1
(5.52, 5.55]                  2
(5.55, 5.58]                  1
(6.09, 6.12]                  1
(6.27, 6.3]                   1
(6.69, 6.72]                  1
Name: count, dtype: int64

The result isn't quite what you want, but it's close. Maybe you'd want to adjust the start value, e.g. start = df['val'].min() - 2*tolerance.
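A sketch of that adjustment (the np.round call is my addition: np.arange accumulates floating-point drift, and snapping the computed edges to clean decimals makes values like 5.56 that sit exactly on a bin edge compare predictably):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [5.01, 5.01, 5.08, 6.1, 5.54, 6.3, 5.56, 5.55, 6.7],
    columns=['val'])

tolerance = 0.01
step = 3 * tolerance
# Shift the start down so 5.54, 5.55, and 5.56 share a bin
start = df['val'].min() - 2 * tolerance
# Pad the top so the maximum value is always covered by the last bin
stop = df['val'].max() + step
# Snap the edges to clean decimals to avoid float drift at bin boundaries
bins = np.round(np.arange(start, stop + step, step), 10)

counts = df['val'].value_counts(bins=bins)
counts = counts[counts > 0].sort_index()
print(counts)
```

With the shifted start, the three close values land in one bin, (5.53, 5.56], with a count of 3, and every other group is counted as before.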

