
I have a set of data and want to get its value_counts result:

import pandas as pd

df = pd.DataFrame(
    [5.01, 5.01, 5.08, 6.1, 5.54, 6.3, 5.56, 5.55, 6.7],
    columns=['val'])
>>> df
    val
0  5.01
1  5.01
2  5.08
3  6.10
4  5.54
5  6.30
6  5.56
7  5.55
8  6.70
>>> df.val.value_counts()
5.01    2
5.08    1
6.10    1
5.54    1
6.30    1
5.56    1
5.55    1
6.70    1
Name: val, dtype: int64

Is there a way to allow a certain tolerance when using value_counts, such as plus or minus 0.01, so that 5.54, 5.55, and 5.56 in the series are calculated as a group? The result I hope is:

[5.54,5.56,5.55] 3
[5.01] 2
[5.08] 1
[6.10] 1
...

3 Answers


Try this code. It sorts the values and walks through them, starting a new group whenever the gap to the previous value exceeds the tolerance:

tolerance = 0.01

sorted_vals = sorted(df['val'])

groups = []
current_group = [sorted_vals[0]]

for value in sorted_vals[1:]:
    # Same group while the gap to the previous value is within tolerance
    if value - current_group[-1] <= tolerance:
        current_group.append(value)
    else:
        groups.append(current_group)
        current_group = [value]

groups.append(current_group)

group_counts = pd.DataFrame({
    'Group': groups,
    'Count': [len(group) for group in groups]
})

print(group_counts)

Output:

                Group  Count
0        [5.01, 5.01]      2
1              [5.08]      1
2  [5.54, 5.55, 5.56]      3
3               [6.1]      1
4               [6.3]      1
5               [6.7]      1

2 Comments

Thank you, it works. Let's see if there is a simpler way to do this directly with pandas.
@SunJar Yeah, this doesn't use any Pandas idioms. You could rewrite it using .diff() and a bit of grouping code.

Try this: it marks a new group wherever consecutive sorted values differ by more than the tolerance, then groups on the cumulative sum of those marks:

mask = df['val'].sort_values().diff().gt(0.01)
result_df = df.groupby(mask.cumsum())['val'].agg([set, 'count'])
print(result_df)
                    set  count
val                           
0                {5.01}      2
1                {5.08}      1
2    {5.55, 5.54, 5.56}      3
3                 {6.1}      1
4                 {6.3}      1
5                 {6.7}      1

1 Comment

This is very similar to a solution I posted. (That's not a complaint, I just wanted to mention it.)

There are two ways to go about this: grouping the sorted elements (which doesn't use .value_counts()) and binning.

Grouping sorted elements

Sort the values, compare each pair (using .diff()), then assign group numbers (using .cumsum()).

Then you can .groupby() and aggregate, getting the unique elements of each group along with the group's size.

tolerance = 0.01
vals_sorted = df['val'].sort_values()
group_numbers = (
    vals_sorted
    .diff()
    .gt(tolerance)
    .cumsum()
    .rename('group_number')
)

vals_sorted.groupby(group_numbers).agg(['unique', 'size'])
                          unique  size
group_number                          
0                         [5.01]     2
1                         [5.08]     1
2             [5.54, 5.55, 5.56]     3
3                          [6.1]     1
4                          [6.3]     1
5                          [6.7]     1
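If you want the largest groups first, as in the expected output in the question, the same result can be sorted by size. A small self-contained sketch of the steps above:

```python
import pandas as pd

df = pd.DataFrame(
    [5.01, 5.01, 5.08, 6.1, 5.54, 6.3, 5.56, 5.55, 6.7],
    columns=['val'])

tolerance = 0.01
vals_sorted = df['val'].sort_values()
# Start a new group whenever the gap to the previous sorted value
# exceeds the tolerance
group_numbers = vals_sorted.diff().gt(tolerance).cumsum()

result = (
    vals_sorted.groupby(group_numbers)
    .agg(['unique', 'size'])
    .sort_values('size', ascending=False)
)
print(result)
```

The first row is then the [5.54, 5.55, 5.56] group with size 3, matching the ordering asked for in the question.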

Binning

Create equal-width bins and pass them to .value_counts(). This is a shortcut for pd.cut().

Lastly, since this is a categorical value count, zeroes are included, so filter them out.

I also sorted the result by the index so it's easier to compare it against the first solution.

import numpy as np

tolerance = 0.01
start = df['val'].min()
stop = df['val'].max()
step = 3 * tolerance
bins = np.arange(start, stop+step, step)

df['val'].value_counts(bins=bins)[lambda s: s > 0].sort_index()
val
(5.0089999999999995, 5.04]    2
(5.07, 5.1]                   1
(5.52, 5.55]                  2
(5.55, 5.58]                  1
(6.09, 6.12]                  1
(6.27, 6.3]                   1
(6.69, 6.72]                  1
Name: count, dtype: int64

The result isn't quite what you want, but it's close. Maybe you'd want to adjust the start value, e.g. start = df['val'].min() - 2*tolerance.
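A sketch of that adjustment (the np.round call is my addition: np.arange accumulates floating-point drift, and snapping the computed edges to clean decimals makes values like 5.56 that sit exactly on a bin edge compare predictably):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [5.01, 5.01, 5.08, 6.1, 5.54, 6.3, 5.56, 5.55, 6.7],
    columns=['val'])

tolerance = 0.01
step = 3 * tolerance
# Shift the start down so 5.54, 5.55, and 5.56 share a bin
start = df['val'].min() - 2 * tolerance
# Pad the top so the maximum value is always covered by the last bin
stop = df['val'].max() + step
# Snap the edges to clean decimals to avoid float drift at bin boundaries
bins = np.round(np.arange(start, stop + step, step), 10)

counts = df['val'].value_counts(bins=bins)
counts = counts[counts > 0].sort_index()
print(counts)
```

With the shifted start, the three close values land in one bin, (5.53, 5.56], with a count of 3, and every other group is counted as before.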

