Describe the bug
Applying `ops.Groupby(...)` after `ops.Filter(...)` produces incorrect results: some rows are filled with lists of NaNs, and rows are not grouped correctly. The problem appears to be related to row indexes.
This bug appears related to #1767.
Steps/Code to reproduce bug
Sample code:
```python
import pandas as pd
import nvtabular as nvt

# dummy data
_event_id = [0, 1, 2, 3]
_session = ["a", "a", "a", "b"]
_category = ["x", "x", "x", "y"]
_event_type = ["start", "start", "stop", "start"]

input_df = pd.DataFrame(
    {"event_id": _event_id, "session": _session, "category": _category, "event_type": _event_type}
)
print(input_df.head())

# graph
cat_feats = ["category"] >> nvt.ops.Categorify()
features = ["event_id", "session", "event_type"] + cat_feats
features = features >> nvt.ops.Filter(f=lambda df: df["event_type"] == "start")
groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["session"],
    aggs={
        "event_id": "list",
        "category": ["list", "count"],
        "event_type": ["list"],
    },
)

processor = nvt.Workflow(groupby_features)
dataset = nvt.Dataset(input_df)
output_df = processor.fit_transform(dataset)
print(output_df.head())
```
`input_df` looks like this:

```
   event_id session category event_type
0         0       a        x      start
1         1       a        x      start
2         2       a        x       stop
3         3       b        y      start
```

And `output_df` (after filter and groupby):

```
  session    event_id_list    category_list        event_type_list  category_count
0       a  [0.0, 1.0, 3.0]  [3.0, 3.0, 4.0]  [start, start, start]               3
1       b            [nan]            [nan]                 [None]               0
```
Expected behavior
The expected `output_df` should look like this:

```
  session event_id_list category_list event_type_list  category_count
0       a        [0, 1]        [3, 3]  [start, start]               2
1       b           [3]           [4]         [start]               1
```

The event with `event_id == 3` should be assigned to session `b`, not `a`. The dtype of the `event_id_list` and `category_list` columns should be lists of ints, not floats.
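For comparison, a plain-pandas sketch of the same filter-then-groupby (without NVTabular or `Categorify`; the `event_count` column name is just illustrative) produces the expected grouping:

```python
import pandas as pd

df = pd.DataFrame({
    "event_id": [0, 1, 2, 3],
    "session": ["a", "a", "a", "b"],
    "event_type": ["start", "start", "stop", "start"],
})

# Filter to "start" events, then aggregate lists per session.
expected = (
    df[df["event_type"] == "start"]
    .groupby("session")
    .agg(
        event_id_list=("event_id", list),
        event_type_list=("event_type", list),
        event_count=("event_id", "count"),
    )
    .reset_index()
)
print(expected)
# session "a" gets events [0, 1], session "b" gets [3]
```

Note that the lists stay int-typed and event 3 lands in session `b`, matching the expected output above.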
Environment details:
- Environment location: docker container (from nvidia/cuda:11.8.0-devel-ubi8)
- Method of NVTabular install: mamba
- nvtabular version: 23.8.0
Additional context
Related issue #1767 was about a `TypeError`. In the `output_df` above you can see that the `category_list` column contains lists of floats (categories should be ints after `ops.Categorify`), so they were converted in order to avoid the `TypeError`.
I believe only the symptom of the bug was fixed there, not the cause. The `TypeError` was an indirect result of the bug described in this issue: since `Groupby` fills some rows with NaNs, there was a type conflict between the original values (ints) and the NaNs (floats). But the real problem is that `Groupby` after `Filter` messes up indexing and creates empty rows.
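A minimal pandas sketch of what I suspect is happening internally (hypothetical, I have not traced the NVTabular code): `Filter` leaves a non-contiguous label index, and any later step that assumes a contiguous `0..n-1` index turns surviving rows into NaNs:

```python
import pandas as pd

df = pd.DataFrame({
    "event_id": [0, 1, 2, 3],
    "session": ["a", "a", "a", "b"],
    "event_type": ["start", "start", "stop", "start"],
})

# After filtering, the label index is [0, 1, 3] -- not contiguous.
filtered = df[df["event_type"] == "start"]

# If downstream code rebuilds a contiguous index with reindex, label 3
# is lost: the last row becomes all NaNs and event_id turns into float.
broken = filtered.reindex(range(len(filtered)))

# Resetting the index right after the filter keeps all rows intact.
fixed = filtered.reset_index(drop=True)
print(broken)
print(fixed)
```

This reproduces both symptoms (NaN-filled rows and int-to-float conversion), which is why I suspect the index handling between `Filter` and `Groupby` is the root cause.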