Describe the bug
Applying `ops.Groupby(...)` after `ops.Filter(...)` produces incorrect results: some rows are filled with lists of NaNs, and rows are not grouped correctly. The problem appears to be related to row indexes.
This bug appears related to #1767.
Steps/Code to reproduce bug
Sample code:
```python
import pandas as pd
import nvtabular as nvt

# dummy data
_event_id = [0, 1, 2, 3]
_session = ["a", "a", "a", "b"]
_category = ["x", "x", "x", "y"]
_event_type = ["start", "start", "stop", "start"]

input_df = pd.DataFrame(
    {"event_id": _event_id, "session": _session, "category": _category, "event_type": _event_type}
)
print(input_df.head())

# graph
cat_feats = ["category"] >> nvt.ops.Categorify()
features = ["event_id", "session", "event_type"] + cat_feats
features = features >> nvt.ops.Filter(f=lambda df: df["event_type"] == "start")
groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["session"],
    aggs={
        "event_id": "list",
        "category": ["list", "count"],
        "event_type": ["list"],
    },
)

processor = nvt.Workflow(groupby_features)
dataset = nvt.Dataset(input_df)
output_df = processor.fit_transform(dataset)
print(output_df.head())
```
`input_df` looks like this:

```
   event_id session category event_type
0         0       a        x      start
1         1       a        x      start
2         2       a        x       stop
3         3       b        y      start
```

And `output_df` (after filter and groupby):

```
  session    event_id_list    category_list        event_type_list  category_count
0       a  [0.0, 1.0, 3.0]  [3.0, 3.0, 4.0]  [start, start, start]               3
1       b            [nan]            [nan]                 [None]               0
```
Expected behavior
The expected `output_df` should look like this:

```
  session event_id_list category_list event_type_list  category_count
0       a        [0, 1]        [3, 3]  [start, start]               2
1       b           [3]           [4]         [start]               1
```

The event with `event_id == 3` should be assigned to session `b`, not `a`. The dtype of the `event_id_list` and `category_list` columns should be lists of ints, not floats.
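For comparison, a plain-pandas sketch of the same filter-then-groupby (without NVTabular or `Categorify`; the `event_count` column name is just illustrative) produces the expected grouping:

```python
import pandas as pd

df = pd.DataFrame({
    "event_id": [0, 1, 2, 3],
    "session": ["a", "a", "a", "b"],
    "event_type": ["start", "start", "stop", "start"],
})

# Filter to "start" events, then aggregate lists per session.
expected = (
    df[df["event_type"] == "start"]
    .groupby("session")
    .agg(
        event_id_list=("event_id", list),
        event_type_list=("event_type", list),
        event_count=("event_id", "count"),
    )
    .reset_index()
)
print(expected)
# session "a" gets events [0, 1], session "b" gets [3]
```

Note that the lists stay int-typed and event 3 lands in session `b`, matching the expected output above.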
Environment details:
- Environment location: docker container (from nvidia/cuda:11.8.0-devel-ubi8)
- Method of NVTabular install: mamba
- nvtabular version: 23.8.0
Additional context
Related issue #1767 was about a `TypeError`. In the `output_df` above you can see that the `category_list` column contains lists of floats (categories should be ints after `ops.Categorify`), so they were converted in order to avoid the `TypeError`.
I believe only the symptom of the bug was fixed there, not the cause. The `TypeError` was an indirect result of the bug described in this issue: since `Groupby` fills some rows with NaNs, there was a type conflict between the original values (ints) and the NaNs (floats). But the real problem is that `Groupby` after `Filter` messes up indexing and creates empty rows.
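A minimal pandas sketch of what I suspect is happening internally (hypothetical, I have not traced the NVTabular code): `Filter` leaves a non-contiguous label index, and any later step that assumes a contiguous `0..n-1` index turns surviving rows into NaNs:

```python
import pandas as pd

df = pd.DataFrame({
    "event_id": [0, 1, 2, 3],
    "session": ["a", "a", "a", "b"],
    "event_type": ["start", "start", "stop", "start"],
})

# After filtering, the label index is [0, 1, 3] -- not contiguous.
filtered = df[df["event_type"] == "start"]

# If downstream code rebuilds a contiguous index with reindex, label 3
# is lost: the last row becomes all NaNs and event_id turns into float.
broken = filtered.reindex(range(len(filtered)))

# Resetting the index right after the filter keeps all rows intact.
fixed = filtered.reset_index(drop=True)
print(broken)
print(fixed)
```

This reproduces both symptoms (NaN-filled rows and int-to-float conversion), which is why I suspect the index handling between `Filter` and `Groupby` is the root cause.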