RFC: Generalizing the GC API to support group operators #1975
Description
Consider the problem of garbage collecting an integral used as input to the LAG(1) operator. Take a group of values with a common key, sorted by one of the table columns with lateness. The waterline in the diagram below denotes the timestamp before which no values can change. For the LAG(1) operator to work correctly, we must retain at least the latest value preceding the waterline, no matter how far in the past it lies, and all newer values.
                the oldest value
             that must be preserved
                       │
                       │
                       ▼
───o───────o─────────o-o────────────────o───────o──────o───────────────────────────►
                                ▲                                               time
                                │
                                │
                                │
                                │
                            waterline
In general, LAG(n) requires storing the n most recent values preceding the waterline. Similarly, top(k) and related operators require storing either k rows or all recent rows whose total weight doesn't exceed k, depending on the exact flavor of the operator. Finally, the asof-join operator also requires storing 1 row before the waterline in the right-hand collection.
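To make the LAG(n) bound concrete, here is a minimal sketch that computes where the retained suffix of a group starts. It assumes timestamps are plain `u64` values sorted in ascending order; `lag_retain_from` is a hypothetical helper written for illustration, not an existing function.

```rust
/// Hypothetical helper (not part of any existing API): returns the index of
/// the oldest element that LAG(n) still needs. Everything at a smaller index
/// can be garbage collected.
///
/// Assumption: `timestamps` is sorted in ascending order.
fn lag_retain_from(timestamps: &[u64], waterline: u64, n: usize) -> usize {
    // Count the elements strictly before the waterline.
    let before = timestamps.partition_point(|&ts| ts < waterline);
    // Keep the last `n` of those, plus everything at or after the waterline.
    // If fewer than `n` elements precede the waterline, keep all of them.
    before.saturating_sub(n)
}
```

With timestamps `[3, 11, 21, 23, 40, 48, 55]` and a waterline of `32`, LAG(1) retains from index 3 (the value at `23`), however old that value is relative to the waterline.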
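We currently don't have a way to express such bounds. The integrate_trace_retain_values operator now takes a filter function that is applied to individual values: Fn(&B::Val, &TS) -> bool. Such a filter sees one value at a time, so it cannot express a condition like "the latest value preceding the waterline", which depends on the other values in the group.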
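The limitation can be illustrated with a standalone model of a per-value filter. This is only a sketch: `retain_by_value` and the `(timestamp, value)` tuple layout are assumptions made for the example, and only the shape of the closure mirrors the `Fn(&B::Val, &TS) -> bool` signature above.

```rust
/// Standalone model of a per-value retention filter. This is NOT the real
/// operator; the group is modeled as a plain vector of (timestamp, value).
fn retain_by_value<V, TS, F>(group: &mut Vec<(TS, V)>, keep: F)
where
    F: Fn(&V, &TS) -> bool,
{
    // Each element is judged in isolation; the filter never sees its neighbors.
    group.retain(|(ts, val)| keep(val, ts));
}
```

For a group `[(3, "a"), (23, "b"), (40, "c")]` and a waterline of `32`, the filter `|_val, ts| *ts >= 32` leaves only `(40, "c")`. The value at `23`, which LAG(1) still needs, is dropped, and no per-value predicate can tell it apart from the value at `3`.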
One possible generalization is to instead have integrate_trace_retain_values accept a function that maps the entire group into a subset of its elements. Internally, it would iterate over the group in reverse (newest values first), continue up to and including the first element past the waterline, and drop everything older than that element. I need to think about the best way to express this, but this would be the crux of it.
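A rough sketch of what such a group-level function could do for LAG(n), written against a plain vector rather than a trace. The name `retain_for_lag`, the `u64` timestamps, and the in-place `Vec` representation are all assumptions made for illustration; the actual API shape is still open.

```rust
/// Hypothetical group-level retention for LAG(n): keeps every value at or
/// after the waterline plus the `n` most recent values before it.
///
/// Assumption: `group` is sorted by timestamp in ascending order.
fn retain_for_lag<V>(group: &mut Vec<(u64, V)>, waterline: u64, n: usize) {
    let mut kept_before = 0; // values older than the waterline kept so far
    let mut cut = 0; // index of the oldest element to keep

    // Walk the group in reverse, from the newest value to the oldest.
    for (i, (ts, _)) in group.iter().enumerate().rev() {
        if *ts < waterline {
            if kept_before == n {
                // Already holding `n` values before the waterline:
                // this element and everything older can go.
                cut = i + 1;
                break;
            }
            kept_before += 1;
        }
    }

    // Remove the prefix [0, cut); dropping the drain iterator discards it.
    drop(group.drain(..cut));
}
```

With `n = 1` this reproduces the rule from the diagram: the single latest value before the waterline survives together with all newer values. Other operators, such as top(k) or asof-join, would need their own variants of the loop condition.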
Of course, all the hard work would be done by the compiler, which needs to generate these more complex functions: in the general case there can be multiple such "group transformers", in addition to the old-style filters.