RFC: Generalizing the GC API to support group operators #1975

@ryzhyk

Consider the problem of garbage collecting an integral used as input to the LAG(1) operator. Take a group of values with a common key, and assume that the group is sorted by one of the table columns that has lateness. In the diagram below, waterline denotes the timestamp such that no value before this timestamp can change. For the LAG(1) operator to work correctly, we must retain the latest value preceding the waterline, no matter how far in the past it is, as well as all newer values.

                   the oldest value                                                        
                 that must be preserved                                                    
                       │                                                                   
                       │                                                                   
                       ▼                                                                   
───o───────o─────────o-o────────────────o───────o──────o───────────────────────────►
                                    ▲                                              time
                                    │                                                      
                                    │                                                      
                                    │                                                      
                                    │                                                      
                                waterline                                                  

In general, LAG(n) requires storing the n most recent values preceding the waterline. Similarly, top(k) and related operators require storing either k rows or all recent rows whose total weight doesn't exceed k, depending on the exact flavor of the operator. Finally, the asof-join operator requires storing one row before the waterline in the right-hand collection.
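The LAG(n) bound can be written down as a small index computation. The following is only a sketch under stated assumptions: the group is modeled as an ascending slice of u64 timestamps, and lag_retain_start is a hypothetical helper, not part of DBSP.

```rust
/// Hypothetical helper: returns the index of the oldest element that
/// LAG(n) still needs. `timestamps` must be sorted in ascending order.
fn lag_retain_start(timestamps: &[u64], waterline: u64, n: usize) -> usize {
    // Index of the first value at or after the waterline; everything
    // before this index can no longer change.
    let first_live = timestamps.partition_point(|&ts| ts < waterline);
    // Keep `n` frozen values before it, or fewer if the group is shorter.
    first_live.saturating_sub(n)
}

fn main() {
    let group = [10u64, 20, 30, 32, 50, 60, 70];
    let waterline = 40;

    let start = lag_retain_start(&group, waterline, 1);
    println!("LAG(1) retains {:?}", &group[start..]); // [32, 50, 60, 70]

    let start = lag_retain_start(&group, waterline, 2);
    println!("LAG(2) retains {:?}", &group[start..]); // [30, 32, 50, 60, 70]
}
```

Everything before the returned index is safe to garbage collect; everything from it onwards must stay in the integral.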

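We don't have a way to express such bounds today. The integrate_trace_retain_values operator currently takes a filter function that is applied to each value individually: Fn(&B::Val, &TS) -> bool. Such a filter cannot express "keep the latest value before the waterline", because that decision depends on the other values in the group.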
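To see why a per-value filter falls short, here is a minimal standalone sketch. It does not call DBSP: retain_values is a hypothetical stand-in that only mimics the shape of the filter signature quoted above, with values modeled as plain u64 timestamps.

```rust
/// Hypothetical stand-in for a per-value retention filter: `keep` sees
/// one value and the waterline, and nothing else about the group.
fn retain_values<V: Clone, TS>(
    group: &[V],
    waterline: &TS,
    keep: impl Fn(&V, &TS) -> bool,
) -> Vec<V> {
    group
        .iter()
        .filter(|v| keep(*v, waterline))
        .cloned()
        .collect()
}

fn main() {
    let group = [10u64, 20, 30, 32, 50, 60, 70];
    // The natural filter: keep values that can still change.
    let kept = retain_values(&group, &40u64, |ts, waterline| ts >= waterline);
    // 32 is dropped, although LAG(1) needs it as the predecessor of 50.
    println!("{:?}", kept); // [50, 60, 70]
}
```

Loosening the predicate does not help: no fixed condition on a single value can tell 32 (needed) apart from 30 (garbage) without looking at the rest of the group.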

One possible generalization is to have integrate_trace_retain_values instead accept a function that maps the entire group to the subset of its elements that must be retained. Internally, it would iterate over the group in reverse (newest to oldest), continue up to the first element past the waterline, i.e., the latest value older than the waterline, and drop everything older than that. I need to think about the best way to express this, but this would be the crux of it.
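One way the reverse iteration could look is sketched below. This is an illustration only, not a proposed signature: retain_group, the ts_of accessor, and the n parameter are all hypothetical, and the group is modeled as an ascending slice rather than a trace cursor.

```rust
/// Hypothetical group-level retention function. `group` is sorted by
/// timestamp in ascending order; the result is the suffix to retain:
/// up to `n` values preceding the waterline plus everything newer.
fn retain_group<'a, V, TS: PartialOrd>(
    group: &'a [V],
    waterline: &TS,
    ts_of: impl Fn(&V) -> TS,
    n: usize,
) -> &'a [V] {
    let mut before = 0; // values kept so far that precede the waterline
    let mut start = group.len();
    // Walk from newest to oldest and stop once `n` frozen values are kept.
    for (i, v) in group.iter().enumerate().rev() {
        if ts_of(v) < *waterline {
            if before == n {
                break;
            }
            before += 1;
        }
        start = i;
    }
    &group[start..]
}

fn main() {
    // Rows are (timestamp, payload) pairs.
    let rows = [(10u64, "a"), (20, "b"), (32, "c"), (50, "d"), (60, "e")];
    let kept = retain_group(&rows, &40u64, |r| r.0, 1);
    println!("{:?}", kept); // [(32, "c"), (50, "d"), (60, "e")]
}
```

Because the walk stops as soon as the bound is satisfied, the cost is proportional to the retained part of the group rather than to its full history.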

Of course, all the hard work would be done by the compiler, which needs to generate these more complex functions: in the general case there can be multiple such "group transformers", in addition to the old-style filters.

Labels: DBSP core (related to the core DBSP library), SQL compiler (related to the SQL compiler)
