Conversation
The closure passed in is really quite generic, not as previously described (which was more accurate for `Runtime::init_circuit`). Signed-off-by: Ben Pfaff <blp@feldera.com>
Commit 3a25533 ("dbsp: Persist spine's in-memory batches to storage during commit.") made the commit operation on a spine persist all of its batches, but it also caused each batch to be cloned before being written if it was in the process of being merged (because `Arc::unwrap_or_clone()` does a clone if there's another reference, and a merge is the case where there'd be another reference). There's a better way that doesn't have this cost, and this commit implements it. This commit also reverts `map_batches_mut` to the previous version, which was fine for its purpose before and we might as well keep it since this stops using it for the new purpose. This is a speculative optimization only, because our production code currently doesn't ever commit. Signed-off-by: Ben Pfaff <blp@feldera.com>
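To illustrate the cost the commit message describes, here is a small self-contained sketch (not dbsp code) of how `Arc::unwrap_or_clone` behaves when a second reference, such as an in-progress merge, still holds the batch:

```rust
use std::sync::Arc;

fn main() {
    // With a unique reference, `unwrap_or_clone` just unwraps: no copy.
    let unique = Arc::new(vec![1u64; 4]);
    let v = Arc::unwrap_or_clone(unique);
    assert_eq!(v.len(), 4);

    // With a second reference (e.g., an in-progress merge holding the
    // batch), `unwrap_or_clone` must deep-clone the contents instead.
    let shared = Arc::new(vec![2u64; 4]);
    let held_by_merge = Arc::clone(&shared);
    let cloned = Arc::unwrap_or_clone(shared); // copies the Vec
    assert_eq!(cloned, vec![2u64; 4]);
    assert_eq!(held_by_merge.len(), 4); // the other reference is unaffected
}
```

For a batch that is mid-merge, that deep clone is exactly the per-commit cost this change avoids.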
Signed-off-by: Ben Pfaff <blp@feldera.com>
The `Time` associated type for a batch always has a default value, and that value is always suitable for an empty batch. Therefore, this commit switches to using the default and dropping the parameter. Also, the default implementation for these methods is always OK, so this commit drops some non-default implementations that were just cut-and-paste of the defaults anyhow. Signed-off-by: Ben Pfaff <blp@feldera.com>
The only reason this required `Time = ()` was a call to `dyn_empty`, which needed a time parameter. We could have just used `B::Time::default()` but with the previous commit we don't even have to do that. Signed-off-by: Ben Pfaff <blp@feldera.com>
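A minimal sketch of the pattern (trait and method names are modeled on the `dyn_empty` mentioned above, not the actual dbsp trait): when the `Time` type implements `Default`, an empty batch can be constructed without an explicit time parameter.

```rust
// Illustrative sketch, not the dbsp API.
trait Batch {
    type Time: Default;

    fn dyn_empty_at(time: Self::Time) -> Self;

    // Provided method: callers drop the time parameter entirely.
    fn dyn_empty() -> Self
    where
        Self: Sized,
    {
        Self::dyn_empty_at(Self::Time::default())
    }
}

struct EmptyBatch {
    time: u32,
}

impl Batch for EmptyBatch {
    type Time = u32;

    fn dyn_empty_at(time: u32) -> Self {
        EmptyBatch { time }
    }
}

fn main() {
    // The default time (0 for u32) is always suitable for an empty batch.
    assert_eq!(EmptyBatch::dyn_empty().time, 0);
}
```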
The parker in `std` works fine and avoids an extra thread-local variable. Signed-off-by: Ben Pfaff <blp@feldera.com>
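For reference, the `std` parker is just `std::thread::park` plus `Thread::unpark`; a sketch of the usual pattern (the condition must be re-checked in a loop because `park` may return spuriously):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let ready = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&ready);

    let worker = thread::spawn(move || {
        // Park until the flag is set; `park` may wake spuriously,
        // so always re-check the condition in a loop.
        while !flag.load(Ordering::Acquire) {
            thread::park();
        }
        42
    });

    ready.store(true, Ordering::Release);
    worker.thread().unpark();
    assert_eq!(worker.join().unwrap(), 42);
}
```

Because the park token is stored on the thread itself, no extra thread-local variable is needed.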
Once we use a single kill signal, `WorkerHandle` and `BackgroundWorkerHandle` become just a wrapper around `JoinHandle`, so we also drop the wrappers and just use plain `JoinHandle`. Signed-off-by: Ben Pfaff <blp@feldera.com>
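A sketch of the resulting shape (names hypothetical, not the dbsp code): with one shared kill flag, a plain `JoinHandle` plus the flag is all the per-thread state that remains, so a dedicated handle wrapper adds nothing.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

// Hypothetical worker: loops until the shared kill signal is raised.
fn spawn_worker(kill: Arc<AtomicBool>) -> JoinHandle<u64> {
    thread::spawn(move || {
        let mut iterations = 0;
        while !kill.load(Ordering::Acquire) {
            iterations += 1; // ... one unit of work ...
            thread::yield_now();
        }
        iterations
    })
}

fn main() {
    let kill = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = (0..2)
        .map(|_| spawn_worker(Arc::clone(&kill)))
        .collect();
    kill.store(true, Ordering::Release);
    for h in handles {
        assert!(h.join().is_ok());
    }
}
```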
I don't know what this was for. It wasn't used for anything other than being printed in the `Debug` output (maybe that's what it was for?), and it was sometimes assigned an arbitrary value like 0xbeef. Signed-off-by: Ben Pfaff <blp@feldera.com>
Signed-off-by: Ben Pfaff <blp@feldera.com>
An upcoming commit will change the way the spine communicates with the background merger. For that purpose, it makes sense for all the logic for working with the background threads to be integrated into the merger itself rather than sprinkled into the runtime as well. This commit does that. Signed-off-by: Ben Pfaff <blp@feldera.com>
The existing asynchronous merger works in terms of one merger at a time. The spine sends it a merge task and receives back the merge results. Since it does it one at a time, the spine can only start a new merge when it gets a chance to run itself. That limits the merge rate in significant ways. This commit rewrites the merger to be fully asynchronous. The spine hands the merger all of its batches. The merger thread then merges all of those batches. When the spine gets a chance to run, it fetches the new collection of batches, which might have completely changed in the meantime. This produces much better performance. Signed-off-by: Ben Pfaff <blp@feldera.com>
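A minimal sketch of the hand-off described above (all names illustrative, not the dbsp API): the spine and the merger share the current collection of batches; the merger replaces batches as merges complete, and the spine reads back whatever is there when it next gets to run.

```rust
use std::sync::{Arc, Mutex};

type Batches = Vec<Vec<u64>>;

struct Shared {
    batches: Mutex<Batches>,
}

impl Shared {
    // Spine side: add a batch and fetch the current collection,
    // which the merger may have completely reshaped in the meantime.
    fn add_and_fetch(&self, batch: Vec<u64>) -> Batches {
        let mut guard = self.batches.lock().unwrap();
        guard.push(batch);
        (*guard).clone()
    }

    // Merger side: merge the two smallest batches, if any exist.
    fn merge_step(&self) {
        let mut guard = self.batches.lock().unwrap();
        guard.sort_by_key(|b| b.len());
        if guard.len() >= 2 {
            let a = guard.remove(0);
            let b = guard.remove(0);
            let mut merged = a;
            merged.extend(b);
            merged.sort_unstable();
            guard.push(merged);
        }
    }
}

fn main() {
    let shared = Arc::new(Shared { batches: Mutex::new(Vec::new()) });
    shared.add_and_fetch(vec![3, 1]);
    shared.add_and_fetch(vec![2]);
    shared.merge_step(); // in the real design this runs on the merger thread
    let now = shared.batches.lock().unwrap().clone();
    assert_eq!(now, vec![vec![1, 2, 3]]);
}
```

The real merger runs `merge_step` continuously on its own thread, which is what decouples the merge rate from how often the spine itself gets scheduled.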
Here are the Nexmark SQL test results for 100M events before and after this PR:
@blp Nice! Looks like q9's memory usage has dropped dramatically with this PR. q15 seems to have gotten slightly slower. There seems to be this weird thing where we're faster with storage in several cases, but I assume that's measurement variability.
Yep! Most of them dropped.
q15 is really odd. It is by far the slowest query, even though it's a simple query. I plan to investigate it at some point. q15 is just:
It's fairly consistent in some cases. I haven't investigated, so I can only guess, and your guess is probably as good as mine at this point.
The one thing I noticed about q15, which is different from the others, is that it seems to create a lot more spines.
Probably worth looking at the query plan and profile for oddities (but separately from this PR).
I will take a look at the plan. But each distinct requires a spine, and each aggregate requires 2. |
That's a useful observation. Maybe it's an opportunity to tune for that case. For example, maybe the per-step fuel should be smaller if there's a lot of spines, or maybe my big comment is relevant here. |
We can't allow empty batches to enter the asynchronous spine, because the `CursorList` that the spine uses does not tolerate empty batches. If it did tolerate them, then they'd waste time anyhow. We didn't intend to allow empty batches in the spine, but there were two cases where it could happen. First, a merge could produce an empty batch. Second, `truncate_keys_below` and `recede_to` could insert empty batches via `resume`. This fixes both cases. This also renames `add_batch` to clarify how it's different from `add_batches`. Thanks to Lalith for reporting the problem and Gerd for diagnosing it. Signed-off-by: Ben Pfaff <blp@feldera.com>
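The fix described above amounts to guarding every insertion path; a minimal sketch, with hypothetical `Batch`/`Spine` names standing in for the real dbsp types:

```rust
// Illustrative only: empty batches are discarded at the single
// insertion point, so neither merge results nor `resume`-style
// re-insertions can put one into the spine.
trait Batch {
    fn len(&self) -> usize;
    fn is_empty(&self) -> bool {
        self.len() == 0
    }
}

struct Spine<B> {
    batches: Vec<B>,
}

impl<B: Batch> Spine<B> {
    /// Insert a single batch, silently dropping it if empty.
    fn insert_nonempty(&mut self, batch: B) {
        if !batch.is_empty() {
            self.batches.push(batch);
        }
    }
}

fn main() {
    struct V(Vec<u32>);
    impl Batch for V {
        fn len(&self) -> usize {
            self.0.len()
        }
    }
    let mut spine = Spine { batches: Vec::new() };
    spine.insert_nonempty(V(vec![]));     // dropped
    spine.insert_nonempty(V(vec![1, 2])); // kept
    assert_eq!(spine.batches.len(), 1);
}
```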
Everything red has state, i.e., a spine.
I don't really see anything that could be improved here.
This is without LATENESS.
Would multi-join help?
```rust
level: usize,
/// A key that is unique within the level.
/// Invariant: the batches (if present) must be non-empty.
merging_batches: Option<[Arc<B>; 2]>,
```
This looks like we can only process one merge per-level at any time? Is it a problem for small size classes?

> This looks like we can only process one merge per-level at any time?

Yes.

> Is it a problem for small size classes?
I don't think so. My theory is that merging has two benefits:
- To make iteration and searching cheaper, because we have fewer batches to iterate and search in our CursorLists.
- In some cases only, to reduce the total amount of data, because weights summarize multiple items or cancel each other out, or because filters drop data.
We only get these benefits when we complete a merge. An ongoing but incomplete merge only has a cost (the CPU we invested in it, plus memory or storage), with no benefits. Therefore, it is to our benefit to both complete as many merges as possible and to have as few ongoing merges as possible.
Since the smallest merges are cheapest, one might conclude that we should only do one merge at a time and that it should be of the two smallest batches. In a static scenario, that would indeed be the right choice. But we tend to get a new smallest batch on every step. With that, the strategy of always doing the smallest merge piles up larger batches that never get merged. For example, suppose that we add a new batch with size 1 in every step, we start merging the two smallest runs whenever we finish a merge, and a merge takes two steps. Then we end up with something like this, where each line is a step that adds a new batch of size 1, () designates that batches are being merged, and -> shows that a merge was finished or started within the step:
```
1
1 1 -> (1 1)
1 (1 1)
1 1 (1 1) -> 1 1 2 -> 2 (1 1)
1 2 (1 1)
1 1 2 (1 1) -> 1 1 2 2 -> 2 2 (1 1)
1 2 2 (1 1)
1 1 2 2 (1 1) -> 1 1 2 2 2 -> 2 2 2 (1 1)
```
Basically we pile up runs that are slightly longer than the shortest and they never get merged.
So, we still want to complete as many merges as possible and to have as few ongoing merges as possible, but "as few ongoing merges as possible" needs to be more than one and needs to include merges that are bigger than the smallest possible merge. The new merge design does both. I don't know whether it is the best possible, it's only the one that I hit on first.
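The pile-up described above can be reproduced with a tiny simulation of the "always merge the two smallest, one merge at a time, two steps per merge" strategy (illustrative only, not dbsp code):

```rust
fn main() {
    // Greedy strategy: each step adds a batch of size 1; at most one
    // merge runs at a time; a merge takes two steps to complete.
    let mut done: Vec<u64> = Vec::new(); // completed batch sizes
    let mut merging: Option<(u64, u64, u32)> = None; // (a, b, steps left)

    for _step in 0..100 {
        done.push(1); // every step produces a new batch of size 1
        if let Some((a, b, left)) = merging.take() {
            if left > 1 {
                merging = Some((a, b, left - 1));
            } else {
                done.push(a + b); // merge finished
            }
        }
        if merging.is_none() && done.len() >= 2 {
            done.sort_unstable();
            let a = done.remove(0);
            let b = done.remove(0);
            merging = Some((a, b, 2)); // start merging the two smallest
        }
    }
    // The two smallest batches are always fresh 1s, so size-2 outputs
    // pile up forever and nothing larger is ever built.
    assert_eq!(*done.iter().max().unwrap(), 2);
    assert!(done.iter().filter(|&&s| s == 2).count() > 20);
}
```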
The "fuel" assigned for merges at each step should be log N, where N is the total size of the set.
This is the amortized cost you will have to eventually pay.
So you shouldn't stop merging with the small batches of size 1 if there is anything else that can be merged.
But sometimes there won't be any candidate for merging yet.
> The "fuel" assigned for merges at each step should be log N, where N is the total size of the set. This is the amortized cost you will have to eventually pay. So you shouldn't stop merging with the small batches of size 1 if there is anything else that can be merged. But sometimes there won't be any candidate for merging yet.
I don't think this answers anything though. We're not assigning fuel to steps, because merges and steps are asynchronous. Instead, the question is, at a given time, which merge or merges should we work on and how much? A merge of batches with length A and B has cost O(A+B) but the benefit when it completes is O(1) regardless of A and B (unless the merged version has length less than A+B).
Yes, I am saying that if you are doing less than work log(N) you will have to pay for it later.
Ideally you always do around log(N) work in a merge.
> Yes, I am saying that if you are doing less than work log(N) you will have to pay for it later. Ideally you always do around log(N) work in a merge.
Can you recommend any reading material? I don't know how to apply this or where it comes from.
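For intuition on where the log(N) figure comes from: in an idealized binary-counter merge schedule (a sketch for illustration, unrelated to the dbsp code), every inserted item is copied once per level it rises through, so N unit inserts cost N·log2(N) total merge work, i.e., log2(N) amortized per insert:

```rust
fn main() {
    // Binary-counter merging: one slot per level; inserting a size-1
    // batch cascades merges like a carry chain. A merge's cost is the
    // size of its output.
    let n: u64 = 1 << 10; // 1024 inserts
    let mut levels: Vec<Option<u64>> = Vec::new();
    let mut work: u64 = 0;
    for _ in 0..n {
        let mut carry = 1u64;
        let mut lvl = 0usize;
        loop {
            if levels.len() == lvl {
                levels.push(None);
            }
            match levels[lvl].take() {
                None => {
                    levels[lvl] = Some(carry);
                    break;
                }
                Some(existing) => {
                    carry += existing; // merge two equal-size batches
                    work += carry;     // pay the merge's output size
                    lvl += 1;
                }
            }
        }
    }
    // Total work is N * log2(N): log2(N) amortized per insert.
    assert_eq!(work, n * 10);
}
```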
The third benefit is that during merging we will GC old data flagged by the garbage collector. This will be most effective if we prioritize merging batches with the oldest data.
> The third benefit is that during merging we will GC old data flagged by the garbage collector. This will be most effective if we prioritize merging batches with the oldest data.
I think that's covered by my case 2.
All these joins are joining on the empty key since these are global aggregates.
I mean, there will be fewer spines, but all of them should be tiny.
Each distinct is applied to a different value (a function of the data). I don't see how you can combine them.
The aggregates are over different sets. Moreover, most aggregates are linear. When we improve our aggregate implementations, all aggregates will look like this.
Each join has 1 input and 1 output record.
I expect all memory consumption is in the integrators for the distinct operators.
I ran them with:
This is an "infinite state" query, where none of the projections of the bid table on the "auctions" or "bidder" fields can be GC-ed due to the distinct. But hopefully these won't grow as fast as the "bids" table itself.
Signed-off-by: Ben Pfaff <blp@feldera.com>
This includes a bug fix to `BackgroundThread::run` in the case where all the workers exited: it would park itself instead of exiting. Signed-off-by: Ben Pfaff <blp@feldera.com>
Some of the tests in `crates/dbsp/src/operator/dynamic/input.rs` depended on unspecified behavior in the spine implementation: the spine does not guarantee that key and value filters are applied at any particular time, but the tests relied on the particular behavior of the previous implementation. The tests using `input_map_updates1` relied on the spine not filtering the value (1,91) immediately when the waterline was raised, and thus the value was reported as removed. The new spine sometimes filters it that quickly, however, so the test became unreliable. This commit changes the test to use a different key (5 instead of 1) for raising the waterline. This keeps the map implementation from sometimes reporting a removal. The tests using `input_map_updates2` relied on the spine not filtering the value (1,1) immediately when the waterline was raised. The old spine never did; the new spine sometimes does. This commit comments out the part of the test that relied on that behavior. See #2119. Signed-off-by: Ben Pfaff <blp@feldera.com>
The new spine is likely to start merging before the filter gets updated, if a batch is inserted before a filter update. This commit reverses the order so that the filter will definitely be correct. This fails in CI but not on my local machine so it's hard to test. Signed-off-by: Ben Pfaff <blp@feldera.com>

This commit rewrites the asynchronous merger to be more asynchronous and faster. It is also simpler: this commit deletes more lines than it adds.
As of now, this is still going through CI, but I thought it was important to get review started.
I'm also running some benchmarks for comparative purposes, but here are a few to get started:
(The RAM above is peak; the average would be smaller.)