
Optimize asynchronous merger#2115

Merged
blp merged 16 commits into main from optimization
Jul 30, 2024

Conversation

@blp (Member) commented Jul 29, 2024

This commit rewrites the asynchronous merger to be more asynchronous and faster. It is also simpler--this commit deletes more lines than it adds.

As of now, this is still going through CI, but I thought it was important to get review started.

I'm also running some benchmarks for comparative purposes, but here are a few to get started:

  • Without storage: without this series, Nexmark q9 takes 123 s and 114 GB RAM for 100M events; with this series, 56 s and 18 GB RAM
  • With storage: without this series, 125 s and 47 GB RAM; with this series, 76 s and 15 GB RAM.

(The RAM above is peak; the average would be smaller.)

blp added 11 commits July 27, 2024 09:48
The closure passed in is really quite generic, not as previously described
(which was more accurate for `Runtime::init_circuit`).

Signed-off-by: Ben Pfaff <blp@feldera.com>
Commit 3a25533 ("dbsp: Persist spine's in-memory batches to storage
during commit.") made the commit operation on a spine persist all of its
batches, but it also caused each batch to be cloned before being written
if it was in the process of being merged (because `Arc::unwrap_or_clone()`
does a clone if there's another reference, and a merge is the case where
there'd be another reference).
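
For reference, `Arc::unwrap_or_clone` only moves the value out when the reference count is exactly one; any second reference, such as the one held by an in-progress merge, forces a full copy of the batch. A standalone sketch (not DBSP code) of that behavior:

```rust
use std::sync::Arc;

// Returns true if `Arc::unwrap_or_clone` had to deep-clone, i.e. another
// reference (such as an in-progress merge) still held the batch.
fn unwrap_forces_clone(extra_ref: bool) -> bool {
    let batch = Arc::new(vec![1u64, 2, 3]);
    let held = if extra_ref { Some(Arc::clone(&batch)) } else { None };
    let had_two_refs = Arc::strong_count(&batch) > 1;
    // With one reference this moves the Vec out; with two it clones it.
    let inner = Arc::unwrap_or_clone(batch);
    assert_eq!(inner, vec![1, 2, 3]);
    drop(held);
    had_two_refs
}

fn main() {
    // Sole owner: no clone needed.
    assert!(!unwrap_forces_clone(false));
    // A second reference (the merge) forces a full copy of the batch.
    assert!(unwrap_forces_clone(true));
    println!("ok");
}
```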

There's a better way that doesn't have this cost, and this commit
implements it.  This commit also reverts `map_batches_mut` to the
previous version, which was fine for its purpose before and we might as
well keep it since this stops using it for the new purpose.

This is a speculative optimization only, because our production code
currently doesn't ever commit.

Signed-off-by: Ben Pfaff <blp@feldera.com>
Signed-off-by: Ben Pfaff <blp@feldera.com>
The `Time` associated type for a batch always has a default value, and
that value is always suitable for an empty batch.  Therefore, this commit
switches to using the default and dropping the parameter.

Also, the default implementation for these methods is always OK, so this
commit drops some non-default implementations that were just cut-and-paste
of the defaults anyhow.
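
A hypothetical sketch of the API change (names simplified, not DBSP's actual trait): with `Time: Default`, an empty batch can be created without an explicit time parameter, and the default method body suffices for implementors.

```rust
use std::fmt::Debug;

// Simplified stand-in for the batch trait: `Time: Default` lets the
// default `empty()` implementation supply the time itself.
trait Batch {
    type Time: Default + Debug;

    // Default implementation; implementors need not override it.
    fn empty() -> (Self, Self::Time)
    where
        Self: Default + Sized,
    {
        (Self::default(), Self::Time::default())
    }
}

#[derive(Default, Debug)]
struct TestBatch(Vec<u64>);

impl Batch for TestBatch {
    type Time = u32; // the default value (0) is suitable for an empty batch
}

fn main() {
    let (batch, time) = TestBatch::empty();
    assert!(batch.0.is_empty());
    assert_eq!(time, 0);
    println!("ok");
}
```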

Signed-off-by: Ben Pfaff <blp@feldera.com>
The only reason this required `Time = ()` was a call to `dyn_empty`, which
needed a time parameter.  We could have just used `B::Time::default()` but
with the previous commit we don't even have to do that.

Signed-off-by: Ben Pfaff <blp@feldera.com>
The parker in `std` works fine and avoids an extra thread-local variable.
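
The `std` parker, for reference: a parked thread blocks until another thread hands it a token via `unpark` (a standalone example, not the DBSP code):

```rust
use std::thread;
use std::time::{Duration, Instant};

// `park_timeout` blocks until the timeout expires or another thread
// unparks this one; here the unpark arrives well before the timeout.
fn wait_for_unpark() -> bool {
    let start = Instant::now();
    let waiter = thread::spawn(move || {
        thread::park_timeout(Duration::from_secs(5));
    });
    thread::sleep(Duration::from_millis(20));
    waiter.thread().unpark(); // wake the background thread early
    waiter.join().unwrap();
    // Woken long before the 5 s timeout would have fired.
    start.elapsed() < Duration::from_secs(5)
}

fn main() {
    assert!(wait_for_unpark());
    println!("ok");
}
```

If `unpark` happens to run before `park_timeout`, the token is stored and the park returns immediately, so there is no lost-wakeup race.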

Signed-off-by: Ben Pfaff <blp@feldera.com>
Once we use a single kill signal, `WorkerHandle` and
`BackgroundWorkerHandle` become just a wrapper around `JoinHandle`, so
we also drop the wrappers and just use plain `JoinHandle`.

Signed-off-by: Ben Pfaff <blp@feldera.com>
I don't know what this was for.  It wasn't used for anything other than
being printed in the `Debug` output (maybe that's what it was for?), and
it was sometimes assigned an arbitrary value like 0xbeef.

Signed-off-by: Ben Pfaff <blp@feldera.com>
Signed-off-by: Ben Pfaff <blp@feldera.com>
An upcoming commit will change the way the spine communicates with the
background merger.  For that purpose, it makes sense for all the logic
for working with the background threads to be integrated into the merger
itself rather than sprinkled into the runtime as well.  This commit does
that.

Signed-off-by: Ben Pfaff <blp@feldera.com>
The existing asynchronous merger works in terms of one merger at a time.
The spine sends it a merge task and receives back the merge results.  Since
it does it one at a time, the spine can only start a new merge when it gets
a chance to run itself.  That limits the merge rate in significant ways.

This commit rewrites the merger to be fully asynchronous.  The spine hands
the merger all of its batches.  The merger thread then merges all of those
batches.  When the spine gets a chance to run, it fetches the new
collection of batches, which might have completely changed in the meantime.
This produces much better performance.
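
A toy model of the new handoff (assumed, simplified types, not DBSP's actual API): the spine sends its whole collection of batches over a channel, the merger thread merges in the background, and the spine later fetches the new collection.

```rust
use std::sync::mpsc;
use std::thread;

// The spine hands the merger all of its batches; the merger thread
// merges them and sends back the (possibly completely different)
// collection for the spine to pick up when it next gets to run.
fn merge_all(batches: Vec<Vec<u64>>) -> Vec<Vec<u64>> {
    let (to_merger, merger_rx) = mpsc::channel::<Vec<Vec<u64>>>();
    let (to_spine, spine_rx) = mpsc::channel::<Vec<Vec<u64>>>();

    let merger = thread::spawn(move || {
        while let Ok(mut batches) = merger_rx.recv() {
            // Repeatedly merge the two smallest batches.
            while batches.len() > 1 {
                batches.sort_by_key(|b| b.len());
                let a = batches.remove(0);
                let b = batches.remove(0);
                let mut merged: Vec<u64> = a.into_iter().chain(b).collect();
                merged.sort_unstable();
                batches.push(merged);
            }
            to_spine.send(batches).unwrap();
        }
    });

    to_merger.send(batches).unwrap(); // spine hands over its batches
    let result = spine_rx.recv().unwrap(); // spine fetches the new collection
    drop(to_merger); // closing the channel lets the merger thread exit
    merger.join().unwrap();
    result
}

fn main() {
    let merged = merge_all(vec![vec![3, 5], vec![1, 4], vec![2]]);
    assert_eq!(merged, vec![vec![1, 2, 3, 4, 5]]);
    println!("ok");
}
```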

Signed-off-by: Ben Pfaff <blp@feldera.com>
@blp blp added the enhancement, DBSP core, storage, and high priority labels Jul 29, 2024
@blp blp requested review from gz and ryzhyk July 29, 2024 04:22
@blp blp self-assigned this Jul 29, 2024

@blp (Member Author) commented Jul 29, 2024

Here's the Nexmark SQL test results for 100M events before this PR:

┌─────┬───────────────────────┬─────────────────────────┐
│     │        runtime        │    peak memory usage    │
│     ├────────┬────────┬─────┼─────────┬─────────┬─────┤
│     │storage │ storage│     │ storage │ storage │     │
│query│disabled│ enabled│ diff│ disabled│ enabled │ diff│
├─────┼────────┼────────┼─────┼─────────┼─────────┼─────┤
│   q4│  62.3 s│  97.0 s│1.56×│ 37.0 GiB│ 18.2 GiB│0.49×│
│   q7│  92.7 s│ 133.5 s│1.44×│ 26.7 GiB│ 12.8 GiB│0.48×│
│   q9│ 122.7 s│ 124.7 s│1.02×│114.2 GiB│ 47.1 GiB│0.41×│
│  q15│ 208.6 s│ 325.6 s│1.56×│ 10.0 GiB│  4.0 GiB│0.41×│
│  q17│  35.7 s│  49.0 s│1.37×│ 15.8 GiB│ 12.9 GiB│0.81×│
│  q18│  59.2 s│  62.3 s│1.05×│ 50.6 GiB│ 23.1 GiB│0.46×│
│  q19│  61.2 s│  57.2 s│0.93×│ 51.6 GiB│ 25.3 GiB│0.49×│
│  q20│  55.1 s│  57.1 s│1.04×│ 40.3 GiB│ 18.0 GiB│0.45×│
├─────┼────────┼────────┼─────┼─────────┼─────────┼─────┤
│   q0│  39.8 s│  37.7 s│0.95×│  0.3 GiB│  0.3 GiB│1.01×│
│   q1│  38.7 s│  37.7 s│0.97×│  0.2 GiB│  0.3 GiB│1.06×│
│   q2│  39.7 s│  40.7 s│1.03×│  0.2 GiB│  0.2 GiB│0.96×│
├─────┼────────┼────────┼─────┼─────────┼─────────┼─────┤
│   q3│  26.5 s│  29.6 s│1.12×│  3.7 GiB│  4.3 GiB│1.16×│
│   q8│  26.5 s│  23.4 s│0.88×│  2.5 GiB│  2.7 GiB│1.08×│
│  q10│  31.6 s│  39.7 s│1.26×│  0.4 GiB│  1.0 GiB│2.65×│
└─────┴────────┴────────┴─────┴─────────┴─────────┴─────┘

and after:

┌─────┬───────────────────────┬─────────────────────────┐
│     │        runtime        │    peak memory usage    │
│     ├────────┬────────┬─────┼─────────┬─────────┬─────┤
│     │storage │ storage│     │ storage │ storage │     │
│query│disabled│ enabled│ diff│ disabled│ enabled │ diff│
├─────┼────────┼────────┼─────┼─────────┼─────────┼─────┤
│   q4│  54.1 s│  56.1 s│1.04×│  9.7 GiB│ 11.2 GiB│1.16×│
│   q7│  84.5 s│ 112.0 s│1.33×│  9.5 GiB│ 14.6 GiB│1.54×│
│   q9│  57.2 s│  90.9 s│1.59×│ 19.4 GiB│ 18.9 GiB│0.97×│
│  q15│ 223.9 s│ 337.8 s│1.51×│ 10.4 GiB│  4.1 GiB│0.40×│
│  q17│  38.8 s│  39.7 s│1.02×│ 13.3 GiB│  3.3 GiB│0.25×│
│  q18│  55.1 s│  51.0 s│0.93×│ 15.8 GiB│  4.8 GiB│0.30×│
│  q19│  57.2 s│  51.1 s│0.89×│ 50.3 GiB│ 13.6 GiB│0.27×│
│  q20│  55.1 s│  43.9 s│0.80×│ 38.8 GiB│  8.6 GiB│0.22×│
├─────┼────────┼────────┼─────┼─────────┼─────────┼─────┤
│   q0│  32.6 s│  32.6 s│1.00×│  0.3 GiB│  0.2 GiB│0.97×│
│   q1│  37.7 s│  34.6 s│0.92×│  0.2 GiB│  0.2 GiB│1.04×│
│   q2│  35.7 s│  38.7 s│1.08×│  0.2 GiB│  0.2 GiB│0.98×│
├─────┼────────┼────────┼─────┼─────────┼─────────┼─────┤
│   q3│  27.5 s│  21.5 s│0.78×│  3.1 GiB│  4.1 GiB│1.31×│
│   q8│  20.4 s│  25.5 s│1.25×│  1.9 GiB│  3.3 GiB│1.76×│
│  q10│  32.6 s│  33.6 s│1.03×│  0.4 GiB│  0.7 GiB│2.00×│
└─────┴────────┴────────┴─────┴─────────┴─────────┴─────┘

@lalithsuresh (Contributor) commented Jul 29, 2024

@blp Nice! Looks like q9's memory usage has dropped dramatically with this PR. q15 seems to have gotten slightly slower. There seems to be this weird thing where we're faster with storage in several cases, but I assume that's measurement variability.

@blp (Member Author) commented Jul 29, 2024

@blp Nice! Looks like q9's memory usage has dropped dramatically with this PR.

Yep! Most of them dropped.

q15 seems to have gotten slightly slower.

q15 is really odd. It is by far the slowest query, even though it's a simple query. I plan to investigate it at some point.

q15 is just:

CREATE VIEW q15 AS
SELECT
     FORMAT_DATE('yyyy-MM-dd', date_time) as 'day',
     count(*) AS total_bids,
     count(*) filter (where price < 10000) AS rank1_bids,
     count(*) filter (where price >= 10000 and price < 1000000) AS rank2_bids,
     count(*) filter (where price >= 1000000) AS rank3_bids,
     count(distinct bidder) AS total_bidders,
     count(distinct bidder) filter (where price < 10000) AS rank1_bidders,
     count(distinct bidder) filter (where price >= 10000 and price < 1000000) AS rank2_bidders,
     count(distinct bidder) filter (where price >= 1000000) AS rank3_bidders,
     count(distinct auction) AS total_auctions,
     count(distinct auction) filter (where price < 10000) AS rank1_auctions,
     count(distinct auction) filter (where price >= 10000 and price < 1000000) AS rank2_auctions,
     count(distinct auction) filter (where price >= 1000000) AS rank3_auctions
FROM bid
GROUP BY FORMAT_DATE('yyyy-MM-dd', date_time);

There seems to be this weird thing where we're faster with storage in several cases, but I assume that's measurement variability.

It's fairly consistent in some cases. I haven't investigated, so I can only guess, and your guess is probably as good as mine at this point.

@gz (Contributor) commented Jul 29, 2024

The one thing I noticed about q15 which is different from others is that it seems to create a lot more spines

@lalithsuresh (Contributor)

Probably worth looking at the query plan and profile for oddities (but separately from this PR).

@mihaibudiu (Contributor)

I will take a look at the plan. But each distinct requires a spine, and each aggregate requires 2.

@blp (Member Author) commented Jul 29, 2024

The one thing I noticed about q15 which is different from others is that it seems to create a lot more spines

That's a useful observation. Maybe it's an opportunity to tune for that case. For example, maybe the per-step fuel should be smaller if there are a lot of spines, or maybe my big comment

            // XXX This always inserts the new batch asynchronously. This is
            // usually what we want to do, but it means that in theory we could
            // continue building up batches until we run out of memory (or disk
            // space), if merging is slow. Thus, we should figure out at what
            // point it makes sense to wait until the backlog is reduced. We
            // wouldn't have to just block; instead, we could grab some of the
            // loose batches and do some merging ourselves while we wait,
            // thereby actually speeding up the merges by devoting two threads
            // instead of just one.

is relevant here.

We can't allow empty batches to enter the asynchronous spine,
because the `CursorList` that the spine uses does not tolerate
empty batches.  If it did tolerate them, then they'd waste
time anyhow.

We didn't intend to allow empty batches in the spine, but
there were two cases where it could happen.  First, a merge
could produce an empty batch.  Second, `truncate_keys_below`
and `recede_to` could insert empty batches via `resume`.
This fixes both cases.

This also renames `add_batch` to clarify how it's different
from `add_batches`.
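
A hypothetical sketch of the invariant (simplified types, not DBSP's actual API): empty batches are filtered out before they can enter the spine's collection.

```rust
// Simplified stand-in for the spine: empty batches must never be stored,
// because the cursor machinery layered on top does not tolerate them.
#[derive(Default)]
struct Spine {
    batches: Vec<Vec<u64>>,
}

impl Spine {
    // A merge result, or a `recede_to`/`truncate_keys_below` resume, may
    // produce an empty batch; drop those at the door.
    fn add_batches(&mut self, new: impl IntoIterator<Item = Vec<u64>>) {
        self.batches
            .extend(new.into_iter().filter(|b| !b.is_empty()));
    }
}

fn main() {
    let mut spine = Spine::default();
    spine.add_batches(vec![vec![1, 2], vec![], vec![3]]);
    // The empty batch was dropped; only the two non-empty ones remain.
    assert_eq!(spine.batches.len(), 2);
    println!("ok");
}
```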

Thanks to Lalith for reporting the problem and Gerd for diagnosing it.

Signed-off-by: Ben Pfaff <blp@feldera.com>
@mihaibudiu (Contributor)

This is Q15's plan:
[attached query plan image]

@mihaibudiu (Contributor)

everything red has state, i.e., spine

@mihaibudiu (Contributor)

I don't really see anything that could be improved here

@mihaibudiu (Contributor)

this is without LATENESS

@lalithsuresh (Contributor)

Would multi-join help?

level: usize,
/// A key that is unique within the level.
/// Invariant: the batches (if present) must be non-empty.
merging_batches: Option<[Arc<B>; 2]>,
Contributor:

This looks like we can only process one merge per-level at any time? Is it a problem for small size classes?

blp (Member Author):

This looks like we can only process one merge per-level at any time?

Yes.

Is it a problem for small size classes?

I don't think so. My theory is that merging has two benefits:

  1. To make iteration and searching cheaper because we have fewer batches to iterate and search in our CursorLists.
  2. In some cases only, to reduce the total amount of data, because weights summarize multiple items or cancel each other out or because filters drop data.

We only get these benefits when we complete a merge. An ongoing but incomplete merge only has a cost (the CPU we invested in it, plus memory or storage), with no benefits. Therefore, it is to our benefit to both complete as many merges as possible and to have as few ongoing merges as possible.

Since the smallest merges are cheapest, one might conclude that we should only do one merge at a time, and that it should be of the two smallest batches. In a static scenario, that would indeed be the right choice. But we tend to get a new smallest batch on every step, and with that, the strategy of always doing the smallest merge piles up larger batches that never get merged. For example, suppose that we add a new batch with size 1 in every step, that we start merging the two smallest batches whenever we finish a merge, and that a merge takes two steps. Then we end up with something like this, where each line is a step that adds a new batch of size 1, () designates batches that are being merged, and -> shows that a merge was finished or started within the step:

1
1 1 -> (1 1)
1 (1 1)
1 1 (1 1) -> 1 1 2 -> 2 (1 1)
1 2 (1 1)
1 1 2 (1 1) -> 1 1 2 2 -> 2 2 (1 1)
1 2 2 (1 1)
1 1 2 2 (1 1) -> 1 1 2 2 2 -> 2 2 2 (1 1)

Basically we pile up runs that are slightly longer than the shortest and they never get merged.

So, we still want to complete as many merges as possible and to have as few ongoing merges as possible, but "as few ongoing merges as possible" needs to be more than one, and it needs to include merges that are bigger than the smallest possible merge. The new merge design does both. I don't know whether it is the best possible design; it's just the first one I hit on.
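
The pile-up described above can be simulated directly (a sketch with the stated assumptions: one size-1 batch per step, a single merger, a merge takes two steps, and the two smallest batches are always picked):

```rust
// Simulate "always merge the two smallest batches" with a single merger.
// Returns the sorted batch sizes left after the given number of steps.
fn simulate(steps: u32) -> Vec<u64> {
    let mut batches: Vec<u64> = Vec::new(); // sizes of loose batches
    let mut merge: Option<(u64, u64, u32)> = None; // (a, b, steps left)
    for _ in 0..steps {
        batches.push(1); // a new smallest batch arrives every step
        if let Some((a, b, left)) = merge.take() {
            if left == 1 {
                batches.push(a + b); // merge completes
            } else {
                merge = Some((a, b, left - 1)); // merge still in progress
            }
        }
        if merge.is_none() && batches.len() >= 2 {
            batches.sort_unstable();
            let a = batches.remove(0);
            let b = batches.remove(0);
            merge = Some((a, b, 2)); // always pick the two smallest
        }
    }
    batches.sort_unstable();
    batches
}

fn main() {
    let batches = simulate(100);
    // The merger only ever picks fresh size-1 batches, so the size-2
    // batches it produces pile up and are never merged any further.
    assert!(batches.iter().all(|&b| b <= 2));
    assert!(batches.iter().filter(|&&b| b == 2).count() >= 40);
    println!("{} leftover batches, none larger than 2", batches.len());
}
```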

Contributor:

The "fuel" assigned for merges at each step should be log N, where N is the total size of the set.
This is the amortized cost you will have to eventually pay.
So you shouldn't stop merging with the small batches of size 1 if there is anything else that can be merged.
But sometimes there won't be any candidate for merging yet.

blp (Member Author):

The "fuel" assigned for merges at each step should be log N, where N is the total size of the set. This is the amortized cost you will have to eventually pay. So you shouldn't stop merging with the small batches of size 1 if there is anything else that can be merged. But sometimes there won't be any candidate for merging yet.

I don't think this answers anything though. We're not assigning fuel to steps, because merges and steps are asynchronous. Instead, the question is, at a given time, which merge or merges should we work on and how much? A merge of batches with length A and B has cost O(A+B) but the benefit when it completes is O(1) regardless of A and B (unless the merged version has length less than A+B).

Contributor:

Yes, I am saying that if you are doing less than work log(N) you will have to pay for it later.
Ideally you always do around log(N) work in a merge.

blp (Member Author):

Yes, I am saying that if you are doing less than work log(N) you will have to pay for it later. Ideally you always do around log(N) work in a merge.

Can you recommend any reading material? I don't know how to apply this or where it comes from.

Contributor:

The third benefit is that during merging we will GC old data flagged by the garbage collector. This will be most effective if we prioritize merging batches with the oldest data.

blp (Member Author):

The third benefit is that during merging we will GC old data flagged by the garbage collector. This will be most effective if we prioritize merging batches with the oldest data.

I think that's covered by my case 2.

@mihaibudiu (Contributor)

Would multi-join help?

All these joins are joining on the empty key since these are global aggregates.
I expect a multi-join version of that won't help.

@mihaibudiu (Contributor)

I mean, there will be fewer spines, but all of them should be tiny

@ryzhyk (Contributor) commented Jul 29, 2024

I don't really see anything that could be improved here

  • Shouldn't we be able to use only two distinct's here?
  • Can we combine several aggregates into one?
  • multiway join would definitely help too

@mihaibudiu (Contributor)

I don't really see anything that could be improved here

* Shouldn't we be able to use only two distinct's here?

Each distinct is applied to a different value (a function of the data). I don't see how you can combine them.

* Can we combine several aggregates into one?

They aggregate different sets. Moreover, most aggregates are linear; when we improve our aggregate implementations, all aggregates will look like this.

* multiway join would definitely help too

Each join has 1 input and 1 output record.

@mihaibudiu (Contributor)

I expect all memory consumption is in the integrators for the distinct operators.
Only lateness information may help that.
Maybe @blp already runs the query with lateness.

@blp (Member Author) commented Jul 29, 2024

I expect all memory consumption is in the integrators for the distinct operators. Only lateness information may help that. Maybe @blp already runs the query with lateness.

I run them with:

CREATE TABLE person (
    id BIGINT,
    name VARCHAR,
    emailAddress VARCHAR,
    creditCard VARCHAR,
    city VARCHAR,
    state VARCHAR,
    date_time TIMESTAMP(3) NOT NULL LATENESS INTERVAL 4 SECONDS,
    extra  VARCHAR
);
CREATE TABLE auction (
    id  BIGINT,
    itemName  VARCHAR,
    description  VARCHAR,
    initialBid  BIGINT,
    reserve  BIGINT,
    date_time  TIMESTAMP(3) NOT NULL LATENESS INTERVAL 4 SECONDS,
    expires  TIMESTAMP(3),
    seller  BIGINT,
    category  BIGINT,
    extra  VARCHAR
);
CREATE TABLE bid (
    auction  BIGINT,
    bidder  BIGINT,
    price  BIGINT,
    channel  VARCHAR,
    url  VARCHAR,
    date_time TIMESTAMP(3) NOT NULL LATENESS INTERVAL 4 SECONDS,
    extra  VARCHAR
);

@mihaibudiu (Contributor) commented Jul 29, 2024

This is an "infinite state" query, where none of the projections of the bid table on the "auctions" or "bidder" fields can be GC-ed due to the distinct. But hopefully these won't grow as fast as the "bids" table itself.
So the LATENESS doesn't help at all.
On a side note, we should make the ID fields primary keys if we can (they must be non-null), and we should mark the corresponding fields as foreign keys; I plan to add optimizations that use this information.

Signed-off-by: Ben Pfaff <blp@feldera.com>
This includes a bug fix to `BackgroundThread::run` in the case where all
the workers exited: it would park itself instead of exiting.
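
The fixed behavior can be sketched like this (simplified, not the actual DBSP code): once every worker has dropped its sender, `recv` returns `Err`, and the background thread must exit rather than park itself forever.

```rust
use std::sync::mpsc;
use std::thread;

// Drain tasks until all workers have exited (all senders dropped), then
// return instead of parking: this is the behavior the fix ensures.
fn run_background(rx: mpsc::Receiver<u32>) -> Vec<u32> {
    let mut done = Vec::new();
    loop {
        match rx.recv() {
            Ok(task) => done.push(task),
            // All workers exited: stop, don't park forever.
            Err(_) => break,
        }
    }
    done
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let background = thread::spawn(move || run_background(rx));
    tx.send(1).unwrap();
    tx.send(2).unwrap();
    drop(tx); // the last worker goes away
    // The thread terminates instead of hanging; join returns promptly.
    assert_eq!(background.join().unwrap(), vec![1, 2]);
    println!("ok");
}
```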

Signed-off-by: Ben Pfaff <blp@feldera.com>

blp added 2 commits July 30, 2024 10:31
Some of the tests in `crates/dbsp/src/operator/dynamic/input.rs` depended
on unspecified behavior in the spine implementation: the spine does not
guarantee that key and value filters are applied at any particular time,
but the tests relied on the particular behavior of the previous
implementation.

The tests using `input_map_updates1` relied on the spine not filtering
the value (1,91) immediately when the waterline was raised, and thus
the value was reported as removed.  The new spine sometimes filters it
that quickly, however, so the test became unreliable.  This commit changes
the test to use a different key (5 instead of 1) for raising the
waterline.  This keeps the map implementation from sometimes reporting a
removal.

The tests using `input_map_updates2` relied on the spine not filtering
the value (1,1) immediately when the waterline was raised.  The old spine
never did; the new spine sometimes does.  This commit comments out the
part of the test that relied on that behavior.

See #2119.

Signed-off-by: Ben Pfaff <blp@feldera.com>
The new spine is likely to start merging before the filter gets updated,
if a batch is inserted before a filter update.  This commit reverses the
order so that the filter will definitely be correct.

This fails in CI but not on my local machine so it's hard to test.

Signed-off-by: Ben Pfaff <blp@feldera.com>
@blp blp merged commit 12c672a into main Jul 30, 2024
@blp blp deleted the optimization branch July 30, 2024 19:02