
Experiment with merge policies #2124

Description

@blp

Some possible merge policies:

  • If we always merge the smallest batches in a slot, then, with new batches of size k continually arriving, we will pile up batches of size 2k in that slot: the fresh size-k batches are always the smallest, so the 2k merge results are never chosen for merging.
  • If we always merge the largest batches, that's inefficient: merge cost grows with the sizes of the inputs, so we'd repeatedly rewrite the biggest batches.
  • We currently always merge the batches most recently added. The merge result (if it goes to the same slot) becomes the most recently added batch. So, if we keep getting new batches of size k, we'll end up doing merges of kn + k => k(n+1) until the result overfills the slot. Not ideal either.
  • Another policy would be to merge the two least recently added batches. Then we'll do k+k=>2k, ..., k+k=>2k, 2k+2k=>4k, ..., 2k+2k=>4k, 4k+4k=>8k, ... and so on. It could also be good for GC to work with older batches (as you observed). That might be a good policy. (It could be bad for cache locality, since we're working with the oldest data.)
  • Another variation would be to merge the least recently added batch with the other batch closest in size. I don't have an intuition about this.
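The doubling behavior of the "merge the two least recently added batches" policy is easy to check with a toy model. The sketch below (hypothetical names, not the actual DBSP trace code) tracks only batch sizes in a queue: new batches and merge results are pushed to the back, so popping from the front always picks the two least recently added. Merging eight size-k batches down to one reproduces the sequence from the bullet above: four k+k=>2k merges, then two 2k+2k=>4k, then 4k+4k=>8k.

```rust
use std::collections::VecDeque;

/// Toy model of one slot: batches are represented only by their sizes,
/// ordered from least recently added (front) to most recently added (back).
/// Repeatedly merges the two least recently added batches until one remains,
/// returning the sequence of (input, input) merge pairs.
fn merge_down(k: u64, n: usize) -> Vec<(u64, u64)> {
    let mut slot: VecDeque<u64> = (0..n).map(|_| k).collect();
    let mut merges = Vec::new();
    while slot.len() > 1 {
        let a = slot.pop_front().unwrap(); // least recently added
        let b = slot.pop_front().unwrap(); // second least recently added
        merges.push((a, b));
        slot.push_back(a + b); // merge result becomes the most recently added
    }
    merges
}

fn main() {
    // With k = 1 and 8 initial batches:
    // 1+1 => 2 (four times), 2+2 => 4 (twice), 4+4 => 8.
    for (a, b) in merge_down(1, 8) {
        println!("{}+{} => {}", a, b, a + b);
    }
}
```

The key detail is that the merge result gets a *new* (most recent) position rather than inheriting the age of its inputs; that is what makes the policy pair up equal-sized batches and produce power-of-two growth instead of the kn + k pattern of the current most-recently-added policy.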

Originally posted by @blp in #2115 (comment)

Metadata

Labels

  • DBSP core — Related to the core DBSP library
  • RFC — Request for Comments
  • storage — Persistence for internal state in DBSP operators
