Labels: DBSP core (Related to the core DBSP library), RFC (Request for Comments), storage (Persistence for internal state in DBSP operators)
Description
Some possible merge policies:
- If we always merge the smallest batches of size k in a slot, we will pile up batches of size 2k in that slot if new batches of size k keep getting added.
- If we always merge the largest batches, that's inefficient.
- We currently always merge the batches most recently added. The merge result (if it goes to the same slot) becomes the most recently added. So, if we keep getting new batches of size k, we'll end up doing merges of kn + k => k(n+1) until that overfills the slot. Not ideal either.
- Another policy would be to merge the two least recently added batches. Then we'll do k+k=>2k, ..., k+k=>2k, 2k+2k=>4k, ..., 2k+2k=>4k, 4k+4k=>8k, ... and so on. It could also be good for GC to work with older batches (as you observed). That might be a good policy. (It could be bad for cache locality, since we're working with the oldest data.)
- Another variation would be to merge the least recently added batch with the other batch closest in size. I don't have an intuition about this.
Originally posted by @blp in #2115 (comment)
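The asymptotic gap between the second and fourth policies above can be sketched with a simple cost model. This is a back-of-envelope illustration, not DBSP code: the function names and the assumption that merging two batches costs the sum of their sizes are ours. `sequential_cost` models the current policy (always fold the newest size-k batch into the running merge result), and `doubling_cost` models the idealized least-recently-added behavior where equal-sized batches get merged pairwise, as in merge sort.

```rust
// Cost model (assumption): merging two batches costs the sum of their sizes.

// Current policy: always merge the newest batch into the running result,
// i.e. k+k => 2k, 2k+k => 3k, ..., kn+k => k(n+1).
fn sequential_cost(n: u64, k: u64) -> u64 {
    let mut result = k; // size of the accumulated merge result
    let mut cost = 0;
    for _ in 1..n {
        cost += result + k; // merge the running result with one new batch
        result += k;
    }
    cost
}

// Idealized doubling policy: merge equal-sized batches pairwise,
// i.e. n/2 merges of k+k, then n/4 merges of 2k+2k, and so on.
fn doubling_cost(n: u64, k: u64) -> u64 {
    let mut cost = 0;
    let mut size = k;
    let mut count = n;
    while count > 1 {
        cost += (count / 2) * (2 * size); // one level of pairwise merges
        size *= 2;
        count /= 2;
    }
    cost
}

fn main() {
    // With 1024 batches of size 1, the sequential policy does
    // sum_{i=2}^{1024} i = 524799 units of work, while pairwise
    // doubling does 1024 * log2(1024) = 10240.
    println!("sequential: {}", sequential_cost(1024, 1));
    println!("doubling:   {}", doubling_cost(1024, 1));
}
```

In other words, the sequential policy is O(n²k) total merge work over n batches while the doubling policy is O(nk log n), which is why the least-recently-added variants look attractive despite the cache-locality caveat noted above.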