sandboxset: thread-safe sandbox pool (issue #217) #425
AmiBuch wants to merge 11 commits into open-lambda:main
…ace and microservice like architecture
tylerharter
left a comment
I think I forgot to click submit on the feedback, sorry!
tylerharter
left a comment
Good work! Getting closer.
// Callers just ask for a sandbox and don't worry about whether it is
// freshly created or recycled from a previous request.
//
// Sandbox lifecycle inside a SandboxSet:
This is the API; its users don't need to know about internals.
The usage is good because that's what API users need. The internals in the flow diagram are not something users of your code should worry about. Those are internal details.
go/worker/sandboxset/sandboxset.go
Outdated
set *sandboxSetImpl
Broken bool // public: caller sets true if request failed; Put will destroy instead of recycle
inUse bool // true when checked out; false when idle in pool
destroyed atomic.Bool // set atomically after Destroy(); guards against concurrent Close + Put
No atomic here; the mutex of the containing sandbox set should be used to protect this field.
Instead of a destroyed flag, perhaps sb can just be nil?
go/worker/sandboxset/sandboxset.go
Outdated
// sandboxSetImpl is the private concrete type returned by New.
// All mutable state is guarded by mu.
type sandboxSetImpl struct {
	mu sync.Mutex
Pattern I like:
things not protected by the lock
mu sync.Mutex // protects below members
things protected by the lock
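The layout pattern above can be sketched as follows. This is only an illustration: the `pool` type and its `name`, `idle`, and `closed` fields are hypothetical stand-ins, not the actual members of sandboxSetImpl.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical stand-in for sandboxSetImpl.
type pool struct {
	// Not protected by the lock: set once at construction, read-only afterwards.
	name string

	mu sync.Mutex // protects below members

	idle   int
	closed bool
}

// addIdle increments the idle count while holding the lock and returns the new value.
func (p *pool) addIdle() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.idle++
	return p.idle
}

func main() {
	p := &pool{name: "demo"}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.addIdle()
		}()
	}
	wg.Wait()
	fmt.Println(p.name, p.addIdle()) // prints "demo 101"
}
```

Grouping fields this way lets a reader see at a glance which members need `mu` held before they are touched.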
// claimIdle returns an idle ref from the pool, marking it inUse.
// Caller must hold s.mu.
func (s *sandboxSetImpl) claimIdle() *SandboxRef {
Recommendation: 3 paths
- sandbox ref with healthy sandbox
- sandbox ref with nil sandbox (because there was old one that got deleted)
- sandbox ref with nil sandbox (because we're adding to the pool)
When a sandbox ref comes back with nil sandbox, GetOrCreateUnpaused can be responsible for giving it a new sandbox.
I was imagining those 3 paths in claimIdle. And I think claimIdle can do the locking. We can condense claimIdle, tryClaimIdle, and appendNilRef to just this function.
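One way the three paths might fit into a single claimIdle that takes the lock itself is sketched below. The `Sandbox` type and the `refs` slice are assumptions made for illustration; the real field names and types in the PR may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// Sandbox is a hypothetical stand-in for the real sandbox type.
type Sandbox struct{ id int }

// SandboxRef holds a sandbox; sb == nil means the ref has no sandbox
// (the old one was destroyed, or the ref was just added to the pool).
type SandboxRef struct {
	sb    *Sandbox
	inUse bool
}

type sandboxSetImpl struct {
	mu sync.Mutex // protects below members

	refs []*SandboxRef // assumed pool representation
}

// claimIdle does its own locking and always returns a ref marked inUse.
func (s *sandboxSetImpl) claimIdle() *SandboxRef {
	s.mu.Lock()
	defer s.mu.Unlock()

	var empty *SandboxRef
	for _, ref := range s.refs {
		if ref.inUse {
			continue
		}
		if ref.sb != nil {
			// Path 1: idle ref with a healthy sandbox.
			ref.inUse = true
			return ref
		}
		if empty == nil {
			empty = ref
		}
	}
	if empty != nil {
		// Path 2: idle ref whose old sandbox was deleted.
		empty.inUse = true
		return empty
	}
	// Path 3: nothing idle, so add a new ref with a nil sandbox to the pool.
	ref := &SandboxRef{inUse: true}
	s.refs = append(s.refs, ref)
	return ref
}

func main() {
	s := &sandboxSetImpl{refs: []*SandboxRef{{}, {sb: &Sandbox{id: 7}}}}
	a := s.claimIdle() // path 1
	b := s.claimIdle() // path 2
	c := s.claimIdle() // path 3
	fmt.Println(a.sb != nil, b.sb == nil, c.sb == nil, len(s.refs)) // prints "true true true 3"
}
```

In this sketch the caller (for example GetOrCreateUnpaused) is responsible for giving a nil-sandbox ref a new sandbox after claimIdle returns.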
tylerharter
left a comment
API is looking great! Let's iterate on the implementation a bit more.
	return r.set.Put(r.sb)
}

// Destroy removes the sandbox from its parent set and destroys it.
I still don't think we need it. Please justify why, or remove.
	closed bool
}

func newSandboxSet(cfg *Config) (*sandboxSetImpl, error) {
We should return errors for things we expect to happen during normal operation. For bugs, we actually want to just crash. So newSandboxSet should not return an error. You can panic if you get bad args, or just let the panic happen by itself later when there is a nil dereference.
// New creates a SandboxSet from cfg. Returns an error if any of
// Pool, CodeDir, or ScratchDirs are missing.
func New(cfg *Config) (SandboxSet, error) {
No error, should always succeed (or crash).
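A minimal sketch of a constructor that always succeeds or crashes, as suggested above. The `Config` here has a single `CodeDir` string field purely for illustration; the real Config also carries Pool and ScratchDirs, whose types are not shown in this thread, and the real New returns the SandboxSet interface.

```go
package main

import "fmt"

// Config is a simplified, hypothetical version of the real Config.
type Config struct {
	CodeDir string
}

// sandboxSetImpl is a stand-in for the private concrete type.
type sandboxSetImpl struct {
	cfg *Config
}

// New creates a set from cfg. Bad arguments are treated as caller bugs,
// so New panics instead of returning an error.
func New(cfg *Config) *sandboxSetImpl {
	if cfg == nil || cfg.CodeDir == "" {
		panic("sandboxset.New: Config.CodeDir is required")
	}
	return &sandboxSetImpl{cfg: cfg}
}

func main() {
	s := New(&Config{CodeDir: "/tmp/code"})
	fmt.Println(s.cfg.CodeDir) // prints "/tmp/code"
}
```

Callers then no longer need an error branch at every construction site; a misconfigured set fails loudly at startup.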
}

// Put returns the sandbox to its parent set, or destroys it if Broken is true.
func (r *SandboxRef) Put() error {
I like this being a very simple wrapper around the private put method in sandboxSetImpl. The code is easier to understand if locking is a detail internal to sandboxSetImpl, not split across multiple classes.
The only purpose/logic of this function: identifying which sandboxset should do the put.
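The wrapper described above might look like the sketch below, where SandboxRef.Put only picks the parent set and all locking stays inside sandboxSetImpl. The body of `put` is an assumption: the real code would pause or destroy the sandbox where the comments say so.

```go
package main

import (
	"fmt"
	"sync"
)

// Sandbox is a hypothetical stand-in for the real sandbox type.
type Sandbox struct{ id int }

type SandboxRef struct {
	set    *sandboxSetImpl
	sb     *Sandbox
	Broken bool
	inUse  bool
}

// Put only identifies which set should do the put.
func (r *SandboxRef) Put() error {
	return r.set.put(r)
}

type sandboxSetImpl struct {
	mu sync.Mutex // protects the inUse flag of every ref in this set
}

// put hands the ref back to the pool. While inUse, the ref is reserved
// by the caller, so only flipping inUse needs the lock.
func (s *sandboxSetImpl) put(r *SandboxRef) error {
	if r.Broken {
		// Real code would destroy the sandbox here (assumption).
		r.sb = nil
		r.Broken = false
	}
	// Real code would pause a healthy sandbox here (assumption).
	s.mu.Lock()
	r.inUse = false
	s.mu.Unlock()
	return nil
}

func main() {
	s := &sandboxSetImpl{}
	r := &SandboxRef{set: s, sb: &Sandbox{id: 1}, inUse: true, Broken: true}
	err := r.Put()
	fmt.Println(err, r.sb == nil, r.inUse) // prints "<nil> true false"
}
```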
// Fast path: claim an idle ref with an existing sandbox via tryClaimIdle, then Unpause outside the lock.
// Slow path (nil ref): the claimed ref has no sandbox (either destroyed or newly added). A new sandbox is created outside the lock and assigned to the ref.
func (s *sandboxSetImpl) GetOrCreateUnpaused() (*SandboxRef, error) {
	for {
Let's not do retry for now. lambdaInstance.go already has retry logic for when Unpause fails. When we use sandbox set at a higher layer, we can make a coherent decision about the one place to do retry (in sandbox set, or in the user of sandbox set).
			nilRef = ref
		}
	}
	if nilRef != nil {
The naming is confusing me here. Why wouldn't a nilRef be nil?
	}
}

s.mu.Lock()
Do we need locking? I was imagining the design would be that when not inUse, any thread might want to use it, so locking would be needed. But when inUse, it would effectively be reserved by a single thread, so no more locking until returned to not inUse.
// GetOrCreateUnpaused implements SandboxSet.
// Fast path: claim an idle ref with an existing sandbox via tryClaimIdle, then Unpause outside the lock.
// Slow path (nil ref): the claimed ref has no sandbox (either destroyed or newly added). A new sandbox is created outside the lock and assigned to the ref.
func (s *sandboxSetImpl) GetOrCreateUnpaused() (*SandboxRef, error) {
Really, the logic should break into two steps:
- get a SandboxRef (with or without a Sandbox)
- make sure that SandboxRef has a healthy, unpaused Sandbox inside it
Step 1 requires locking, step 2 does not.
GetOrCreateUnpaused could be very short, calling functions corresponding to the two steps. Or, GetOrCreateUnpaused could call step 1, then do step 2 inline.
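The two-step structure could be sketched roughly as follows. The `create` factory, the `ensureUnpaused` helper name, and the simplified claimIdle (which takes any idle ref) are hypothetical; the sketch also assumes a newly created sandbox starts unpaused, and it does no retry.

```go
package main

import (
	"fmt"
	"sync"
)

// Sandbox is a hypothetical stand-in for the real sandbox type.
type Sandbox struct {
	id     int
	paused bool
}

func (sb *Sandbox) Unpause() error { sb.paused = false; return nil }

type SandboxRef struct {
	sb    *Sandbox
	inUse bool
}

type sandboxSetImpl struct {
	create func() (*Sandbox, error) // hypothetical factory, set at construction

	mu sync.Mutex // protects below members

	refs []*SandboxRef
}

// Step 1 (locked): get a SandboxRef, with or without a Sandbox.
func (s *sandboxSetImpl) claimIdle() *SandboxRef {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, ref := range s.refs {
		if !ref.inUse {
			ref.inUse = true
			return ref
		}
	}
	ref := &SandboxRef{inUse: true}
	s.refs = append(s.refs, ref)
	return ref
}

// Step 2 (no lock): the ref is reserved by this caller, so it can be
// filled or unpaused without holding s.mu.
func (s *sandboxSetImpl) ensureUnpaused(ref *SandboxRef) error {
	if ref.sb == nil {
		sb, err := s.create()
		if err != nil {
			return err
		}
		ref.sb = sb
		return nil
	}
	return ref.sb.Unpause()
}

func (s *sandboxSetImpl) GetOrCreateUnpaused() (*SandboxRef, error) {
	ref := s.claimIdle()
	if err := s.ensureUnpaused(ref); err != nil {
		// No retry: drop the sandbox (real code would destroy it) and release the ref.
		ref.sb = nil
		s.mu.Lock()
		ref.inUse = false
		s.mu.Unlock()
		return nil, err
	}
	return ref, nil
}

func main() {
	n := 0
	s := &sandboxSetImpl{create: func() (*Sandbox, error) { n++; return &Sandbox{id: n}, nil }}
	ref, err := s.GetOrCreateUnpaused()
	fmt.Println(ref.sb.id, err) // prints "1 <nil>"
}
```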
API
GetOrCreateUnpaused returns a *SandboxRef, a handle that wraps the sandbox with a back-pointer to its parent set. The caller uses ref.Sandbox() to access the container. When done, set ref.Broken = true if the sandbox is unhealthy, then always call ref.Put(): it recycles the sandbox if healthy or destroys it if broken.
For explicit teardown with a reason, ref.Destroy("reason") is also available.
SandboxRef
SandboxSet interface
Config
Flow
File structure
Dependencies
sandboxset is a thin layer on top of sandbox — no new abstractions:
Testing
Unit tests (3 tests): Use MockSandboxPool from sandbox/mock.go. Fast, no Docker needed.
Integration tests (4 tests): Use real DockerPool with ol-min image. Gated with //go:build integration. Verify real containers are created, paused, unpaused, and destroyed.
Next steps
- LambdaInstance goroutines with a SandboxSet
- SandboxSet as the node in the zygote tree