Skip to content

sandboxset: thread-safe sandbox pool (issue #217)#425

Open
AmiBuch wants to merge 11 commits intoopen-lambda:mainfrom
AmiBuch:main
Open

sandboxset: thread-safe sandbox pool (issue #217)#425
AmiBuch wants to merge 11 commits intoopen-lambda:mainfrom
AmiBuch:main

Conversation

@AmiBuch
Copy link
Copy Markdown

@AmiBuch AmiBuch commented Mar 3, 2026

sandboxset: thread-safe sandbox pool (issue #217)

API

GetOrCreateUnpaused returns a *SandboxRef — a handle that wraps the
sandbox with a back-pointer to its parent set. The caller uses
ref.Sandbox() to access the container. When done, set ref.Broken = true
if the sandbox is unhealthy, then always call ref.Put() — it recycles
the sandbox if healthy or destroys it if broken.

set, _ := sandboxset.New(&sandboxset.Config{
    Pool:        myPool,
    CodeDir:     "/path/to/lambda",
    ScratchDirs: myScratchDirs,
})

ref, _ := set.GetOrCreateUnpaused()   // create or reuse a sandbox
sb := ref.Sandbox()                    // access the underlying container
// ... handle request using sb.Client() ...

if broken {
    ref.Broken = true                  // mark unhealthy before returning
}
ref.Put()                              // recycles if healthy; destroys if Broken

set.Close()                            // tear down everything

For explicit teardown with a reason, ref.Destroy("reason") is also available.

SandboxRef

type SandboxRef struct {
	set    *sandboxSetImpl
	Broken bool
	sb    sandbox.Sandbox
	inUse bool
}

func (r *SandboxRef) Sandbox() sandbox.Sandbox    // access the container
func (r *SandboxRef) Put() error                   // return to pool (or destroy if Broken)
func (r *SandboxRef) Destroy(reason string) error  // explicitly destroy and remove

SandboxSet interface

type SandboxSet interface {
    GetOrCreateUnpaused() (*SandboxRef, error)
    Close() error
}

Config

type Config struct {
    Pool        sandbox.SandboxPool  // creates/destroys sandboxes
    Parent      SandboxSet           // optional parent set to fork from
    IsLeaf      bool                 // marks sandboxes as non-forkable
    CodeDir     string               // Lambda handler code directory
    Meta        *sandbox.SandboxMeta // runtime config (memory, packages, etc.)
    ScratchDirs *common.DirMaker     // creates a writable dir per new sandbox
}

Flow

     [created]
         |
         v
     [paused]  <---+
         |         |
         v         |
     [in-use]  ----+  (Put, Broken=false)
         |
         v
   [destroyed]     (Put with Broken=true / Destroy / Close / error)

File structure

go/worker/sandboxset/
    api.go           — SandboxSet interface, Config, New()
    sandboxset.go    — SandboxRef, sandboxSetImpl, all methods
    tests/
        sandboxset_test.go              — Unit tests (MockSandboxPool)
        sandboxset_integration_test.go  — Integration tests (real DockerPool)

Dependencies

sandboxset is a thin layer on top of sandbox — no new abstractions:

sandboxset
    │
    ├── sandbox.Sandbox      (3 of 11 methods used: Pause, Unpause, Destroy)
    ├── sandbox.SandboxPool  (1 method used: Create)
    ├── sandbox.SandboxMeta  (passed through to Create, never read)
    └── common.DirMaker      (1 method used: Make)

Testing

  • Unit tests (3 tests): Use MockSandboxPool from sandbox/mock.go.
    Fast, no Docker needed.

    cd go && go test ./worker/sandboxset/tests/ -v -race -count=1
    
  • Integration tests (4 tests): Use real DockerPool with ol-min
    image. Gated with //go:build integration. Verify real containers are
    created, paused, unpaused, and destroyed.

    cd go && go test ./worker/sandboxset/tests/ -v -tags=integration -count=1
    

Next steps

  • Step 2: Replace LambdaInstance goroutines with a SandboxSet
  • Step 3: Use SandboxSet as the node in the zygote tree

Copy link
Copy Markdown
Member

@tylerharter tylerharter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I forgot to click submit on the feedback, sorry!

Copy link
Copy Markdown
Member

@tylerharter tylerharter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! Getting closer.

// Callers just ask for a sandbox and don't worry about whether it is
// freshly created or recycled from a previous request.
//
// Sandbox lifecycle inside a SandboxSet:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the API, they don't need to know about internals.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage is good because that's what API users need. The internals in the flow diagram are not something users of your code should worry about. Those are internal details.

Copy link
Copy Markdown
Member

@tylerharter tylerharter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

set *sandboxSetImpl
Broken bool // public: caller sets true if request failed; Put will destroy instead of recycle
inUse bool // true when checked out; false when idle in pool
destroyed atomic.Bool // set atomically after Destroy(); guards against concurrent Close + Put
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no atomic, then mutex of the containing sandbox set should be used to protect this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of destroyed, perhaps sb can just be nil?

// sandboxSetImpl is the private concrete type returned by New.
// All mutable state is guarded by mu.
type sandboxSetImpl struct {
mu sync.Mutex
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pattern I like:

things not protected by the lock
mu synt.Mutex // protects below members
things protected by the lock


// claimIdle returns an idle ref from the pool, marking it inUse.
// Caller must hold s.mu.
func (s *sandboxSetImpl) claimIdle() *SandboxRef {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommendation: 3 paths

  1. sandbox ref with healthy sandbox
  2. sandbox ref with nil sandbox (because there was old one that got deleted)
  3. sandbox ref with nil sandbox (because we're adding to the pool)

When a sandbox ref comes back with nil sandbox, GetOrCreateUnpaused can be responsible for giving it a new sandbox.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was imagining those 3 paths in claimIdle. And I think claimIdle can do the locking. We can condense claimIdle, tryClaimIdle, and appendNilRef to just this function.

@AmiBuch AmiBuch marked this pull request as ready for review March 24, 2026 19:17
Copy link
Copy Markdown
Member

@tylerharter tylerharter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API is looking great! Lets iterate on the implementation a bit more.

// Callers just ask for a sandbox and don't worry about whether it is
// freshly created or recycled from a previous request.
//
// Sandbox lifecycle inside a SandboxSet:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage is good because that's what API users need. The internals in the flow diagram are not something users of your code should worry about. Those are internal details.

return r.set.Put(r.sb)
}

// Destroy removes the sandbox from its parent set and destroys it.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think we need it. Please justify why, or remove.

closed bool
}

func newSandboxSet(cfg *Config) (*sandboxSetImpl, error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should return errors for things we expect to happen during normal operation. For bugs, we actually want to just crash. So newSandboxSet should not return an error. You can panic if you get bad args. Or, just let the panic happen itself later when there is a nil deref.


// New creates a SandboxSet from cfg. Returns an error if any of
// Pool, CodeDir, or ScratchDirs are missing.
func New(cfg *Config) (SandboxSet, error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No error, should always succeed (or crash).

}

// Put returns the sandbox to its parent set, or destroys it if Broken is true.
func (r *SandboxRef) Put() error {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this being a very simple wrapper around the private put method in sandboxSetImpl. The code is easier to understand if locking is a detail internal to sandboxSetImpl, not split across multiple classes.

The only purpose/logic of this function: identifying which sandboxset should do the put.

// Fast path: claim an idle ref with an existing sandbox via tryClaimIdle, then Unpause outside the lock.
// Slow path (nil ref): the claimed ref has no sandbox (either destroyed or newly added). A new sandbox is created outside the lock and assigned to the ref.
func (s *sandboxSetImpl) GetOrCreateUnpaused() (*SandboxRef, error) {
for {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not do retry for now. lambdaInstance.go already has retry logic for if Unpause fails. When we use sandbox set at a higher layer, then we can make a coherent decision about the 1 place to do retry (in sandbox set, or in the user of sandbox set).

nilRef = ref
}
}
if nilRef != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming is confusing me here. Why wouldn't a nilRef be nil?


// claimIdle returns an idle ref from the pool, marking it inUse.
// Caller must hold s.mu.
func (s *sandboxSetImpl) claimIdle() *SandboxRef {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was imagining those 3 paths in claimIdle. And I think claimIdle can do the locking. We can condense claimIdle, tryClaimIdle, and appendNilRef to just this function.

}
}

s.mu.Lock()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need locking? I was imagining the design would be that when not inUse, any thread might want to use it, so locking would be needed. But when inUse, it would effectively be reserved by a single thread, so no more locking until returned to not inUse.

// GetOrCreateUnpaused implements SandboxSet.
// Fast path: claim an idle ref with an existing sandbox via tryClaimIdle, then Unpause outside the lock.
// Slow path (nil ref): the claimed ref has no sandbox (either destroyed or newly added). A new sandbox is created outside the lock and assigned to the ref.
func (s *sandboxSetImpl) GetOrCreateUnpaused() (*SandboxRef, error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really, the logic should break into two steps:

  1. get a SandboxRef (with our without a Sandbox)
  2. make sure that SandboxRef has a healthy, unpaused Sandbox inside it

Step 1 requires locking, step 2 does not.

GetOrCreateUnpaused could be very short, calling functions corresponding to the two steps. Or, GetOrCreateUnpaused could call step 1, then do step 2 inline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants