Skip to content

Limit number of concurrent database connections used by prebuilds #20606

@ssncferreira

Description

@ssncferreira

Problem

The prebuilds reconciliation loop uses an advisory lock to ensure only one replica executes reconciliation at a time. During reconciliation, we spawn one goroutine per preset to calculate and execute reconciliation actions.

While most goroutines result in 0 reconciliation actions (no database operations), we can still end up with many concurrent database write operations (creates/deletes of prebuilds). This means we can easily exceed the maximum number of open concurrent database connections per replica (10 connections). This can exhaust the database connection pool, causing both these goroutines and other parts of the system (such as API endpoints) to wait for available connections.

Proposed Solution

1) Filter presets earlier (optimization)

Update GetTemplatePresetsWithPrebuilds to filter out inactive/deleted template versions at the database level. Only include inactive/deleted versions if they have prebuilds that still need reconciliation (e.g., running workspaces that need to be deleted). This would significantly reduce the number of presets returned and consequently the number of goroutines spawned.

For reference, a preset in this context is a tuple (organization, template, version, preset name). On dogfood we currently have 3121 presets returned by GetTemplatePresetsWithPrebuilds, the majority of which correspond to old presets that no longer need reconciliation. Once a version becomes inactive, in a normal scenario, it should take 1 reconciliation cycle to clean up the prebuilds for that version.

2) Limit concurrent write operations

Limit the number of concurrent write operations to 2-3 connections for reconciliation to avoid impacting other parts of the system that require database connections.

Option A: Use errgroup concurrency limiting

Add eg.SetLimit(5) to limit concurrent goroutines. This is simple (one line of code) and allows reasonable parallelism, but limits all goroutines, even those that don't need database connections, most goroutines with no actions will hold a slot but do no database operation. However, this impact should be significantly reduced by the filtering optimization above.

Option B: Separate worker pool for writes

Keep goroutines for calculating actions (fast, in-memory), but funnel actual write operations through a dedicated separate worker pool of 2-3 workers. This provides maximum parallelism for the calculation phase and precise control over database connection usage, but adds complexity with worker pool coordination.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions