-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Description
Problem
The prebuilds reconciliation loop uses an advisory lock to ensure only one replica executes reconciliation at a time. During reconciliation, we spawn one goroutine per preset to calculate and execute reconciliation actions.
While most goroutines result in 0 reconciliation actions (no database operations), we can still end up with many concurrent database write operations (creates/deletes of prebuilds). This means we can easily exceed the maximum number of open concurrent database connections per replica (10 connections). This can exhaust the database connection pool, causing both these goroutines and other parts of the system (such as API endpoints) to wait for available connections.
Proposed Solution
1) Filter presets earlier (optimization)
Update GetTemplatePresetsWithPrebuilds to filter out inactive/deleted template versions at the database level. Only include inactive/deleted versions if they have prebuilds that still need reconciliation (e.g., running workspaces that need to be deleted). This would significantly reduce the number of presets returned and consequently the number of goroutines spawned.
For reference, a preset in this context is a tuple (organization, template, version, preset name). On dogfood we currently have 3121 presets returned by GetTemplatePresetsWithPrebuilds, the majority of which correspond to old presets that no longer need reconciliation. Once a version becomes inactive, in a normal scenario, it should take 1 reconciliation cycle to clean up the prebuilds for that version.
2) Limit concurrent write operations
Limit the number of concurrent write operations to 2-3 connections for reconciliation to avoid impacting other parts of the system that require database connections.
Option A: Use errgroup concurrency limiting
Add eg.SetLimit(5) to limit concurrent goroutines. This is simple (one line of code) and allows reasonable parallelism, but limits all goroutines, even those that don't need database connections, most goroutines with no actions will hold a slot but do no database operation. However, this impact should be significantly reduced by the filtering optimization above.
Option B: Separate worker pool for writes
Keep goroutines for calculating actions (fast, in-memory), but funnel actual write operations through a dedicated separate worker pool of 2-3 workers. This provides maximum parallelism for the calculation phase and precise control over database connection usage, but adds complexity with worker pool coordination.