feat: batch connection logs to avoid DB lock contention #23727
…nd, never drop logs

- Moved batcher logic from coderd/connectionlogbatcher into the existing enterprise/coderd/connectionlog package, replacing the separate dbBackend and batchDBBackend with a single DBBackend that batches writes.
- Changed Add() to Upsert(), which blocks when the buffer is full instead of dropping entries, ensuring no connection logs are ever lost.
- Increased the default batch size from 500 to 1000.
- Added Close() to ConnectionLogger that closes all io.Closer backends.
- Wired shutdown in enterprise/coderd/coderd.go to flush on close.
- Deleted the coderd/connectionlogbatcher/ package.
…ead of flush

Replace the batch slice with a map keyed by the conflict columns, so deduplication happens incrementally as items arrive rather than in a separate pass during flush. Null-connection-ID entries (web events) still go into a separate slice, since NULL != NULL in SQL unique constraints.
Each new connection generates a fresh UUID, so the only duplicate for the same conflict key is a connect/disconnect pair for the same session. Reconnects always use a new connection ID.
…on tests

Internal tests (connectionlog_internal_test.go):
- TestAddToBatch: pure unit tests for the dedup logic (connect/disconnect preference, same-status-keeps-later, null connection IDs, mixed).
- TestDBBackend: mock DB tests using Close() for deterministic flush (dedup, null IDs, different connection IDs, close-flushes).

Integration tests (connectionlog_test.go):
- TestDBBackendIntegration: real PostgreSQL tests for single connect, connect+disconnect across batches, connect+disconnect in the same batch, multiple independent connections, null-connection-ID web events, and close-flushes-to-DB.

Also fixes the BatchUpsertConnectionLogs SQL to use NULLIF for nullable columns (code, user_agent, user_id, slug_or_port, connection_id, disconnect_reason, disconnect_time) so Go zero values are stored as NULL rather than as sentinel values.
- Rename NewConnectionLogger to New (package is already connectionlog).
- Rename DBBackend to DBBatcher.
- Replace 'for { select { case <-ctx.Done() } }' with
'for ctx.Err() == nil { select }' and move drain logic outside the
loop, eliminating the goto statement.
- Remove null converter functions (nullStringToString, nullInt32ToInt32,
nullUUIDToUUID) — Go zero values for sql.Null* types already produce
the right values, and the SQL uses NULLIF to convert them to NULL.
- Fix test names to match cryptokeys pattern (Test_addToBatch,
Test_batcherFlush).
- Move test helpers to bottom of file.
- Fix lint: unused receivers, unused types, fmt.Sprintf without args.
- Add batchParamsMatcher that checks both mustContainIDs and
mustNotContainIDs so dedup tests verify the disconnect event won
(not just that only 1 event was flushed).
- CloseFlushesMultipleEvents test uses n=2 events.
- Add BatchUpsertConnectionLogs to dbauthz MethodTestSuite.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 739ca43737
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
1. Upsert fails deterministically after Close: check b.ctx.Err() before attempting the channel send, so post-cancel calls never silently succeed due to channel capacity.
2. NULLIF(code, 0) replaced with a code_valid boolean array: the old NULLIF converted legitimate zero exit codes into NULL, making successful disconnections indistinguishable from missing codes. The SQL now uses a subquery with a CASE on a validity flag, so nullability follows the Go sql.NullInt32.Valid field.
3. Batch retained on flush failure: previously the batch was cleared before the DB call, so any transient error (a timeout or connection hiccup) silently dropped up to maxBatchSize logs. Now the batch is only cleared after a successful write, so failed flushes are retried on the next cycle.
…nded growth

When the DB is persistently unavailable, failed flushes retain the batch for retry while new items keep arriving. Cap the in-memory batch at maxBatchSize and reject new entries when it is full, preventing unbounded memory growth.
@codex review
💡 Codex Review
Reviewed commit: 577764438e
```go
if item.ConnectionStatus == database.ConnectionStatusDisconnected &&
	existing.ConnectionStatus != database.ConnectionStatusDisconnected {
	b.dedupedBatch[key] = item
} else if item.Time.After(existing.Time) {
	b.dedupedBatch[key] = item
}
```
Preserve connect timestamp when deduplicating disconnect events
When a connect and disconnect for the same (connection_id, workspace_id, agent_name) land in one batch window, this branch replaces the stored connect event with the disconnect event. flush() later uses the retained item's Time for connect_time, so these rows record the disconnect timestamp as the start time. That corrupts duration/ordering semantics for short-lived sessions that complete before the next flush.
```go
if b.batchLen() >= b.maxBatchSize {
	b.log.Warn(b.ctx, "connection log batch full, rejecting entry",
		slog.F("batch_size", b.batchLen()),
	)
	return
}
```
Avoid dropping queued logs when retained batch is full
This guard drops incoming events once batchLen() reaches maxBatchSize. In the same codepath, flush() keeps the current batch on write errors for retry, so after one failed full flush all subsequent logs are rejected here even though Upsert() already accepted them into the channel and returned nil. During transient DB issues this causes silent audit-log loss instead of backpressure or explicit failure propagation.
… revert retain-on-failure

Two fixes:
1. When a connect and a disconnect for the same session land in one batch window, the dedup previously overwrote the connect event entirely, causing flush() to use the disconnect timestamp as connect_time. Introduced a batchEntry wrapper with explicit connectTime and disconnectTime fields so merged entries preserve the original session start time.
2. Reverted the 'retain batch on failure' behavior. When a flush fails, the full batch stays in memory and blocks all new entries from being processed (addToBatch keeps appending but nothing clears). This caused cascading silent audit-log loss. The batch is now always cleared after flush, with the dropped count logged at error level.

Also removed the addToBatch capacity cap, which was a band-aid for the retain-on-failure issue.
Running 30k connections was generating heavy lock contention in the DB.
fixes https://github.com/coder/scaletest/issues/106