feat: batch connection logs to avoid DB lock contention#23727

Draft
sreya wants to merge 11 commits into main from connection-logs

Conversation


@sreya sreya commented Mar 27, 2026

Running 30k connections generated heavy lock contention in the DB.

fixes https://github.com/coder/scaletest/issues/106

sreya added 6 commits March 26, 2026 23:11
…nd, never drop logs

- Moved batcher logic from coderd/connectionlogbatcher into the existing
  enterprise/coderd/connectionlog package, replacing the separate dbBackend
  and batchDBBackend with a single DBBackend that batches writes.
- Changed Add() to Upsert(), which blocks when the buffer is full
  instead of dropping entries, ensuring no connection logs are ever lost.
- Increased default batch size from 500 to 1000.
- Added Close() to ConnectionLogger that closes all io.Closer backends.
- Wired shutdown in enterprise/coderd/coderd.go to flush on close.
- Deleted coderd/connectionlogbatcher/ package.
…ead of flush

Replace the batch slice with a map keyed by conflict columns so
deduplication happens incrementally as items arrive rather than in a
separate pass during flush. Null-connection-ID entries (web events)
still go into a separate slice since NULL != NULL in SQL unique
constraints.
Each new connection generates a fresh UUID, so the only duplicate
for the same conflict key is a connect/disconnect pair for the same
session. Reconnects always use a new connection ID.
…on tests

Internal tests (connectionlog_internal_test.go):
- TestAddToBatch: pure unit tests for dedup logic (connect/disconnect
  preference, same-status-keeps-later, null connection IDs, mixed).
- TestDBBackend: mock DB tests using Close() for deterministic flush
  (dedup, null IDs, different connection IDs, close-flushes).

Integration tests (connectionlog_test.go):
- TestDBBackendIntegration: real PostgreSQL tests for single connect,
  connect+disconnect across batches, connect+disconnect in same batch,
  multiple independent connections, null connection ID web events, and
  close-flushes-to-DB.

Also fixes BatchUpsertConnectionLogs SQL to use NULLIF for nullable
columns (code, user_agent, user_id, slug_or_port, connection_id,
disconnect_reason, disconnect_time) so Go zero values are stored as
NULL rather than sentinel values.
sreya added 2 commits March 27, 2026 19:28
- Rename NewConnectionLogger to New (package is already connectionlog).
- Rename DBBackend to DBBatcher.
- Replace 'for { select { case <-ctx.Done() } }' with
  'for ctx.Err() == nil { select }' and move drain logic outside the
  loop, eliminating the goto statement.
- Remove null converter functions (nullStringToString, nullInt32ToInt32,
  nullUUIDToUUID) — Go zero values for sql.Null* types already produce
  the right values, and the SQL uses NULLIF to convert them to NULL.
- Fix test names to match cryptokeys pattern (Test_addToBatch,
  Test_batcherFlush).
- Move test helpers to bottom of file.
- Fix lint: unused receivers, unused types, fmt.Sprintf without args.
- Add batchParamsMatcher that checks both mustContainIDs and
  mustNotContainIDs so dedup tests verify the disconnect event won
  (not just that only 1 event was flushed).
- CloseFlushesMultipleEvents test uses n=2 events.
- Add BatchUpsertConnectionLogs to dbauthz MethodTestSuite.

sreya commented Mar 27, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 739ca43737

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

sreya added 2 commits March 27, 2026 20:25
1. Upsert fails deterministically after Close: check b.ctx.Err()
   before attempting the channel send so post-cancel calls never
   silently succeed due to channel capacity.

2. NULLIF(code, 0) replaced with code_valid boolean array: the old
   NULLIF converted legitimate zero exit codes into NULL, making
   successful disconnections indistinguishable from missing codes.
   The SQL now uses a subquery with a CASE on a validity flag so
   nullability follows the Go sql.NullInt32.Valid field.

3. Batch retained on flush failure: previously the batch was cleared
   before the DB call, so any transient error (timeout, connection
   hiccup) silently dropped up to maxBatchSize logs. Now the batch
   is only cleared after a successful write, so failed flushes are
   retried on the next cycle.
…nded growth

When the DB is persistently unavailable, failed flushes retain the
batch for retry but new items keep arriving. Cap the in-memory batch
at maxBatchSize and reject new entries when full, preventing
unbounded memory growth.
sreya force-pushed the connection-logs branch from 9c26f9c to 5777644 on March 27, 2026 at 20:46

sreya commented Mar 27, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 577764438e


Comment on lines +229 to +233
if item.ConnectionStatus == database.ConnectionStatusDisconnected &&
	existing.ConnectionStatus != database.ConnectionStatusDisconnected {
	b.dedupedBatch[key] = item
} else if item.Time.After(existing.Time) {
	b.dedupedBatch[key] = item
}


P1: Preserve connect timestamp when deduplicating disconnect events

When a connect and disconnect for the same (connection_id, workspace_id, agent_name) land in one batch window, this branch replaces the stored connect event with the disconnect event. flush() later uses the retained item's Time for connect_time, so these rows record the disconnect timestamp as the start time. That corrupts duration/ordering semantics for short-lived sessions that complete before the next flush.


Comment on lines +207 to +211
if b.batchLen() >= b.maxBatchSize {
	b.log.Warn(b.ctx, "connection log batch full, rejecting entry",
		slog.F("batch_size", b.batchLen()),
	)
	return
}


P1: Avoid dropping queued logs when retained batch is full

This guard drops incoming events once batchLen() reaches maxBatchSize. In the same codepath, flush() keeps the current batch on write errors for retry, so after one failed full flush all subsequent logs are rejected here even though Upsert() already accepted them into the channel and returned nil. During transient DB issues this causes silent audit-log loss instead of backpressure or explicit failure propagation.


… revert retain-on-failure

Two fixes:

1. When a connect and disconnect for the same session land in one
   batch window, the dedup previously overwrote the connect event
   entirely, causing flush() to use the disconnect timestamp as
   connect_time. Introduced batchEntry wrapper with explicit
   connectTime and disconnectTime fields so merged entries preserve
   the original session start time.

2. Reverted the 'retain batch on failure' behavior. When a flush
   fails, the full batch stays in memory and blocks all new entries
   from being processed (addToBatch keeps appending but nothing
   clears). This caused cascading silent audit-log loss. The batch
   is now always cleared after flush, with the dropped count logged
   at error level.

Also removed the addToBatch capacity cap which was a band-aid for
the retain-on-failure issue.