feat: batch connection logs to avoid DB lock contention#23727

Draft
sreya wants to merge 11 commits into main from connection-logs

Conversation


@sreya sreya commented Mar 27, 2026

Running 30k connections generated heavy lock contention in the DB.

fixes https://github.com/coder/scaletest/issues/106

sreya added 6 commits March 26, 2026 23:11
…nd, never drop logs

- Moved batcher logic from coderd/connectionlogbatcher into the existing
  enterprise/coderd/connectionlog package, replacing the separate dbBackend
  and batchDBBackend with a single DBBackend that batches writes.
- Changed Add() to Upsert(), which blocks when the buffer is full
  instead of dropping entries, ensuring no connection logs are ever lost.
- Increased default batch size from 500 to 1000.
- Added Close() to ConnectionLogger that closes all io.Closer backends.
- Wired shutdown in enterprise/coderd/coderd.go to flush on close.
- Deleted coderd/connectionlogbatcher/ package.
…ead of flush

Replace the batch slice with a map keyed by conflict columns so
deduplication happens incrementally as items arrive rather than in a
separate pass during flush. Null-connection-ID entries (web events)
still go into a separate slice since NULL != NULL in SQL unique
constraints.
Each new connection generates a fresh UUID, so the only duplicate
for the same conflict key is a connect/disconnect pair for the same
session. Reconnects always use a new connection ID.
…on tests

Internal tests (connectionlog_internal_test.go):
- TestAddToBatch: pure unit tests for dedup logic (connect/disconnect
  preference, same-status-keeps-later, null connection IDs, mixed).
- TestDBBackend: mock DB tests using Close() for deterministic flush
  (dedup, null IDs, different connection IDs, close-flushes).

Integration tests (connectionlog_test.go):
- TestDBBackendIntegration: real PostgreSQL tests for single connect,
  connect+disconnect across batches, connect+disconnect in same batch,
  multiple independent connections, null connection ID web events, and
  close-flushes-to-DB.

Also fixes BatchUpsertConnectionLogs SQL to use NULLIF for nullable
columns (code, user_agent, user_id, slug_or_port, connection_id,
disconnect_reason, disconnect_time) so Go zero values are stored as
NULL rather than sentinel values.
sreya added 2 commits March 27, 2026 19:28
- Rename NewConnectionLogger to New (package is already connectionlog).
- Rename DBBackend to DBBatcher.
- Replace 'for { select { case <-ctx.Done() } }' with
  'for ctx.Err() == nil { select }' and move drain logic outside the
  loop, eliminating the goto statement.
- Remove null converter functions (nullStringToString, nullInt32ToInt32,
  nullUUIDToUUID) — Go zero values for sql.Null* types already produce
  the right values, and the SQL uses NULLIF to convert them to NULL.
- Fix test names to match cryptokeys pattern (Test_addToBatch,
  Test_batcherFlush).
- Move test helpers to bottom of file.
- Fix lint: unused receivers, unused types, fmt.Sprintf without args.
- Add batchParamsMatcher that checks both mustContainIDs and
  mustNotContainIDs so dedup tests verify the disconnect event won
  (not just that only 1 event was flushed).
- CloseFlushesMultipleEvents test uses n=2 events.
- Add BatchUpsertConnectionLogs to dbauthz MethodTestSuite.

sreya commented Mar 27, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 739ca43737

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

sreya added 2 commits March 27, 2026 20:25
1. Upsert fails deterministically after Close: check b.ctx.Err()
   before attempting the channel send so post-cancel calls never
   silently succeed due to channel capacity.

2. NULLIF(code, 0) replaced with code_valid boolean array: the old
   NULLIF converted legitimate zero exit codes into NULL, making
   successful disconnections indistinguishable from missing codes.
   The SQL now uses a subquery with a CASE on a validity flag so
   nullability follows the Go sql.NullInt32.Valid field.

3. Batch retained on flush failure: previously the batch was cleared
   before the DB call, so any transient error (timeout, connection
   hiccup) silently dropped up to maxBatchSize logs. Now the batch
   is only cleared after a successful write, so failed flushes are
   retried on the next cycle.
…nded growth

When the DB is persistently unavailable, failed flushes retain the
batch for retry but new items keep arriving. Cap the in-memory batch
at maxBatchSize and reject new entries when full, preventing
unbounded memory growth.
sreya force-pushed the connection-logs branch from 9c26f9c to 5777644 on March 27, 2026 at 20:46

sreya commented Mar 27, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 577764438e


Comment on lines +229 to +233
if item.ConnectionStatus == database.ConnectionStatusDisconnected &&
	existing.ConnectionStatus != database.ConnectionStatusDisconnected {
	b.dedupedBatch[key] = item
} else if item.Time.After(existing.Time) {
	b.dedupedBatch[key] = item
}


P1: Preserve connect timestamp when deduplicating disconnect events

When a connect and disconnect for the same (connection_id, workspace_id, agent_name) land in one batch window, this branch replaces the stored connect event with the disconnect event. flush() later uses the retained item's Time for connect_time, so these rows record the disconnect timestamp as the start time. That corrupts duration/ordering semantics for short-lived sessions that complete before the next flush.


Comment on lines +207 to +211
if b.batchLen() >= b.maxBatchSize {
	b.log.Warn(b.ctx, "connection log batch full, rejecting entry",
		slog.F("batch_size", b.batchLen()),
	)
	return
}


P1: Avoid dropping queued logs when retained batch is full

This guard drops incoming events once batchLen() reaches maxBatchSize. In the same codepath, flush() keeps the current batch on write errors for retry, so after one failed full flush all subsequent logs are rejected here even though Upsert() already accepted them into the channel and returned nil. During transient DB issues this causes silent audit-log loss instead of backpressure or explicit failure propagation.


… revert retain-on-failure

Two fixes:

1. When a connect and disconnect for the same session land in one
   batch window, the dedup previously overwrote the connect event
   entirely, causing flush() to use the disconnect timestamp as
   connect_time. Introduced batchEntry wrapper with explicit
   connectTime and disconnectTime fields so merged entries preserve
   the original session start time.

2. Reverted the 'retain batch on failure' behavior. When a flush
   fails, the full batch stays in memory and blocks all new entries
   from being processed (addToBatch keeps appending but nothing
   clears). This caused cascading silent audit-log loss. The batch
   is now always cleared after flush, with the dropped count logged
   at error level.

Also removed the addToBatch capacity cap which was a band-aid for
the retain-on-failure issue.