
fix(worker): add rows_dropped metric for ClickhouseWriter exhaust#13488

Merged
sumerman merged 7 commits into
langfuse:mainfrom
hello-args:fix/worker-clickhouse-writer-drop-metrics
May 12, 2026

Conversation


@hello-args hello-args commented May 6, 2026

Pull request description

Suggested PR title (Conventional Commits):

fix(worker): add rows_dropped metric for ClickhouseWriter exhaust


Emit langfuse.queue.clickhouse_writer.rows_dropped with an entity_type tag when rows are discarded after max flush attempts, alongside existing langfuse.queue.clickhouse_writer.error increments and logs.

What does this PR do?


Adds a structured Datadog/statsd counter for rows dropped from ClickhouseWriter after flush retries are exhausted, tagged by ClickHouse table / entity (entity_type). This makes drops easier to alert on and slice in dashboards (for example by traces versus observations) without changing retry or drop behavior, and without removing the existing per-row error increments or structured error logs.

Motivation: Improves operator discoverability for a path that is otherwise only visible through logs and a generic error counter (see discussion on #13468: async CH writes and S3 replay remain the durability model; this change only improves observability when drops occur).

Fixes #13468

N/A: no user-facing UI changes.

Type of change

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] Chore (tooling, dependencies, CI, workflows, repo upkeep, or other maintenance work)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Refactor (restructures existing code without changing behavior, e.g. simplify logic, split modules, reduce duplication)
  • [ ] This change requires a documentation update

If you prefer to classify this as additive observability only, switch the checkmark to New feature and optionally retitle the PR to feat(worker): … so the title and checkbox stay aligned.

Mandatory Tasks

  • Make sure you have self-reviewed the code. A decent size PR without self-review might be rejected.

Checklist

  • I haven't read the contributing guide
  • My code doesn't follow the style guidelines of this project (pnpm run format)
  • I haven't commented my code, particularly in hard-to-understand areas
  • I haven't checked if my PR needs changes to the documentation
  • I haven't checked if my changes generate no new warnings (npm run lint)
  • I haven't added tests that prove my fix is effective or that my feature works
  • I haven't checked if new and existing unit tests pass locally with my changes

How to use this checklist: the template phrases each item as a negative, so delete any line that is not true for you. If you did everything, delete all seven lines above, or replace them with the positive list below.

Positive checklist (optional replacement if you remove the default bullets):

  • I have read the contributing guide
  • My code follows the style guidelines of this project (pnpm run format where applicable)
  • Comments: N/A for this small change (no new non-obvious logic)
  • Documentation: not required for an internal metric; optional follow-up to mention the metric in ops/runbook docs
  • No new warnings from pnpm --filter worker run lint
  • Tests added/updated: ClickhouseWriter.unit.test.ts asserts the new recordIncrement call when rows are dropped after max attempts
  • Unit tests pass locally: cd worker && pnpm exec vitest run src/services/ClickhouseWriter/ClickhouseWriter.unit.test.ts

Impacted packages

  • worker

Verification commands

cd worker
pnpm exec vitest run src/services/ClickhouseWriter/ClickhouseWriter.unit.test.ts
pnpm --filter worker run lint
pnpm --filter worker run typecheck

If Vitest fails due to missing environment variables when ../.env is absent, set a minimal worker environment, for example:

CLICKHOUSE_URL, CLICKHOUSE_USER, CLICKHOUSE_PASSWORD, LANGFUSE_S3_EVENT_UPLOAD_BUCKET, DATABASE_URL, REDIS_CONNECTION_STRING.
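A minimal sketch of exporting those variables for a local run (all values below are placeholders for a local setup, not real credentials or the project's defaults):

```shell
# Placeholder values for local test runs only; substitute your own endpoints.
export CLICKHOUSE_URL="http://localhost:8123"
export CLICKHOUSE_USER="default"
export CLICKHOUSE_PASSWORD="clickhouse"
export LANGFUSE_S3_EVENT_UPLOAD_BUCKET="langfuse-local"
export DATABASE_URL="postgresql://postgres:postgres@localhost:5432/postgres"
export REDIS_CONNECTION_STRING="redis://localhost:6379"
```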


Code reference (summary)

  • worker/src/services/ClickhouseWriter/index.ts: on drop-after-exhaust, call recordIncrement("langfuse.queue.clickhouse_writer.rows_dropped", droppedCount, { entity_type: tableName }) inside the existing if (droppedCount > 0) block
  • worker/src/services/ClickhouseWriter/ClickhouseWriter.unit.test.ts: assert the new metric in "should drop records after max attempts"
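The drop-after-exhaust path described above can be sketched as follows. This is a simplified stand-in, not the real writer: the names recordIncrement, droppedCount, and tableName come from the PR description, while the surrounding flush-failure loop and MAX_ATTEMPTS constant are assumptions for illustration.

```typescript
// Hedged sketch of the drop-after-exhaust path. The metric names and the
// entity_type tag come from the PR description; everything else is a stub.
type Tags = Record<string, string>;
const emitted: Array<{ metric: string; value: number; tags: Tags }> = [];

// Stand-in for the worker's statsd/Datadog helper.
function recordIncrement(metric: string, value: number, tags: Tags): void {
  emitted.push({ metric, value, tags });
}

interface QueueItem { id: string; attempts: number }
const MAX_ATTEMPTS = 3; // hypothetical retry limit

function handleFlushFailure(tableName: string, queue: QueueItem[]): QueueItem[] {
  const requeued: QueueItem[] = [];
  let droppedCount = 0;
  for (const item of queue) {
    if (item.attempts < MAX_ATTEMPTS) {
      // Still retryable: re-queue with an incremented attempt count.
      requeued.push({ ...item, attempts: item.attempts + 1 });
    } else {
      droppedCount++;
      // Existing per-row error counter stays in place.
      recordIncrement("langfuse.queue.clickhouse_writer.error", 1, {
        entity_type: tableName,
      });
    }
  }
  if (droppedCount > 0) {
    // NEW: one aggregated counter per flush, tagged by table/entity.
    recordIncrement("langfuse.queue.clickhouse_writer.rows_dropped", droppedCount, {
      entity_type: tableName,
    });
  }
  return requeued;
}

// Example: two rows have exhausted retries, one is still retryable.
const remaining = handleFlushFailure("traces", [
  { id: "a", attempts: 3 },
  { id: "b", attempts: 3 },
  { id: "c", attempts: 1 },
]);
console.log(remaining.length); // 1
```

The aggregated counter fires once per flush with the total drop count, so alerting can threshold on rows per interval instead of counting individual error increments.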

This file is for drafting the GitHub PR body. Remove it from the commit before merge, or delete after opening the PR, unless you intend to keep it in the repo.

Disclaimer: Experimental PR review

Greptile Summary

Emits a new langfuse.queue.clickhouse_writer.rows_dropped counter (tagged by entity_type) whenever rows are discarded from ClickhouseWriter after exhausting retries, and adds a corresponding unit-test assertion.

  • worker/src/services/ClickhouseWriter/index.ts: A single recordIncrement call with droppedCount and { entity_type: tableName } is inserted inside the pre-existing if (droppedCount > 0) block; no retry or drop logic is touched.
  • ClickhouseWriter.unit.test.ts: The should drop records after max attempts test gains an assertion that verifies the new metric is fired with the correct name, value, and tag when all retry attempts are exhausted.

Confidence Score: 4/5

Safe to merge; the change is additive and contained entirely within the existing error-handling block.

The only finding is that the new metric tag uses entity_type (snake_case) while the sibling recordGauge call in the same function uses entityType (camelCase), which could cause fragmented dashboards if both dimensions are queried together.

The entity_type / entityType tag inconsistency sits in worker/src/services/ClickhouseWriter/index.ts around the flush error-handling block.

Important Files Changed

  • worker/src/services/ClickhouseWriter/index.ts: adds the rows_dropped metric inside the existing droppedCount > 0 guard; minor tag-key casing inconsistency with the sibling recordGauge call, which uses entityType instead of entity_type.
  • worker/src/services/ClickhouseWriter/ClickhouseWriter.unit.test.ts: new assertion correctly verifies recordIncrement is called with the right metric name, count, and entity_type tag when rows are dropped after max attempts.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ClickhouseWriter flush] --> B{Insert succeeds?}
    B -- Yes --> C[Log debug + recordGauge queue length]
    B -- No --> D[Catch error]
    D --> E{For each queueItem}
    E --> F{item.attempts < maxAttempts?}
    F -- Yes --> G[Re-queue with attempts + 1]
    F -- No --> H[droppedCount++\nrecordIncrement error per-row]
    H --> E
    E --> I{droppedCount > 0?}
    I -- Yes --> J["recordIncrement rows_dropped\n(droppedCount, entity_type=tableName) NEW"]
    J --> K[logger.error with droppedIds]
    I -- No --> L[End]
    K --> L
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
worker/src/services/ClickhouseWriter/index.ts:523-527
**Inconsistent tag key casing with sibling metric**

The new metric uses `entity_type` (snake_case), while the `recordGauge` call a few lines above (line 496–503) uses `entityType` (camelCase) for the same dimension on the same table. If both tags land in the same Datadog/StatsD tag namespace, a dashboard querying `entity_type` will miss queue-length data and vice-versa. `entity_type` is consistent with `ingestionQueue.ts` and the broader codebase convention, so it looks like the `recordGauge` call has a pre-existing inconsistency — but it's worth aligning them in this PR to avoid fragmented dashboards.
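One low-risk way to prevent that drift (a hypothetical refactor, not part of this PR) is to hoist the tag key into a shared constant so the gauge and the counter cannot diverge. The metric names match the PR; the report function and the stubbed helpers are illustrative assumptions.

```typescript
// Hypothetical alignment sketch: share one tag-key constant across metrics.
// The recordGauge/recordIncrement stubs stand in for the real helpers.
const ENTITY_TYPE_TAG = "entity_type" as const;

type Tags = Record<string, string>;
const calls: Array<{ fn: string; tags: Tags }> = [];
const recordGauge = (name: string, value: number, tags: Tags) =>
  calls.push({ fn: "gauge", tags });
const recordIncrement = (name: string, value: number, tags: Tags) =>
  calls.push({ fn: "increment", tags });

function report(tableName: string, queueLength: number, droppedCount: number) {
  // Both metrics use the same tag key, so dashboards can slice queue
  // length and drops on a single dimension.
  recordGauge("langfuse.queue.clickhouse_writer.queue_length", queueLength, {
    [ENTITY_TYPE_TAG]: tableName,
  });
  if (droppedCount > 0) {
    recordIncrement("langfuse.queue.clickhouse_writer.rows_dropped", droppedCount, {
      [ENTITY_TYPE_TAG]: tableName,
    });
  }
}

report("observations", 5, 2);
console.log(calls.every((c) => ENTITY_TYPE_TAG in c.tags)); // true
```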


Emit langfuse.queue.clickhouse_writer.rows_dropped with entity_type tag
when rows are discarded after max flush attempts, alongside existing
error increments and logs.

Co-authored-by: Cursor <cursoragent@cursor.com>

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@dosubot dosubot Bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label May 6, 2026

CLAassistant commented May 6, 2026

CLA assistant check
All committers have signed the CLA.

@sumerman sumerman added this pull request to the merge queue May 12, 2026
@dosubot dosubot Bot added the auto-merge This PR is set to be merged label May 12, 2026
Merged via the queue into langfuse:main with commit e7cf58d May 12, 2026
36 checks passed
@dosubot dosubot Bot removed the auto-merge This PR is set to be merged label May 12, 2026


Development

Successfully merging this pull request may close these issues.

bug(worker): ClickhouseWriter drops rows after max flush attempts with no DLQ

3 participants