fix(worker): add rows_dropped metric for ClickhouseWriter exhaust #13488
Merged
sumerman merged 7 commits on May 12, 2026
Conversation
Emit `langfuse.queue.clickhouse_writer.rows_dropped` with an `entity_type` tag when rows are discarded after max flush attempts, alongside existing error increments and logs.

Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request description
Suggested PR title (Conventional Commits):

fix(worker): add rows_dropped metric for ClickhouseWriter exhaust

Emit `langfuse.queue.clickhouse_writer.rows_dropped` with an `entity_type` tag when rows are discarded after max flush attempts, alongside existing `langfuse.queue.clickhouse_writer.error` increments and logs.

What does this PR do?
Adds a structured Datadog/statsd counter for rows dropped from `ClickhouseWriter` after flush retries are exhausted, tagged by ClickHouse table / entity (`entity_type`). This makes drops easier to alert on and slice in dashboards (for example, by `traces` versus `observations`) without changing retry or drop behavior, and without removing the existing per-row error increments or structured error logs.

Motivation: Improves operator discoverability for a path that is otherwise only visible through logs and a generic error counter (see discussion on #13468: async CH writes and S3 replay remain the durability model; this change only improves observability when drops occur).
Fixes #13468
N/A — no user-facing UI changes.
Type of change
If you prefer to classify this as additive observability only, switch the checkmark to New feature and optionally retitle the PR to `feat(worker): …` so the title and checkbox stay aligned.

Mandatory Tasks
Checklist
- Formatting checked (`pnpm run format`)
- Lint checked (`npm run lint`)

How to use this checklist (the template is phrased as negatives): delete any line that is not true for you. Example: if you did everything, delete all seven lines above, or replace them with the positive list below.
Positive checklist (optional replacement if you remove the default bullets):
- Formatted (`pnpm run format` where applicable)
- Linted (`pnpm --filter worker run lint`)
- `ClickhouseWriter.unit.test.ts` asserts the new `recordIncrement` call when rows are dropped after max attempts
- Tests run via `cd worker && pnpm exec vitest run src/services/ClickhouseWriter/ClickhouseWriter.unit.test.ts`

Impacted packages
- worker

Verification commands
If Vitest fails on missing env when `../.env` is absent, set minimal worker env (example): `CLICKHOUSE_URL`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASSWORD`, `LANGFUSE_S3_EVENT_UPLOAD_BUCKET`, `DATABASE_URL`, `REDIS_CONNECTION_STRING`.

Code reference (summary)
- `worker/src/services/ClickhouseWriter/index.ts`: `recordIncrement("langfuse.queue.clickhouse_writer.rows_dropped", droppedCount, { entity_type: tableName })` inside the existing `if (droppedCount > 0)` block
- `worker/src/services/ClickhouseWriter/ClickhouseWriter.unit.test.ts`: `should drop records after max attempts`

This file is for drafting the GitHub PR body. Remove it from the commit before merge, or delete it after opening the PR, unless you intend to keep it in the repo.
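A minimal sketch of what the referenced change might look like, assuming a statsd-style `recordIncrement(name, value, tags)` helper like the worker's own. The `emitted` array and `onFlushExhausted` function are illustrative stand-ins, not the repository's actual code; only the metric name and `entity_type` tag key come from the PR.

```typescript
type Tags = Record<string, string>;

// Hypothetical stand-in for the worker's metrics helper; it records
// calls in memory so the behavior is observable in this sketch.
const emitted: { name: string; value: number; tags: Tags }[] = [];
function recordIncrement(name: string, value: number, tags: Tags): void {
  emitted.push({ name, value, tags });
}

// Illustrative tail of a flush attempt after retries are exhausted.
function onFlushExhausted(droppedCount: number, tableName: string): void {
  if (droppedCount > 0) {
    // The single line this PR adds, per the code reference above.
    recordIncrement(
      "langfuse.queue.clickhouse_writer.rows_dropped",
      droppedCount,
      { entity_type: tableName },
    );
  }
}

onFlushExhausted(3, "traces");
console.log(emitted[0].name, emitted[0].value, emitted[0].tags.entity_type);
```

Because the count is passed as the metric value (rather than incrementing once per row), a single flush that drops many rows still produces one data point, which keeps dashboards cheap to slice by `entity_type`.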
Disclaimer: Experimental PR review
Greptile Summary
Emits a new `langfuse.queue.clickhouse_writer.rows_dropped` counter (tagged by `entity_type`) whenever rows are discarded from `ClickhouseWriter` after exhausting retries, and adds a corresponding unit-test assertion.

- `worker/src/services/ClickhouseWriter/index.ts`: a single `recordIncrement` call with `droppedCount` and `{ entity_type: tableName }` is inserted inside the pre-existing `if (droppedCount > 0)` block; no retry or drop logic is touched.
- `ClickhouseWriter.unit.test.ts`: the `should drop records after max attempts` test gains an assertion that verifies the new metric is fired with the correct name, value, and tag when all retry attempts are exhausted.

Confidence Score: 4/5
Safe to merge; the change is additive and contained entirely within the existing error-handling block.
The only finding is that the new metric tag uses `entity_type` (snake_case) while the sibling `recordGauge` call in the same function uses `entityType` (camelCase), which could cause fragmented dashboards if both dimensions are queried together.

The `entity_type`/`entityType` tag inconsistency sits in `worker/src/services/ClickhouseWriter/index.ts` around the flush error-handling block.

Important Files Changed
- `worker/src/services/ClickhouseWriter/index.ts`: adds the `rows_dropped` metric inside the existing `droppedCount > 0` guard; minor tag key casing inconsistency with the sibling `recordGauge` call that uses `entityType` instead of `entity_type`.
- `ClickhouseWriter.unit.test.ts`: verifies `recordIncrement` is called with the right metric name, count, and `entity_type` tag when rows are dropped after max attempts.

Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ClickhouseWriter flush] --> B{Insert succeeds?}
    B -- Yes --> C[Log debug + recordGauge queue length]
    B -- No --> D[Catch error]
    D --> E{For each queueItem}
    E --> F{item.attempts < maxAttempts?}
    F -- Yes --> G[Re-queue with attempts + 1]
    F -- No --> H[droppedCount++\nrecordIncrement error per-row]
    H --> E
    E --> I{droppedCount > 0?}
    I -- Yes --> J["recordIncrement rows_dropped\n(droppedCount, entity_type=tableName) NEW"]
    J --> K[logger.error with droppedIds]
    I -- No --> L[End]
    K --> L
```
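The flow in the flowchart can be sketched in TypeScript. This is an illustrative reconstruction, not the repository's code: `handleFailedFlush`, `QueueItem`, and the in-memory `metrics` array are hypothetical names, and only the two metric names and the `entity_type` tag come from the PR.

```typescript
type QueueItem = { id: string; attempts: number };

// Hypothetical in-memory recorder standing in for the statsd client.
const metrics: { name: string; value: number }[] = [];
const recordIncrement = (
  name: string,
  value: number,
  _tags?: Record<string, string>,
): void => void metrics.push({ name, value });

// Illustrative error-handling branch of a flush: re-queue items with
// remaining attempts, drop the rest, then emit the aggregate counter.
function handleFailedFlush(
  queue: QueueItem[],
  failedBatch: QueueItem[],
  tableName: string,
  maxAttempts: number,
): void {
  let droppedCount = 0;
  const droppedIds: string[] = [];
  for (const item of failedBatch) {
    if (item.attempts < maxAttempts) {
      // Re-queue with attempts + 1.
      queue.push({ ...item, attempts: item.attempts + 1 });
    } else {
      // Pre-existing per-row error increment.
      droppedCount++;
      droppedIds.push(item.id);
      recordIncrement("langfuse.queue.clickhouse_writer.error", 1);
    }
  }
  if (droppedCount > 0) {
    // The counter added by this PR, tagged by ClickHouse table.
    recordIncrement(
      "langfuse.queue.clickhouse_writer.rows_dropped",
      droppedCount,
      { entity_type: tableName },
    );
    console.error("dropped rows after max attempts", droppedIds);
  }
}

const queue: QueueItem[] = [];
handleFailedFlush(
  queue,
  [{ id: "a", attempts: 1 }, { id: "b", attempts: 5 }],
  "traces",
  5,
);
```

In this sketch, item `a` (1 attempt) is re-queued with `attempts: 2`, while item `b` has exhausted its 5 attempts and is dropped, producing one per-row `error` increment followed by a single `rows_dropped` emission with value 1.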
Reviews (1): Last reviewed commit: "Merge branch 'main' into fix/worker-clic..."