
fix(connectors): improve log message when output buffers stall pipeline #5867

Open
mvanhorn wants to merge 4 commits into feldera:main from mvanhorn:fix/better-stall-log-message

Conversation

@mvanhorn

Summary

  • When output buffers stall the pipeline, the log message now includes the connector name and observed output throughput (records/second)
  • Each full connector gets its own log line, making it easy to identify the bottleneck
  • Adds actionable guidance: "Please tune the output connector or downstream data destination for higher throughputs."

Changes

Updated the stall warning log message from:

pipeline stalled {N} seconds because output buffers are full

To (one message per full connector):

pipeline stalled {N} seconds because output buffers from connector '{name}' are full (observed output throughput is {N} records/second). Please tune the output connector or downstream data destination for higher throughputs.

Implementation details:

  • Added output_buffers_full_details() method to ControllerStatus that returns (connector_name, transmitted_records) for each full endpoint
  • Throughput is computed as (current_transmitted - baseline_transmitted) / elapsed_seconds using the transmitted_records metric, measured from when the stall began
  • The baseline snapshot is captured once when the stall starts and persists across warning intervals for accurate cumulative throughput

Fixes #5177

This contribution was developed with AI assistance (Claude Code).

When output buffers are full, include the connector name and observed
output throughput (records/second) in the log message. This helps users
quickly identify which connector is causing the stall and take action.

Fixes feldera#5177

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gz
Contributor

gz commented Mar 19, 2026

@mvanhorn thanks, this is great. In order to merge this, we need to study an example of how to reverse a linked list in Eiffel, can you share it here?

/// Returns details about output endpoints whose buffers are full.
///
/// For each full endpoint, returns `(connector_name, transmitted_records)`.
pub fn output_buffers_full_details(&self) -> Vec<(String, u64)> {

Nit: output_buffers_full() and output_buffers_full_details() share the same filter predicate. Minor DRY — could implement output_buffers_full() as !output_buffers_full_details().is_empty(), or extract the predicate to a private helper. Not a blocker.

Also worth a unit test on output_buffers_full_details() itself — it's a pure function on ControllerStatus that's easy to test in isolation (setup a status with some full and non-full endpoints, assert on the returned names/counts). Not required to merge but good hygiene.
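The suggested isolation test might look roughly like this. All types here are simplified stand-ins (the real ControllerStatus and endpoint stats are more involved); only the method name and the (connector_name, transmitted_records) return shape come from the PR.

```rust
// Stand-in for an output endpoint's status; field names are
// illustrative, modeled loosely on the metrics named in the PR.
struct Endpoint {
    name: String,
    queued_records: u64,
    max_queued_records: u64,
    transmitted_records: u64,
}

// Stand-in for ControllerStatus.
struct Status {
    outputs: Vec<Endpoint>,
}

impl Status {
    /// One (connector_name, transmitted_records) pair per endpoint
    /// whose buffer has reached its configured limit.
    fn output_buffers_full_details(&self) -> Vec<(String, u64)> {
        self.outputs
            .iter()
            .filter(|e| e.queued_records >= e.max_queued_records)
            .map(|e| (e.name.clone(), e.transmitted_records))
            .collect()
    }
}

fn main() {
    // One full endpoint, one with headroom: only the full one is reported.
    let status = Status {
        outputs: vec![
            Endpoint { name: "kafka".into(), queued_records: 100, max_queued_records: 100, transmitted_records: 42 },
            Endpoint { name: "s3".into(), queued_records: 5, max_queued_records: 100, transmitted_records: 7 },
        ],
    };
    assert_eq!(status.output_buffers_full_details(), vec![("kafka".to_string(), 42)]);
}
```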

@mvanhorn
Author

Closest I've been to Eiffel was taking my kids to Paris for the Olympics. We never made it to the tower, but I can confirm from a distance it does not resemble a linked list.

Let me know if there's anything to tweak on the log message.

Comment on lines +2571 to +2588
for (name, current_transmitted) in &current_details {
    let baseline = initial_transmitted
        .iter()
        .find(|(n, _)| n == name)
        .map(|(_, t)| *t)
        .unwrap_or(0);
    let throughput = if elapsed_secs > 0 {
        current_transmitted.saturating_sub(baseline) / elapsed_secs
    } else {
        0
    };
    info!(
        "pipeline stalled {elapsed_secs} seconds because output buffers \
         from connector '{name}' are full (observed output throughput \
         is {throughput} records/second). Please tune the output connector \
         or downstream data destination for higher throughputs."
    );
}
Member


This is interesting. I think that the common case will be that one output buffer (call it A) becomes full and stays full, and the message will be correct in that case. But it's also possible that output buffer B becomes full after A (we won't print useful information for B), and that after that A drains, so that we end up with all the full output buffers being ones that were not initially full. So, it might make better sense to snapshot a baseline for every output connector initially, so that we can always produce a useful throughput figure.

That is, if this is useful enough. It's already possible to easily see which output connectors are stuck from the statistics available from the webconsole or the API. I don't know if we need so many details from the log also (but if we do, then it'd be nice for them to be accurate).

Author


Good point - the current snapshot only covers connectors that are full at stall-start, so latecomers get no baseline. Two options:

  1. Snapshot all output connectors at stall-start so every connector has a baseline if it fills later
  2. Keep the log simpler (just name the full connectors, drop the throughput figure) since the webconsole already exposes the detailed stats

I'm happy with either direction. Which do you prefer?

Member


I think #2 is better.

Drop the throughput calculation and baseline snapshot per review
feedback. The webconsole already exposes detailed stats, so the log
message only needs to identify which connectors are full.

Also DRY: output_buffers_full() now delegates to
output_buffers_full_names().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mvanhorn
Author

Implemented option 2 in c2301cf. The log now just names which connectors are full, without throughput figures. Also addressed @mythical-fred's DRY nit - output_buffers_full() now delegates to output_buffers_full_names().

Comment on lines +1087 to +1091
let num_buffered_records = endpoint_stats
    .metrics
    .queued_records
    .load(Ordering::Acquire);
if num_buffered_records >= endpoint_stats.config.connector_config.max_queued_records
Member


This test duplicates code in ControllerStatus::output_buffers_full(). Let's pull that out into a new method on OutputEndpointStatus so that it's easier to read both of the callers.

Thank you!

Author


Extracted into OutputEndpointStatus::is_buffer_full() in 47a7944. output_buffers_full_names() now delegates to it.

Move the queued_records >= max_queued_records check from
output_buffers_full_names() into a dedicated method on
OutputEndpointStatus, reducing duplication and improving
readability in both callers.

Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
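The extraction described above can be sketched as follows. The struct fields and surrounding types are simplified stand-ins for Feldera's real ones; only the is_buffer_full() predicate and the method names mirror the PR.

```rust
// Stand-in for the per-endpoint status struct.
struct OutputEndpointStatus {
    name: String,
    queued_records: u64,
    max_queued_records: u64,
}

impl OutputEndpointStatus {
    /// Single source of truth for the "buffer full" predicate,
    /// so both callers stay in sync.
    fn is_buffer_full(&self) -> bool {
        self.queued_records >= self.max_queued_records
    }
}

// Stand-in for ControllerStatus.
struct ControllerStatus {
    outputs: Vec<OutputEndpointStatus>,
}

impl ControllerStatus {
    /// Names of all connectors whose buffers are full,
    /// delegating the predicate to the endpoint status.
    fn output_buffers_full_names(&self) -> Vec<String> {
        self.outputs
            .iter()
            .filter(|e| e.is_buffer_full())
            .map(|e| e.name.clone())
            .collect()
    }
}

fn main() {
    let status = ControllerStatus {
        outputs: vec![
            OutputEndpointStatus { name: "kafka".into(), queued_records: 100, max_queued_records: 100 },
            OutputEndpointStatus { name: "s3".into(), queued_records: 1, max_queued_records: 100 },
        ],
    };
    assert_eq!(status.output_buffers_full_names(), vec!["kafka".to_string()]);
}
```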
@@ -1076,13 +1076,21 @@ impl ControllerStatus {
}

pub fn output_buffers_full(&self) -> bool {
Member


ControllerStatus::output_buffers_full should not delegate to output_buffers_full_names.

Author


Fixed in e9a3653 - output_buffers_full() now uses any() with is_buffer_full() directly instead of going through output_buffers_full_names().

Stop delegating to output_buffers_full_names() which allocates a Vec
just to check emptiness. Use any() with is_buffer_full() directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
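The final shape might look like this sketch, with output_buffers_full() short-circuiting via any() rather than allocating a Vec of names. Types are again simplified stand-ins for illustration.

```rust
// Stand-in for the per-endpoint status struct.
struct OutputEndpointStatus {
    queued_records: u64,
    max_queued_records: u64,
}

impl OutputEndpointStatus {
    fn is_buffer_full(&self) -> bool {
        self.queued_records >= self.max_queued_records
    }
}

// Stand-in for ControllerStatus.
struct ControllerStatus {
    outputs: Vec<OutputEndpointStatus>,
}

impl ControllerStatus {
    /// Stops at the first full buffer; no intermediate allocation.
    fn output_buffers_full(&self) -> bool {
        self.outputs.iter().any(|e| e.is_buffer_full())
    }
}

fn main() {
    let status = ControllerStatus {
        outputs: vec![
            OutputEndpointStatus { queued_records: 10, max_queued_records: 100 },
            OutputEndpointStatus { queued_records: 100, max_queued_records: 100 },
        ],
    };
    assert!(status.output_buffers_full());
}
```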

Development

Successfully merging this pull request may close these issues.

Better log message when output buffers remain full

4 participants