fix(connectors): improve log message when output buffers stall pipeline#5867
fix(connectors): improve log message when output buffers stall pipeline#5867mvanhorn wants to merge 4 commits intofeldera:mainfrom
Conversation
When output buffers are full, include the connector name and observed output throughput (records/second) in the log message. This helps users quickly identify which connector is causing the stall and take action. Fixes feldera#5177 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@mvanhorn thanks this is great. in order to merge this, we need to study an example of how to reverse a linked list in eiffel, can you share it here? |
| /// Returns details about output endpoints whose buffers are full. | ||
| /// | ||
| /// For each full endpoint, returns `(connector_name, transmitted_records)`. | ||
| pub fn output_buffers_full_details(&self) -> Vec<(String, u64)> { |
There was a problem hiding this comment.
Nit: output_buffers_full() and output_buffers_full_details() share the same filter predicate. Minor DRY — could implement output_buffers_full() as !output_buffers_full_details().is_empty(), or extract the predicate to a private helper. Not a blocker.
Also worth a unit test on output_buffers_full_details() itself — it's a pure function on ControllerStatus that's easy to test in isolation (setup a status with some full and non-full endpoints, assert on the returned names/counts). Not required to merge but good hygiene.
|
Closest I've been to Eiffel was taking my kids to Paris for the Olympics. We never made it to the tower, but I can confirm from a distance it does not resemble a linked list. Let me know if there's anything to tweak on the log message. |
crates/adapters/src/controller.rs
Outdated
| for (name, current_transmitted) in ¤t_details { | ||
| let baseline = initial_transmitted | ||
| .iter() | ||
| .find(|(n, _)| n == name) | ||
| .map(|(_, t)| *t) | ||
| .unwrap_or(0); | ||
| let throughput = if elapsed_secs > 0 { | ||
| current_transmitted.saturating_sub(baseline) / elapsed_secs | ||
| } else { | ||
| 0 | ||
| }; | ||
| info!( | ||
| "pipeline stalled {elapsed_secs} seconds because output buffers \ | ||
| from connector '{name}' are full (observed output throughput \ | ||
| is {throughput} records/second). Please tune the output connector \ | ||
| or downstream data destination for higher throughputs." | ||
| ); | ||
| } |
There was a problem hiding this comment.
This is interesting. I think that the common case will be that one output buffer (call it A) becomes full and stays full, and the message will be correct in that case. But it's also possible that output buffer B becomes full after A (we won't print useful information for B), and that after that A drains, so that we end up with all the full output buffers being ones that were not initially. So, it might make better sense to snapshot a baseline for every output connector initially, so that we can always produce a useful throughput figure).
That is, if this is useful enough. It's already possible to easily see which output connectors are stuck from the statistics available from the webconsole or the API. I don't know if we need so many details from the log also (but if we do, then it'd be nice for them to be accurate).
There was a problem hiding this comment.
Good point - the current snapshot only covers connectors that are full at stall-start, so latecomers get no baseline. Two options:
- Snapshot all output connectors at stall-start so every connector has a baseline if it fills later
- Keep the log simpler (just name the full connectors, drop the throughput figure) since the webconsole already exposes the detailed stats
I'm happy with either direction. Which do you prefer?
Drop the throughput calculation and baseline snapshot per review feedback. The webconsole already exposes detailed stats, so the log message only needs to identify which connectors are full. Also DRY: output_buffers_full() now delegates to output_buffers_full_names(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Implemented option 2 in c2301cf. The log now just names which connectors are full, without throughput figures. Also addressed @mythical-fred's DRY nit - |
| let num_buffered_records = endpoint_stats | ||
| .metrics | ||
| .queued_records | ||
| .load(Ordering::Acquire); | ||
| if num_buffered_records >= endpoint_stats.config.connector_config.max_queued_records |
There was a problem hiding this comment.
This test duplicates code in ControllerStatus::output_buffers_full(). Let's pull that out into a new method on OutputEndpointStatus so that it's easier to read both of the callers.
Thank you!
There was a problem hiding this comment.
Extracted into OutputEndpointStatus::is_buffer_full() in 47a7944. output_buffers_full_names() now delegates to it.
Move the queued_records >= max_queued_records check from output_buffers_full_names() into a dedicated method on OutputEndpointStatus, reducing duplication and improving readability in both callers. Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
| @@ -1076,13 +1076,21 @@ impl ControllerStatus { | |||
| } | |||
|
|
|||
| pub fn output_buffers_full(&self) -> bool { | |||
There was a problem hiding this comment.
ControllerStatus::output_buffers_full should not delegate to output_buffers_full_names.
There was a problem hiding this comment.
Fixed in e9a3653 - output_buffers_full() now uses any() with is_buffer_full() directly instead of going through output_buffers_full_names().
Stop delegating to output_buffers_full_names() which allocates a Vec just to check emptiness. Use any() with is_buffer_full() directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Changes
Updated the stall warning log message from:
To (one message per full connector):
Implementation details:
output_buffers_full_details()method toControllerStatusthat returns(connector_name, transmitted_records)for each full endpoint(current_transmitted - baseline_transmitted) / elapsed_secondsusing thetransmitted_recordsmetric, measured from when the stall beganFixes #5177
This contribution was developed with AI assistance (Claude Code).