Skip to content

sds: fix handling of cfl_sds return buffers after cfl upgrade#11813

Open
edsiper wants to merge 7 commits into
masterfrom
fix-valgrind-strict-leaks
Open

sds: fix handling of cfl_sds return buffers after cfl upgrade#11813
edsiper wants to merge 7 commits into
masterfrom
fix-valgrind-strict-leaks

Conversation

@edsiper
Copy link
Copy Markdown
Member

@edsiper edsiper commented May 16, 2026

  • retain resized SDS response buffers in the HTTP/1 server
  • propagate HTTP request parameter/auth setup failures
  • clean up OpenTelemetry auth setup failure paths
  • free dynamic Kafka topic split lists
  • clean up Fluent Bit startup failure paths

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Bug Fixes

    • Improved error handling and resource cleanup across Prometheus decoding, HTTP client/auth handling, and output plugins (Kafka, OpenTelemetry)
    • Fixed response buffer handling and trace output initialization
  • New Features

    • Integration tests: per-test teardown to stop active services and Valgrind-aware test timeouts
  • Chores

    • Incremented cmetrics library patch version

Review Change Stack

edsiper added 6 commits May 16, 2026 07:30
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

📝 Walkthrough

Walkthrough

Bumps cmetrics patch version and applies coordinated fixes: Prometheus decoder/lexer error propagation, HTTP client parameter validation and auth-buffer cleanup, plugin/core resource destruction on failures, plus integration test service tracking and teardown.

Changes

Resource Management and Error Handling Improvements

Layer / File(s) Summary
Library version patch update
lib/cmetrics/CMakeLists.txt
CMT_VERSION_PATCH incremented from 3 to 4, updating derived CMT_VERSION_STR and downstream packaging/version outputs.
Prometheus decoder error handling and lexer string management
lib/cmetrics/src/cmt_decode_prometheus.c, lib/cmetrics/src/cmt_decode_prometheus.l
Decoder now prioritizes context.errcode, destroys created cmt and resets context on error. Lexer centralizes string-buffer allocation/append and token assignment (STRBUF_CREATE, STRBUF_APPEND, SET_STR_TOKEN), replacing direct SDS calls in HELPTAG/TYPETAG, quoted-string, and IDENTIFIER actions with unified allocation/error handling.
HTTP client parameter validation and basic auth cleanup
src/flb_http_client.c
flb_http_encode_basic_auth_value() frees partial output and returns the real result on failure. flb_http_request_set_parameters_internal() checks return codes for all setters (method, host, URL, URI, user-agent, content-type, body, basic/bearer/SIGNV4 auth); on any setter error it logs and fails overall.
Resource cleanup in plugins, HTTP response, and startup
plugins/out_kafka/kafka.c, plugins/out_opentelemetry/opentelemetry.c, src/fluent-bit.c, src/http_server/flb_http_server_http1.c
Kafka frees flb_utils_split list. OpenTelemetry destroys HTTP request on basic-auth parameter failure. Fluent Bit startup defers trace_output default and calls flb_destroy(ctx) on certain config/service load failures. HTTP/1 response commit stores concatenation result back into response_buffer.
Integration tests: lifecycle and valgrind timeouts
tests/integration/src/utils/test_service.py, tests/integration/conftest.py, tests/integration/scenarios/out_http/tests/test_out_http_001.py
Adds _active_services tracking and stop_active_services(); FluentBitTestService.start()/stop() updated to register/unregister and conditionally run post_stop. Conftest teardown calls stop_active_services. Test adds _valgrind_timeout() and applies it to wait helpers.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • fluent/fluent-bit#11347: Overlaps changes in plugins/out_opentelemetry/opentelemetry.c touching the same request path.

Suggested reviewers

  • cosmo0920
  • braydonk

🐰 I hopped through buffers, tokens, and trace,
Assigned the pointers to their proper place,
Freed stray lists and closed the request,
Reset the context and handled the rest,
Now the patch version shines with tidy grace.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main objective of the changeset: fixing handling of SDS (Simple Dynamic String) return buffers following a cfl library upgrade, which is reflected across multiple files addressing buffer management, error handling, and resource cleanup.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-valgrind-strict-leaks

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/src/utils/test_service.py`:
- Around line 106-110: The current teardown calls self.post_stop(self) which, if
it raises, prevents the subsequent cleanup calls self._restore_env(),
self._allocated_ports.clear(), and _active_services.discard(self) from running;
wrap the post_stop invocation in a try/except/finally (or try/finally) so that
regardless of exceptions from post_stop the environment restore, port clearing,
and active service discard always execute, and ensure any exception from
post_stop is either logged via the existing logger or re-raised after cleanup to
preserve failure visibility.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cd80c00e-fc7c-4cae-a046-8422dabf27bc

📥 Commits

Reviewing files that changed from the base of the PR and between 0b847fb and 3605fed.

📒 Files selected for processing (3)
  • tests/integration/conftest.py
  • tests/integration/scenarios/out_http/tests/test_out_http_001.py
  • tests/integration/src/utils/test_service.py

Comment on lines +106 to +110
if self.post_stop and had_allocated_ports:
self.post_stop(self)
self._restore_env()
self._allocated_ports.clear()
_active_services.discard(self)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ensure teardown cleanup always runs even if post_stop fails.

On Line 106, an exception from post_stop will skip Lines 108-110, leaving env vars, allocated ports, and _active_services stale. This can cascade into later tests.

Suggested fix
 def stop(self):
     had_allocated_ports = bool(self._allocated_ports)
+    stop_error = None

     try:
         if self.flb:
             self.flb.stop()
+    except Exception as exc:
+        stop_error = exc
     finally:
-        if self.post_stop and had_allocated_ports:
-            self.post_stop(self)
-        self._restore_env()
-        self._allocated_ports.clear()
-        _active_services.discard(self)
+        try:
+            if self.post_stop and had_allocated_ports:
+                self.post_stop(self)
+        except Exception as exc:
+            if stop_error is None:
+                stop_error = exc
+        finally:
+            self._restore_env()
+            self._allocated_ports.clear()
+            _active_services.discard(self)
+
+    if stop_error is not None:
+        raise stop_error
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/src/utils/test_service.py` around lines 106 - 110, The
current teardown calls self.post_stop(self) which, if it raises, prevents the
subsequent cleanup calls self._restore_env(), self._allocated_ports.clear(), and
_active_services.discard(self) from running; wrap the post_stop invocation in a
try/except/finally (or try/finally) so that regardless of exceptions from
post_stop the environment restore, port clearing, and active service discard
always execute, and ensure any exception from post_stop is either logged via the
existing logger or re-raised after cleanup to preserve failure visibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant