feat: Implement host-level telemetry batching to reduce rate limiting #718

samikshya-db merged 19 commits into main
Conversation
Changes telemetry client architecture from per-session to per-host batching, matching the JDBC driver implementation. This reduces the number of HTTP requests to the telemetry endpoint and prevents rate limiting in test environments.

Key changes:
- Add _TelemetryClientHolder with reference counting for shared clients
- Change TelemetryClientFactory to key clients by host_url instead of session_id
- Add getHostUrlSafely() helper for defensive null handling
- Update all callers (client.py, exc.py, latency_logger.py) to pass host_url

Before: 100 connections to the same host = 100 separate TelemetryClients
After: 100 connections to the same host = 1 shared TelemetryClient (refcount=100)

This fixes rate-limiting issues seen in e2e tests, where 300+ parallel connections were overwhelming the telemetry endpoint with 429 errors.
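A minimal sketch of the host-keyed, reference-counted pattern described above, assuming illustrative names and internals; the real `_TelemetryClientHolder` and `TelemetryClientFactory` may differ in detail:

```python
import threading
from typing import Dict


class TelemetryClient:
    """Stand-in for the real client: batches and flushes events for one host."""

    def __init__(self, host_url: str):
        self.host_url = host_url

    def close(self) -> None:
        pass  # the real client would flush pending events and stop its executor


class _TelemetryClientHolder:
    """Pairs one shared client with a reference count."""

    def __init__(self, client: TelemetryClient):
        self.client = client
        self.ref_count = 0


class TelemetryClientFactory:
    """Hands out one shared TelemetryClient per host_url instead of per session."""

    _holders: Dict[str, _TelemetryClientHolder] = {}
    _lock = threading.Lock()

    @classmethod
    def get_telemetry_client(cls, host_url: str) -> TelemetryClient:
        with cls._lock:
            holder = cls._holders.get(host_url)
            if holder is None:
                holder = _TelemetryClientHolder(TelemetryClient(host_url))
                cls._holders[host_url] = holder
            holder.ref_count += 1  # one increment per open connection
            return holder.client

    @classmethod
    def close(cls, host_url: str) -> None:
        with cls._lock:
            holder = cls._holders.get(host_url)
            if holder is None:
                return
            holder.ref_count -= 1
            if holder.ref_count <= 0:  # last connection closed: tear the client down
                holder.client.close()
                del cls._holders[host_url]
```

With this shape, 100 connections to the same workspace share a single client (refcount=100), so HTTP batching to the telemetry endpoint happens once per host rather than once per session.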
Thanks for your contribution! To satisfy the DCO policy in our contributing guide, every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase.
Reduces log noise by changing all telemetry-related log statements (info, warning, error) to debug level. Telemetry operations are background tasks and should not clutter logs with operational messages.

Changes:
- Circuit breaker state changes: info/warning -> debug
- Telemetry send failures: error -> debug
- All telemetry operations now consistently use debug level
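A hedged before/after sketch of the logging change; the logger name, message text, and helper function are illustrative rather than taken from the driver:

```python
import logging

logger = logging.getLogger("databricks.sql.telemetry")  # illustrative logger name


def _on_send_failure(host_url: str, exc: Exception) -> None:
    # Before: failures surfaced at error level and cluttered application logs:
    #   logger.error("Failed to send telemetry batch to %s: %s", host_url, exc)
    # After: telemetry is a background task, so the same event is logged at debug level.
    logger.debug("Failed to send telemetry batch to %s: %s", host_url, exc)
```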
Changes remaining logger.warning in telemetry_push_client.py to debug level for consistency with other telemetry logging.
- Update circuit breaker test to check logger.debug instead of logger.info
- Replace all session_id_hex test parameters with host_url
- Apply Black formatting to exc.py and telemetry_client.py

This fixes test failures caused by the signature change from session_id_hex to host_url in the Error class and TelemetryClientFactory.
Only Error classes changed from session_id_hex to host_url. Other classes (TelemetryClient, ResultSetDownloadHandler, etc.) still use session_id_hex.

Reverted:
- test_telemetry.py: TelemetryClient and initialize_telemetry_client
- test_downloader.py: ResultSetDownloadHandler
- test_download_manager.py: ResultFileDownloadManager

Kept as host_url:
- test_client.py: Error class instantiation
Changes:

1. client.py: Changed all error raises from session_id_hex to host_url
   - Connection class: session_id_hex=self.get_session_id_hex() -> host_url=self.session.host
   - Cursor class: session_id_hex=self.connection.get_session_id_hex() -> host_url=self.connection.session.host
2. test_telemetry.py: Updated get_telemetry_client() and close() calls
   - get_telemetry_client(session_id) -> get_telemetry_client(host_url)
   - close(session_id) -> close(host_url=host_url)
3. test_telemetry_push_client.py: Changed logger.warning to logger.debug
   - Updated test assertion to match the debug logging level

These changes complete the migration from session-level to host-level telemetry client management.
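A small sketch of the raise-site change in client.py, assuming a simplified Error signature; the attribute chain `connection.session.host` follows the commit message, while the class, function, and cursor attributes here are illustrative:

```python
class OperationalError(Exception):
    """Simplified stand-in for the driver's Error class, now keyed by host_url."""

    def __init__(self, message: str, host_url: str = None, **context):
        super().__init__(message)
        self.host_url = host_url
        self.context = context


def _assert_cursor_open(cursor) -> None:
    # `cursor.open` and `cursor.connection.session.host` are assumed attributes.
    if not cursor.open:
        # Before: session_id_hex=cursor.connection.get_session_id_hex()
        # After:  host_url=cursor.connection.session.host
        raise OperationalError(
            "Attempting operation on closed cursor",
            host_url=cursor.connection.session.host,
        )
```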
Changes:

1. Added self._host attribute to store server_hostname
2. Updated all error raises to use host_url=self._host
3. Changed method signatures from session_id_hex to host_url:
   - _check_response_for_error
   - _hive_schema_to_arrow_schema
   - _col_to_description
   - _hive_schema_to_description
   - _check_direct_results_for_error
4. Updated all method calls to pass self._host instead of self._session_id_hex

This completes the migration from session-level to host-level error reporting.
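A sketch of the same migration inside the Thrift backend: the host is captured once in `__init__` and threaded through helpers in place of session_id_hex. The class shape, error type, and response attributes here are simplified assumptions, not the backend's real API:

```python
class ThriftBackend:
    def __init__(self, server_hostname: str):
        # Store the host once so every helper can report host_url on error.
        self._host = server_hostname

    def _check_response_for_error(self, response, host_url: str) -> None:
        # Signature changed from session_id_hex to host_url (per this commit).
        message = getattr(response, "error_message", None)  # assumed attribute
        if message:
            raise RuntimeError(f"{message} (host: {host_url})")

    def fetch(self, response):
        # Callers now pass self._host instead of self._session_id_hex.
        self._check_response_for_error(response, self._host)
        return response
```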
Moved the `# fmt: on` directive to the except block level instead of inside the if statement to resolve Black parsing confusion.
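A sketch of the directive placement, assuming an illustrative send helper; the point is that `# fmt: off`/`# fmt: on` now bracket the region at the except-block level rather than ending inside the nested if statement:

```python
import logging

logger = logging.getLogger(__name__)


def flush_batch(events, send_batch):
    try:
        send_batch(events)
    except Exception as exc:
        # fmt: off
        if logger.isEnabledFor(logging.DEBUG):
            logger.debug("Telemetry send failed: %s", exc)
        # fmt: on
        # (directive closes at the except-block level, not inside the if body)
```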
nikhilsuri-db left a comment

LGTM
- Please manually run the Daily Telemetry E2E Tests.
- Can you also fix daily-telemetry-e2e.yml to run daily, not every Sunday? There seems to be a gap there.
Thanks for your review @nikhilsuri-db
Add @pytest.mark.xdist_group to telemetry test classes to ensure they run sequentially on the same worker when using pytest-xdist (-n auto).

Root cause: Tests marked @pytest.mark.serial were still being parallelized in CI because pytest-xdist doesn't respect custom markers by default. With host-level telemetry batching (PR #718), tests running in parallel would share the same TelemetryClient and interfere with each other's event counting, causing test_concurrent_queries_sends_telemetry to see 88 events instead of the expected 60.

The xdist_group marker ensures all tests in the "serial_telemetry" group run on the same worker sequentially, preventing state interference.

Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>
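A sketch of the marker usage; the class name is illustrative, while `test_concurrent_queries_sends_telemetry` and the "serial_telemetry" group come from the commit message:

```python
import pytest


@pytest.mark.xdist_group(name="serial_telemetry")
class TestConcurrentTelemetry:
    # All classes in the same xdist_group run sequentially on one worker,
    # so they no longer share a live TelemetryClient with a parallel test.
    def test_concurrent_queries_sends_telemetry(self):
        ...
```

Note that the group is only honored when pytest-xdist is invoked with `--dist=loadgroup` (added in a later commit); with the default `--dist=load` the marker is ignored.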
* Fix #729 and #731: Telemetry lifecycle management

  Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>

* Address review comments: revert timeout and telemetry_enabled changes

  Per reviewer feedback on PR #734:

  1. Revert timeout from 30s back to 900s (line 299)
     - Reviewer noted that with wait=False, timeout is not critical
     - The async nature and wait=False handle the exit speed
  2. Revert telemetry_enabled parameter back to True (line 734)
     - Reviewer noted this is redundant given the early return
     - If enable_telemetry=False, we return early (line 729)
     - Line 734 only executes when enable_telemetry=True
     - Therefore using the parameter here is unnecessary

  These changes address the reviewer's valid technical concerns while keeping the core fixes intact:
  - wait=False for non-blocking shutdown (critical for Issue #729)
  - Early return when enable_telemetry=False (critical for Issue #729)
  - All Issue #731 fixes (null-safety, __del__, documentation)

  Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>

* Fix Black formatting violations

  Apply Black formatting to files modified in previous commits:
  - src/databricks/sql/common/unified_http_client.py
  - src/databricks/sql/telemetry/telemetry_client.py

  Changes are purely cosmetic (quote style consistency).

  Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>

* Fix CI test failure: Prevent parallel execution of telemetry tests

  Add @pytest.mark.xdist_group to telemetry test classes to ensure they run sequentially on the same worker when using pytest-xdist (-n auto).

  Root cause: Tests marked @pytest.mark.serial were still being parallelized in CI because pytest-xdist doesn't respect custom markers by default. With host-level telemetry batching (PR #718), tests running in parallel would share the same TelemetryClient and interfere with each other's event counting, causing test_concurrent_queries_sends_telemetry to see 88 events instead of the expected 60.

  The xdist_group marker ensures all tests in the "serial_telemetry" group run on the same worker sequentially, preventing state interference.

  Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix telemetry test fixtures: Clean up state before AND after tests

  Modified telemetry_setup_teardown fixtures to clean up TelemetryClientFactory state both BEFORE and AFTER each test, not just after. This prevents leftover state from previous tests (pending events, active executors) from interfering with the current test.

  Root cause: In CI with sequential execution on the same worker, if a previous test left pending telemetry events in the executor, those events could be captured by the next test's mock, causing inflated event counts (88 instead of 60).

  Now ensures complete isolation between tests by resetting all shared state before each test starts.

  Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix CI test failure: Clear _flush_event between tests

  The _flush_event threading.Event was never cleared after stopping the flush thread, remaining in "set" state. This caused timing issues in subsequent tests where the Event was already signaled, triggering unexpected flush behavior and causing extra telemetry events to be captured (88 instead of 60).

  Now explicitly clear the _flush_event flag in both setup (before test) and teardown (after test) to ensure clean state isolation between tests.

  This explains why CI consistently got 88 events - the flush_event from previous tests triggered additional flushes during test execution.

  Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add debug workflow and output to diagnose CI test failure

  1. Created new workflow 'test-telemetry-only.yml' that runs only the failing telemetry test with -n auto, mimicking real CI but much faster
  2. Added debug output to the test showing:
     - Client-side captured events
     - Number of futures/batches
     - Number of server responses
     - Server-reported successful events

  This will help identify why CI gets 88 events vs local 60 events.

  Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix workflow: Add krb5 system dependency

  The workflow was failing during poetry install due to missing krb5 system libraries needed for kerberos dependencies.

  Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix xdist_group: Add --dist=loadgroup to pytest commands

  The @pytest.mark.xdist_group markers were being ignored because pytest-xdist uses --dist=load by default, which doesn't respect groups. With --dist=loadgroup, tests in the same xdist_group run sequentially on the same worker, preventing telemetry state interference between tests.

  This is the ROOT CAUSE of the 88 vs 60 events issue - tests were running in parallel across workers instead of sequentially on one worker as intended.

  Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add aggressive flush before test to prevent event interference

  CI shows 72 events instead of 60. Debug output reveals:
  - Client captured: 60 events (correct)
  - Server received: 72 events across 2 batches

  The 12 extra events accumulate in the timing window between fixture cleanup and mock setup. Other tests (like circuit breaker tests not in our xdist_group) may be sending telemetry concurrently.

  Solution: Add an explicit flush+shutdown RIGHT BEFORE setting up the mock to ensure a completely clean slate with zero buffered events.

  Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

* Split workflow: Isolate telemetry tests in separate job

  To prevent interference from other e2e tests, split into two jobs:

  Job 1 (run-non-telemetry-tests):
  - Runs all e2e tests EXCEPT telemetry tests
  - Uses -n auto for parallel execution

  Job 2 (run-telemetry-tests):
  - Runs ONLY telemetry tests
  - Depends on Job 1 completing (needs: run-non-telemetry-tests)
  - Fresh Python process = complete isolation
  - No ambient telemetry from other tests

  This eliminates the 68 vs 60 event discrepancy by ensuring telemetry tests run in a clean environment with zero interference.

  Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix workflows: Add krb5 deps and cleanup debug code

  Changes across multiple workflows:

  1. integration.yml:
     - Add krb5 system dependency to telemetry job
     - Fixes: krb5-config command not found error during poetry install
  2. code-coverage.yml:
     - Add krb5 system dependency
     - Split telemetry tests into separate step for isolation
     - Maintains coverage accumulation with --cov-append
  3. publish-test.yml:
     - Add krb5 system dependency for consistent builds
  4. test_concurrent_telemetry.py:
     - Remove debug print statements
  5. Delete test-telemetry-only.yml:
     - Remove temporary debug workflow

  All workflows now have proper telemetry test isolation and required system dependencies for kerberos packages.

  Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix publish-test.yml: Update Python 3.9 -> 3.10

  Poetry 2.3.2 installation fails with Python 3.9: "Installing Poetry (2.3.2): An error occurred." Other workflows use Python 3.10 and work fine. Updating to match ensures consistency and avoids Poetry installation issues.

  Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix integration workflow: Remove --dist=loadgroup from non-telemetry tests

  - Remove --dist=loadgroup from non-telemetry job (only needed for telemetry)
  - Remove test_telemetry_e2e.py from telemetry job (was skipped before)
  - This should fix test_uc_volume_life_cycle failure caused by changed test distribution

* Fix code-coverage workflow: Remove test_telemetry_e2e.py from coverage tests

  - Only run test_concurrent_telemetry.py in isolated telemetry step
  - test_telemetry_e2e.py was excluded in original workflow, keep it excluded

* Fix publish-test workflow: Remove cache conditional

  - Always run poetry install (not just on cache miss)
  - Ensures fresh install with system dependencies (krb5)
  - Matches pattern used in integration.yml

* Fix publish-test.yml: Remove duplicate krb5 install, restore cache conditional

  - Remove duplicate system dependencies step
  - Restore cache conditional to match main branch
  - Keep Python 3.10 (our change from 3.9)

* Fix code-coverage: Remove serial tests step

  - All serial tests are telemetry tests (test_concurrent_telemetry.py and test_telemetry_e2e.py)
  - They're already run in the isolated telemetry step
  - Running -m serial with --ignore on both files results in 0 tests (exit code 5)

---------

Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>
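A hedged sketch of the fixture pattern from these commits: reset shared TelemetryClientFactory state both before and after each test and clear the `_flush_event` flag. The attribute names follow the commit messages; the reset helper itself and the `_clients` container are illustrative assumptions:

```python
import pytest


def _reset_telemetry_state(factory) -> None:
    """Illustrative reset of shared TelemetryClientFactory state."""
    flush_event = getattr(factory, "_flush_event", None)
    if flush_event is not None:
        # Otherwise a previous test's signal can trigger extra flushes mid-test.
        flush_event.clear()
    clients = getattr(factory, "_clients", None)  # hypothetical host_url -> client map
    if clients:
        clients.clear()


@pytest.fixture
def telemetry_setup_teardown():
    from databricks.sql.telemetry.telemetry_client import TelemetryClientFactory

    _reset_telemetry_state(TelemetryClientFactory)  # clean state BEFORE the test
    yield
    _reset_telemetry_state(TelemetryClientFactory)  # and AFTER it
```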
There were 2 gaps in the Python telemetry client before:
Key changes:
Before: 100 connections to same host = 100 separate TelemetryClients
After: 100 connections to same host = 1 shared TelemetryClient (refcount=100)
This fixes rate limiting issues seen in e2e tests where 300+ parallel connections were overwhelming the telemetry endpoint with 429 errors.