
Add vLLM‑style Runtime Metrics (Inference + Training) with Opt‑In Telemetry#3897

Open
hnxnq7 wants to merge 16 commits into unslothai:main from hnxnq7:metrics-collection-clean

Conversation

@hnxnq7 (Contributor) commented Jan 16, 2026

Add vLLM-style Runtime Metrics for Inference & Training (with Optional Telemetry)

This PR adds an opt-in runtime metrics system to Unsloth, inspired by vLLM’s metrics architecture, with optional Prometheus export and optional server-side telemetry forwarding.

What this enables

  • Inference metrics: request counts, token counts, throughput, latency histograms (E2E, prefill, decode)
  • Training metrics: steps, samples/sec, loss, LR, gradient norm, forward/backward timing
  • Prometheus support (optional) with /metrics HTTP endpoint
  • Programmatic access to metrics (no server required)
  • Optional telemetry forwarding of aggregated metrics to Unsloth servers
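The latency buckets above (E2E, prefill, decode) can be pictured with a small per-request timing record. This is a hedged sketch: the field names mirror the metric list but are assumptions, not the PR's actual schema in `unsloth/metrics/stats.py`.

```python
import time
from dataclasses import dataclass, field

@dataclass
class RequestMetrics:
    """Per-request timing sketch (field names are illustrative)."""
    prompt_tokens: int = 0
    generated_tokens: int = 0
    start: float = field(default_factory=time.perf_counter)
    first_token_at: float = 0.0   # prefill ends at the first generated token
    end: float = 0.0

    @property
    def prefill_latency(self) -> float:
        return self.first_token_at - self.start

    @property
    def decode_latency(self) -> float:
        return self.end - self.first_token_at

    @property
    def e2e_latency(self) -> float:
        return self.end - self.start
```

By construction, E2E latency is the sum of the prefill and decode components, which is what makes the three histograms directly comparable.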

How it works

  • Metrics are disabled by default
  • Calling enable_prometheus_metrics() automatically instruments:
    • unsloth_base_fast_generate() (inference)
    • Trainer.training_step() via a patch hook (training)
  • Telemetry forwarding is opt-in and non-blocking
    • Enabled via UNSLOTH_ENABLE_METRICS_TELEMETRY=1 or enable_telemetry()
    • Can be disabled via UNSLOTH_DISABLE_METRICS_TELEMETRY=1
  • No user code changes required beyond enabling metrics
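The opt-in/opt-out precedence implied by the two environment variables above could look like the following sketch. The helper name `telemetry_enabled` is an assumption for illustration, not the PR's actual API.

```python
import os

os.environ["UNSLOTH_ENABLE_METRICS_TELEMETRY"] = "1"       # opt in
os.environ.pop("UNSLOTH_DISABLE_METRICS_TELEMETRY", None)  # ensure no opt-out is set

def telemetry_enabled() -> bool:
    """Opt-out takes precedence over opt-in."""
    if os.environ.get("UNSLOTH_DISABLE_METRICS_TELEMETRY") == "1":
        return False
    return os.environ.get("UNSLOTH_ENABLE_METRICS_TELEMETRY") == "1"

print(telemetry_enabled())  # → True
```

Giving the disable flag priority means a deployment-wide opt-out cannot be accidentally overridden by a library call to `enable_telemetry()`.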

Key design points

  • Fully opt-in, no breaking changes
  • Graceful degradation if prometheus_client is not installed
  • Lightweight + low overhead
  • Inspired by vLLM’s metrics model, adapted to Transformers-based pipelines
  • Thread-safe singleton pattern
  • Handles ModelOutput objects when return_dict_in_generate=True
  • Telemetry sends aggregated stats only (counts / averages, no raw prompts or user data)
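The thread-safe singleton mentioned above is commonly implemented with double-checked locking; here is a minimal sketch. It is illustrative only — the PR's real collector lives in `unsloth/metrics/stats.py`.

```python
import threading

class StatsCollectorSketch:
    """Thread-safe singleton via double-checked locking (illustrative)."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:        # fast path avoids lock contention
            with cls._lock:              # double-checked: re-test under the lock
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance

assert StatsCollectorSketch() is StatsCollectorSketch()
```

The unlocked first check keeps the hot path cheap, which matters for a collector touched on every request and training step.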

Files changed (13 files)

  • New module: unsloth/metrics/ (6 files)
    • stats.py – Core statistics tracking (InferenceStats, TrainingStats, StatsCollector)
    • prometheus.py – Prometheus export with Counter / Gauge / Histogram metrics
    • server.py – Optional HTTP server for metrics scraping
    • telemetry.py – Optional background telemetry sender (aggregated stats only)
    • README.md – Documentation
  • Training hook: _patch_training_metrics() in unsloth/models/_utils.py
  • Inference hook: unsloth_base_fast_generate() in unsloth/models/vision.py
  • Public API exports: via unsloth/__init__.py
  • Tests: tests/metrics/test_metrics_standalone.py (all passing)
  • Dependencies: pyproject.toml (optional prometheus_client)
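Since `prometheus_client` is an optional dependency, the graceful-degradation path presumably follows the standard guarded-import pattern. A sketch under that assumption (the flag and function names here are illustrative, not the PR's code):

```python
# Degrade gracefully when prometheus_client is absent: programmatic
# stats keep working, only the Prometheus export is skipped.
try:
    from prometheus_client import Counter, Gauge, Histogram
    PROMETHEUS_AVAILABLE = True
except ImportError:
    Counter = Gauge = Histogram = None
    PROMETHEUS_AVAILABLE = False

def enable_prometheus_export_sketch() -> bool:
    """Return whether Prometheus export could actually be enabled."""
    if not PROMETHEUS_AVAILABLE:
        return False  # stats remain available via get_all_stats()
    return True

print(enable_prometheus_export_sketch())
```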

Quick usage

from unsloth import enable_prometheus_metrics, get_stats_collector

enable_prometheus_metrics()

# run inference / training as usual

stats = get_stats_collector().get_all_stats()
print(stats["inference"])  # request counts, latencies, tokens/sec
print(stats["training"])   # steps, loss, samples/sec


Notes

  • Telemetry is opt-in by default (can be flipped easily if preferred)
  • Uses a background sender (non-blocking, silent failures)
  • Endpoint is configurable via UNSLOTH_METRICS_TELEMETRY_ENDPOINT
  • Current default endpoint is a placeholder pending server-side confirmation
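A non-blocking sender with silent failures, as the notes describe, typically pairs a bounded queue with a daemon worker thread. The sketch below assumes that shape; the queue size, timeout, and class name are illustrative, not the PR's actual values.

```python
import json
import os
import queue
import threading
import urllib.request

class TelemetrySenderSketch:
    """Background sender: never blocks the caller, never raises."""

    def __init__(self):
        self.endpoint = os.environ.get("UNSLOTH_METRICS_TELEMETRY_ENDPOINT", "")
        self._queue = queue.Queue(maxsize=100)
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def send(self, aggregated_stats: dict) -> None:
        try:
            self._queue.put_nowait(aggregated_stats)  # never block the caller
        except queue.Full:
            pass  # drop silently rather than stall inference/training

    def _worker(self):
        while True:
            payload = json.dumps(self._queue.get()).encode()
            try:
                req = urllib.request.Request(
                    self.endpoint, data=payload,
                    headers={"Content-Type": "application/json"},
                )
                urllib.request.urlopen(req, timeout=2)
            except Exception:
                pass  # silent failure by design
```

The daemon flag matters: a hung endpoint can never keep the training process from exiting.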

Testing

All standalone metrics tests pass (tests/metrics/test_metrics_standalone.py). No breaking changes; purely additive.

Rachel Li and others added 14 commits January 15, 2026 17:07
- Comprehensive metrics collection system (inference + training)
- Prometheus-compatible export with optional HTTP server
- Programmatic access to metrics
- Automatic instrumentation of inference and training loops
- All tests passing
- Resolved merge conflict in vision.py (kept ModelOutput handling)
- Added documentation notes about estimated vs measured metrics
- Improved finish_reason detection
- All syntax errors resolved
@gemini-code-assist (Contributor)

Summary of Changes

Hello @hnxnq7, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a sophisticated, opt-in runtime metrics system into Unsloth, drawing inspiration from vLLM's architecture. It enables users to gain deep insights into the performance of their inference and training workloads by automatically collecting a wide array of statistics, including latencies, token counts, throughput, loss, and learning rates. The system offers flexible access to these metrics through programmatic APIs, optional Prometheus export with a dedicated HTTP server, and a privacy-conscious, opt-in telemetry mechanism for aggregated data. This enhancement empowers users with better monitoring and optimization capabilities for their Unsloth-powered applications.

Highlights

  • Comprehensive Runtime Metrics: Introduces an opt-in system for collecting detailed inference and training performance metrics, inspired by vLLM.
  • Inference & Training Data: Tracks key metrics such as request counts, token throughput, E2E/prefill/decode latencies for inference, and steps, samples/sec, loss, LR, and gradient norm for training.
  • Prometheus Integration: Provides optional Prometheus-compatible metrics export via standard Counter, Gauge, and Histogram types, including an optional HTTP server with a '/metrics' endpoint.
  • Opt-in Telemetry: Implements an optional, non-blocking background telemetry sender for aggregated statistics to Unsloth servers, configurable via environment variables.
  • Automatic Instrumentation: Automatically hooks into 'unsloth_base_fast_generate()' for inference and 'Trainer.training_step()' for training, requiring minimal user code changes to enable.
  • Robust Design: Features a lightweight, low-overhead, thread-safe singleton pattern, graceful degradation if 'prometheus_client' is absent, and no breaking changes.



@gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive, opt-in runtime metrics system for both inference and training, inspired by vLLM's architecture. The changes are well-structured into a new unsloth/metrics module, providing features like Prometheus export, an optional HTTP server, and non-blocking telemetry. The implementation is robust, with graceful degradation for optional dependencies and careful patching to instrument the training and inference pipelines.

My review focuses on improving maintainability and fixing a potential bug in the metrics collection logic. I've identified a dependency on a private API in the Prometheus integration and a confusing and potentially buggy section for determining the finish_reason in inference metrics. Overall, this is an excellent and well-documented feature addition.
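The `finish_reason` logic the review flags could, for instance, be structured as a single classification per sequence. This is purely an illustrative take; the names below are not the PR's actual code.

```python
def detect_finish_reason(new_token_ids, eos_token_id, max_new_tokens):
    """Classify why generation stopped for one sequence."""
    if new_token_ids and new_token_ids[-1] == eos_token_id:
        return "stop"       # model emitted EOS
    if len(new_token_ids) >= max_new_tokens:
        return "length"     # hit the generation budget
    return "unknown"        # e.g. external abort or stop strings

print(detect_finish_reason([5, 6, 2], eos_token_id=2, max_new_tokens=10))  # → stop
```

Checking EOS before the length cap resolves the ambiguous case where the model emits EOS exactly at the token budget.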

Comment on lines +80 to +83

    def _get_existing_collector(metric_name: str):
        if REGISTRY is None:
            return None
        return REGISTRY._names_to_collectors.get(metric_name)  # type: ignore[attr-defined]
Severity: medium

Accessing the internal _names_to_collectors attribute of the Prometheus registry is a bit fragile, as it's not part of the public API and could change in future versions of prometheus-client, potentially breaking this code. While this is a common and pragmatic workaround to handle metric re-registration issues in environments like Jupyter notebooks, it's worth being aware of the maintainability risk for future library updates.
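One public-API-only alternative to reading `_names_to_collectors` is to keep the module's own name-to-collector map and never ask the registry at all. A sketch (using a stand-in factory so it runs without `prometheus_client` installed):

```python
# Cache our own handles so re-registration in notebooks is a no-op
# and no private registry attribute is ever touched.
_collectors = {}

def get_or_create(name, factory):
    """Create the collector once; later calls return the cached handle."""
    if name not in _collectors:
        _collectors[name] = factory()
    return _collectors[name]

c1 = get_or_create("unsloth_requests_total", dict)
c2 = get_or_create("unsloth_requests_total", dict)
assert c1 is c2  # same object, no registry introspection needed
```

The trade-off is a second source of truth alongside the registry, but it only ever references objects this module created itself.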

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6c41a8713a

