Add MLflow support and expose logging configuration in TrainingArgs#680
Conversation
📝 Walkthrough

Adds optional MLflow, Weights & Biases, and TensorBoard logging: new `TrainingArgs` fields, an `MLflowHandler` and MLflow wiring in the metric logger, an expanded `setup_metric_logger` signature and CLI flags, optional dependency files, and README install notes.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as main_ds.py
    participant Setup as setup_metric_logger()
    participant Logger as MetricLogger
    participant MLflow as MLflow
    participant WandB as WandB
    participant TB as TensorBoard
    User->>CLI: start training with logging flags
    CLI->>Setup: setup_metric_logger(output_dir, mlflow_..., wandb_..., tensorboard_log_dir)
    Setup->>Logger: configure handlers (MLflowHandler, WandB, TensorBoard, Async)
    Logger->>MLflow: _setup() -> set_tracking_uri / set_experiment / start_run
    Logger->>WandB: init run (project, entity, run_name)
    Logger->>TB: create writer (log_dir)
    CLI->>Logger: emit metrics/hparams LogRecord
    Logger->>MLflow: emit -> log_metrics / log_params
    Logger->>WandB: emit -> log metrics/params
    Logger->>TB: emit -> write scalars
    CLI->>Logger: shutdown
    Logger->>MLflow: close -> end_run()
    Logger->>WandB: finish run
    Logger->>TB: close writer
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.
## New Parameters
- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration
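
For illustration, a hedged sketch of how these flat kwargs might be passed to `sft()`. The import path and the non-logging arguments (model/data paths, output dir) are placeholders, not the real API:

```python
from training_hub import sft  # hypothetical import path

sft(
    model_path="./model",             # placeholder
    data_path="./data/train.jsonl",   # placeholder
    ckpt_output_dir="./checkpoints",  # placeholder
    # logging configuration exposed as flat kwargs
    run_name="sft_{time}_rank{rank}",
    log_level="INFO",
    logging_steps=10,
    wandb_project="my-project",
    wandb_entity="my-team",
    tensorboard_log_dir="./logs/tensorboard",
    mlflow_tracking_uri="http://localhost:5000",
    mlflow_experiment_name="sft-experiments",
)
```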
## Backend Support
| Logger | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb | Yes | Yes | Yes |
| tensorboard | Yes | No | Yes |
| mlflow | Yes | No | Yes |
| jsonl | Yes | Yes | No |
OSFT emits warnings for unsupported loggers/params and continues.
Depends on: instructlab/training#680
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/instructlab/training/main_ds.py`:
- Around line 275-283: The call to setup_metric_logger in main() uses
unnecessary defensive getattr() for mlflow/wandb fields; replace getattr(args,
"mlflow_tracking_uri", None), getattr(args, "mlflow_experiment_name", None),
getattr(args, "wandb_project", None), and getattr(args, "wandb_entity", None)
with direct attribute access args.mlflow_tracking_uri,
args.mlflow_experiment_name, args.wandb_project, and args.wandb_entity
respectively so it matches the pattern used in run_training() and with
train_args.
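
As a minimal before/after sketch of that change (only the keyword arguments come from the prompt; the surrounding call shape in `main()` is assumed):

```python
# Before: defensive lookups, even though argparse always defines these attributes
setup_metric_logger(
    args.output_dir,
    mlflow_tracking_uri=getattr(args, "mlflow_tracking_uri", None),
    mlflow_experiment_name=getattr(args, "mlflow_experiment_name", None),
    wandb_project=getattr(args, "wandb_project", None),
    wandb_entity=getattr(args, "wandb_entity", None),
)

# After: direct attribute access, matching run_training() and train_args usage
setup_metric_logger(
    args.output_dir,
    mlflow_tracking_uri=args.mlflow_tracking_uri,
    mlflow_experiment_name=args.mlflow_experiment_name,
    wandb_project=args.wandb_project,
    wandb_entity=args.wandb_entity,
)
```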
🧹 Nitpick comments (2)
src/instructlab/training/logger.py (2)
638-665: Unused `log_dir` parameter.

The `log_dir` parameter is stored as `self.log_dir` but never used in `_setup()` or elsewhere. The docstring mentions it's "used as artifact location" but the implementation doesn't pass it to MLflow. Either use it to set the artifact location or remove it to avoid confusion.

♻️ Option 1: Use log_dir as artifact location

```diff
 def _setup(self):
     """Initialize the MLflow run with the configured settings."""
     if mlflow is None:
         msg = (
             "Could not initialize MLflowHandler because package mlflow could not be imported.\n"
             "Please ensure it is installed by running 'pip install mlflow'"
         )
         raise RuntimeError(msg)
     if self.tracking_uri:
         mlflow.set_tracking_uri(self.tracking_uri)
     if self.experiment_name:
-        mlflow.set_experiment(self.experiment_name)
+        mlflow.set_experiment(
+            self.experiment_name,
+            artifact_location=str(self.log_dir),
+        )
     self._mlflow_run = mlflow.start_run(
         run_name=self.run_name, **self.mlflow_init_kwargs
     )
```

♻️ Option 2: Remove unused parameter

```diff
 def __init__(
     self,
     level: int = logging.INFO,
     run_name: str | None = None,
-    log_dir: str | os.PathLike = "logs",
     tracking_uri: str | None = None,
     experiment_name: str | None = None,
     **mlflow_init_kwargs: Any,
 ):
     """Initialize the MLflow logger and check for required dependencies.

     Args:
         level: The logging level for this handler
         run_name: Name of the run, can contain placeholders
-        log_dir: Directory where MLflow artifacts should be stored (used as artifact location)
         tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000")
         experiment_name: Name of the MLflow experiment
         **mlflow_init_kwargs: Additional keyword arguments passed to mlflow.start_run()
     """
     super().__init__(level)
     self.run_name = _substitute_placeholders(run_name)
-    self.log_dir = Path(log_dir)
     self.tracking_uri = tracking_uri
     self.experiment_name = experiment_name
     self.mlflow_init_kwargs = mlflow_init_kwargs.copy()
     self._mlflow_run = None
```

Note: If removing `log_dir`, also update `setup_metric_logger` to not pass it to the MLflow handler config.
711-721: Consider adding a debug log for skipped non-numeric metrics.

Non-numeric values are silently skipped. For consistency with `TensorBoardHandler` (which warns on type errors), consider adding a debug-level message to help users understand why certain values aren't appearing in MLflow metrics.

♻️ Proposed change

```diff
 # Filter to only numeric values for metrics
 metrics_dict = {}
 for k, v in flat_dict.items():
     try:
         metrics_dict[k] = float(v)
     except (ValueError, TypeError):
         # Skip non-numeric values for metrics
-        pass
+        warnings.warn(
+            f"MLflowHandler skipping non-numeric metric '{k}' with value {type(v).__name__}",
+            stacklevel=2,
+        )
```
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/instructlab/training/logger.py (1)
862-979: ⚠️ Potential issue | 🟡 Minor

Enable MLflow when experiment/run name is provided (not only tracking URI).

Currently, MLflow logging won't activate if users only set `mlflow_experiment_name` or `mlflow_run_name`, even though MLflow can operate with local filesystem storage when no tracking URI is specified. The detection logic should include these fields and the `MLFLOW_EXPERIMENT_NAME` environment variable to align with the design pattern used for other backends (e.g., wandb is enabled when `wandb_project` is set).

Proposed fix

```diff
-    if mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"):
+    if (
+        mlflow_tracking_uri
+        or mlflow_experiment_name
+        or mlflow_run_name
+        or os.environ.get("MLFLOW_TRACKING_URI")
+        or os.environ.get("MLFLOW_EXPERIMENT_NAME")
+    ):
         detected_loggers.append("mlflow")
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/instructlab/training/logger.py (2)
1-46: ⚠️ Potential issue | 🟡 Minor

Update the module docstring example to match the new setup API.

The example still uses `loggers=[...]` and `run_name=...`, but `setup_metric_logger` now relies on auto-detection and backend-specific kwargs. This will mislead users.

📝 Suggested docstring fix

```diff
-    # Setup logging with TensorBoard and wandb
-    setup_metric_logger(
-        loggers=["tensorboard", "wandb"],
-        run_name="my_training_run",
-        output_dir="logs"
-    )
+    # Setup logging with TensorBoard and wandb (auto-detected)
+    setup_metric_logger(
+        output_dir="logs",
+        tensorboard_log_dir="logs/tensorboard",
+        wandb_project="my_project",
+        wandb_run_name="my_training_run",
+    )
```
862-985: ⚠️ Potential issue | 🟠 Major

Run name no longer wired to async/tensorboard (likely regression).

`setup_metric_logger` used to accept `run_name`, but it's now missing and both async/tensorboard handlers hardcode `run_name=None`. This means the new `TrainingArgs.run_name` won't affect those backends. If the intent is a global run name, please reintroduce it and apply it as a fallback for backend-specific names.

✅ Suggested wiring

```diff
 def setup_metric_logger(
     output_dir,
     *,
+    run_name: str | None = None,
     mlflow_tracking_uri: str | None = None,
     mlflow_experiment_name: str | None = None,
     mlflow_run_name: str | None = None,
     wandb_project: str | None = None,
     wandb_entity: str | None = None,
     wandb_run_name: str | None = None,
     tensorboard_log_dir: str | None = None,
 ):
@@
     "handlers": {
         "async": {
             "()": AsyncStructuredHandler,
             "log_dir": output_dir,
-            "run_name": None,  # Uses default template
+            "run_name": run_name,  # Uses default template if None
             "filters": async_filters,
         },
         "tensorboard": {
             "()": TensorBoardHandler,
             "log_dir": tensorboard_log_dir or output_dir,
-            "run_name": None,  # Uses default template
+            "run_name": run_name,  # Uses default template if None
             "filters": ["is_mapping", "is_rank0"],
         },
         "wandb": {
             "()": WandbHandler,
             "log_dir": output_dir,
-            "run_name": wandb_run_name,
+            "run_name": wandb_run_name or run_name,
             "project": wandb_project,
             "entity": wandb_entity,
             "filters": ["is_mapping", "is_rank0"],
         },
         "mlflow": {
             "()": MLflowHandler,
-            "run_name": mlflow_run_name,
+            "run_name": mlflow_run_name or run_name,
             "tracking_uri": mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"),
             "experiment_name": mlflow_experiment_name or os.environ.get("MLFLOW_EXPERIMENT_NAME"),
             "filters": ["is_mapping", "is_rank0"],
         },
```
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 664-681: The _setup method currently calls mlflow.start_run()
unconditionally which raises if an MLflow run is already active; update _setup
to check mlflow.active_run() first and handle both cases: if mlflow.active_run()
exists, set self._mlflow_run = mlflow.active_run() to reuse it, otherwise call
mlflow.start_run(run_name=self.run_name, **self.mlflow_init_kwargs); optionally,
if you prefer nesting instead of reusing, call mlflow.start_run(...,
nested=True) when an active run exists—make this change around the existing
self._mlflow_run assignment and mlflow.start_run call in _setup.
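
As a sketch of the reuse-or-start behaviour described above (assuming the handler's existing `run_name` / `mlflow_init_kwargs` attributes and `mlflow` imported at module level; not the merged implementation):

```python
def _setup(self):
    """Reuse an already-active MLflow run instead of letting start_run() raise."""
    active = mlflow.active_run()
    if active is not None:
        # Adopt the caller's run; alternatively pass nested=True to start_run()
        # if a child run is preferred.
        self._mlflow_run = active
    else:
        self._mlflow_run = mlflow.start_run(
            run_name=self.run_name, **self.mlflow_init_kwargs
        )
```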
Pass tensorboard_log_dir through to TrainingArgs
Update SFT backend to pass tensorboard_log_dir to TrainingArgs now that
instructlab-training supports configurable TensorBoard log directories.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable MLflow support for OSFT backend
- Add "mlflow" to SUPPORTED_LOGGERS
- Remove mlflow params from UNSUPPORTED_PARAMS
- Add mlflow_run_name parameter throughout
- Update docstrings to reflect MLflow is now supported
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address PR review comments
- Replace hardcoded data path with required=True in sft_mlflow_example.py
- Add stacklevel=2 to warnings.warn calls in osft.py
- Rename ambiguous loop variable 'l' to 'lg' in sft.py
- Add warnings for unsupported logging params in sft.py backend
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add OSFT MLflow example script
Similar to sft_mlflow_example.py but for OSFT training with
unfreeze_rank_ratio and other OSFT-specific parameters.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add logging and experiment tracking documentation
Create comprehensive documentation for the unified logging API:
- MLflow, Weights & Biases, TensorBoard, and JSONL logging
- Configuration precedence (kwarg > env var > defaults)
- Backend-specific notes and supported loggers
- Example usage for all algorithms (SFT, OSFT, LoRA)
- Run naming with placeholder support
Update sidebar to include the new logging guide.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove duplicate warnings import in sft.py
The warnings module is already imported at module level (line 2).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address review nitpicks in docs and examples
- Add language specifier to fenced code block in logging.md
- Remove unused result variable in example scripts
- Remove unnecessary wandb_run_name from MLflow-focused OSFT example
- Add comment clarifying /dev/shm usage for Linux shared memory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Auto-detect loggers from config parameters
Loggers are now automatically enabled based on their configuration:
- mlflow_tracking_uri or MLFLOW_TRACKING_URI env → enables MLflow
- wandb_project or WANDB_PROJECT env → enables W&B
- tensorboard_log_dir → enables TensorBoard (SFT only)
The 'loggers' parameter is deprecated and emits a DeprecationWarning
when used. Updated API documentation with new logging configuration
sections and removed outdated logging guide.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
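
Roughly, the detection this commit describes amounts to something like the following sketch (the helper name is hypothetical; only the rules come from the commit message):

```python
import os

def _detect_loggers(mlflow_tracking_uri=None, wandb_project=None, tensorboard_log_dir=None):
    detected = []
    if mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"):
        detected.append("mlflow")
    if wandb_project or os.environ.get("WANDB_PROJECT"):
        detected.append("wandb")
    if tensorboard_log_dir:  # SFT only, per the commit message
        detected.append("tensorboard")
    return detected
```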
Address PR review feedback
- Fix markdown table alignment in docs for markdownlint compliance
- Change osft_mlflow_example.py default checkpoint dir from /tmp to ./
- Document that run_name is not supported by mini-trainer backend
- Add stacklevel=2 to block_size warning in osft.py
- Update loggers docstring to indicate deprecated status
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
remove extraneous api kwargs
removes more extraneous api additions
reintroduce mlflow_run_name into lora
docstrings
drop formatting changes
read envs when present
review comments
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 732-737: The handler currently always calls mlflow.end_run() in
close(), which can end runs it didn't start; add a boolean flag (e.g.,
self._owns_mlflow_run) that is set to True only when this handler creates/starts
an MLflow run (where mlflow.start_run() is invoked or when self._mlflow_run is
assigned) and False when reusing an existing active run, then modify close() to
call mlflow.end_run() only if self._owns_mlflow_run is True and clear both
self._mlflow_run and self._owns_mlflow_run during cleanup; ensure the flag is
updated wherever the run is started or adopted so ownership is tracked
correctly.
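
A sketch of the ownership tracking this prompt describes, building on the reuse logic above; the method names and the base-class `close()` call are assumptions:

```python
def _setup(self):
    active = mlflow.active_run()
    if active is not None:
        self._mlflow_run = active
        self._owns_mlflow_run = False  # adopted an existing run; don't end it later
    else:
        self._mlflow_run = mlflow.start_run(
            run_name=self.run_name, **self.mlflow_init_kwargs
        )
        self._owns_mlflow_run = True

def close(self):
    # Only end runs this handler started itself.
    if getattr(self, "_owns_mlflow_run", False):
        mlflow.end_run()
    self._mlflow_run = None
    self._owns_mlflow_run = False
    super().close()
```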
🧹 Nitpick comments (1)
src/instructlab/training/logger.py (1)
942-946: Simplify the async logger inclusion logic.

The current logic is a bit convoluted. When `detected_loggers` is empty, line 943 sets `loggers = ["async"]`, then lines 945-946 check if async is not in loggers (which it already is). When `detected_loggers` is non-empty, it duplicates the list reference then appends.

♻️ Proposed simplification

```diff
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging
+    loggers = detected_loggers + ["async"] if detected_loggers else ["async"]
```

Or even simpler:

```diff
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging alongside other loggers
+    loggers = [*detected_loggers, "async"]
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 873-953: setup_metric_logger currently auto-enables backends based
only on backend-specific args and unconditionally ensures "async" is present,
and it doesn't propagate a global run_name to all backends; update it to accept
and honor an explicit logger selection (e.g., logger_type or logger_list
argument) so callers can request "tensorboard" or "async" even when
backend-specific fields are empty, only append "async" when explicitly requested
(or present in the explicit list), and ensure a global run_name parameter (or
run_name fallback) is forwarded into backend-specific settings (mlflow_run_name,
wandb_run_name, tensorboard handler config and any async file logger) so every
backend receives the same run name; change logic around detected_loggers/loggers
and the wiring to backend initializers (functions that configure
mlflow/wandb/tensorboard and the async handler) accordingly and apply the same
fix to the analogous block referenced around 968-997.
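
For illustration, a sketch of the behaviour this prompt asks for. The parameter name `logger_list` and the returned config shape are hypothetical; only the selection and run-name fallback rules come from the prompt:

```python
def setup_metric_logger(
    output_dir,
    *,
    logger_list=None,   # hypothetical explicit selection, e.g. ["tensorboard", "async"]
    run_name=None,      # global run name, forwarded to every backend as a fallback
    mlflow_tracking_uri=None,
    mlflow_run_name=None,
    wandb_project=None,
    wandb_run_name=None,
    tensorboard_log_dir=None,
):
    """Sketch only; not the merged signature."""
    if logger_list:
        # Honor the caller's explicit choice, including "async"/"tensorboard",
        # even when no backend-specific fields are set.
        loggers = list(logger_list)
    else:
        # Auto-detect from backend-specific settings; "async" is only included
        # when explicitly requested, per the review prompt.
        loggers = []
        if mlflow_tracking_uri:
            loggers.append("mlflow")
        if wandb_project:
            loggers.append("wandb")
        if tensorboard_log_dir:
            loggers.append("tensorboard")

    # Forward the global run name into each backend's config as a fallback.
    handler_config = {
        "async": {"log_dir": output_dir, "run_name": run_name},
        "tensorboard": {"log_dir": tensorboard_log_dir or output_dir, "run_name": run_name},
        "wandb": {"project": wandb_project, "run_name": wandb_run_name or run_name},
        "mlflow": {"tracking_uri": mlflow_tracking_uri, "run_name": mlflow_run_name or run_name},
    }
    return {name: handler_config[name] for name in loggers}
```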
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Around line 51-53: The pyproject defines optional extras
optional-dependencies.mlflow, optional-dependencies.wandb, and
optional-dependencies.tensorboard but the referenced files
requirements-mlflow.txt, requirements-wandb.txt, and
requirements-tensorboard.txt are missing; add those three files to the repo and
populate each with the correct dependency pins (e.g., mlflow and its compatible
extras in requirements-mlflow.txt, wandb and its compatible version in
requirements-wandb.txt, and tensorboard or tensorboard-related packages in
requirements-tensorboard.txt) following the same pinning style as the existing
optional requirements files (cuda/rocm/hpu/deepspeed) so pip install
'instructlab-training[mlflow]' / '[wandb]' / '[tensorboard]' succeeds.
…logging

- Add tensorboard_log_dir field to TrainingArgs in config.py
- Update setup_metric_logger to use tensorboard_log_dir when provided
- Add CLI argument for tensorboard_log_dir
- Wire tensorboard_log_dir through run_training() to subprocess command

This allows users to specify a custom directory for TensorBoard logs, defaulting to output_dir if not specified.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace defensive getattr() with direct attribute access in main_ds.py since args are guaranteed to exist from argparse defaults
- Remove unused log_dir parameter from MLflowHandler
- Add debug logging for non-numeric metrics skipped by MLflowHandler

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 891-896: The doc note incorrectly states the async logger default
template; either update the docstring to state that AsyncStructuredHandler uses
"training_params_and_metrics_global{rank}" when no run name is provided, or
change the AsyncStructuredHandler default to match "{time}_rank{rank}". Locate
the AsyncStructuredHandler class and its default run-name logic (and the
surrounding note in logger.py) and make them consistent: modify the note text to
the correct template string or update the AsyncStructuredHandler default
run-name value so both the implementation and the comment match.
- Around line 674-689: The _setup() logic currently sets mlflow.set_tracking_uri
and mlflow.set_experiment before checking mlflow.active_run(), causing explicit
tracking_uri and experiment_name in the handler to be silently ignored when an
active run exists; modify _setup() to check mlflow.active_run() first, and if
active is not None and (self.tracking_uri or self.experiment_name) emit a clear
warning (e.g., self.logger.warning or warnings.warn) that those settings will be
ignored, and only call mlflow.set_tracking_uri and mlflow.set_experiment when
starting a new run (i.e., in the branch where you assign self._mlflow_run =
mlflow.start_run and set self._owns_mlflow_run).
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/instructlab/training/logger.py (1)
883-1012: ⚠️ Potential issue | 🟡 Minor

Add `WANDB_ENTITY` to the auto-detection condition for wandb logging.

The environment variable names are correct (`MLFLOW_TRACKING_URI`, `MLFLOW_EXPERIMENT_NAME`, `WANDB_PROJECT` all verified), but the auto-detection logic for wandb is incomplete. It checks `wandb_project` and `os.environ.get("WANDB_PROJECT")` but omits `os.environ.get("WANDB_ENTITY")`. Since the WandbHandler accepts the entity parameter and WANDB_ENTITY is an official W&B environment variable, the detection condition should also check for it:

```python
if wandb_project or os.environ.get("WANDB_PROJECT") or os.environ.get("WANDB_ENTITY"):
    detected_loggers.append("wandb")
```

This ensures wandb logging is auto-enabled when the WANDB_ENTITY environment variable is set, consistent with how MLflow auto-detection includes multiple environment variables.
Maxusmusti left a comment
LGTM, all my review comments have been addressed
Summary

- Expose logging configuration in `TrainingArgs` for programmatic API usage
- Add `wandb_project` and `wandb_entity` fields to `TrainingArgs` for consistency

Changes
New `TrainingArgs` fields

| Field | Type | Default | Notes |
|-------------------------|-------------|---------|-----------------------------------------------|
| logger_type | str | "async" | tensorboard, wandb, mlflow, async |
| run_name | str \| None | None | Supports placeholders ({time}, {rank}, etc.) |
| mlflow_tracking_uri | str \| None | None | |
| mlflow_experiment_name | str \| None | None | |
| wandb_project | str \| None | None | |
| wandb_entity | str \| None | None | |
New `MLflowHandler` class

Implements the same interface as `TensorBoardHandler` and `WandbHandler`:

- Metrics via `mlflow.log_metrics()`
- Parameters via `mlflow.log_params()`
- `tracking_uri` and `experiment_name` configuration
- Falls back to the `MLFLOW_TRACKING_URI` and `MLFLOW_EXPERIMENT_NAME` env vars
Updated `run_training()` API

Previously, `run_training()` hardcoded the logger to "async". Now it reads from `TrainingArgs`.

Example Usage
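
The original example block did not survive; below is a hedged reconstruction based on the field table above. The non-logging `TrainingArgs` fields are elided placeholders and may not match the library's required arguments exactly.

```python
from instructlab.training import TorchrunArgs, TrainingArgs, run_training

train_args = TrainingArgs(
    # model/data/optimization fields elided; placeholders only
    ckpt_output_dir="./checkpoints",
    # new logging fields from this PR
    logger_type="mlflow",               # default is "async"
    run_name="granite_sft_{time}",      # supports {time}, {rank} placeholders
    mlflow_tracking_uri="http://localhost:5000",
    mlflow_experiment_name="granite-sft",
)

run_training(torch_args=TorchrunArgs(nproc_per_node=8), train_args=train_args)
```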
Test plan

- wandb_project / wandb_entity fields
- async logger_type
- … enables multiple backends simultaneously

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Chores