Add MLflow support and expose logging configuration in TrainingArgs#680
Conversation
📝 Walkthrough

Adds optional MLflow, Weights & Biases, and TensorBoard logging: new `TrainingArgs` fields, an `MLflowHandler` and MLflow wiring in the metric logger, an expanded `setup_metric_logger` signature and CLI flags, optional dependency files, and README install notes.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as main_ds.py
    participant Setup as setup_metric_logger()
    participant Logger as MetricLogger
    participant MLflow as MLflow
    participant WandB as WandB
    participant TB as TensorBoard
    User->>CLI: start training with logging flags
    CLI->>Setup: setup_metric_logger(output_dir, mlflow_..., wandb_..., tensorboard_log_dir)
    Setup->>Logger: configure handlers (MLflowHandler, WandB, TensorBoard, Async)
    Logger->>MLflow: _setup() -> set_tracking_uri / set_experiment / start_run
    Logger->>WandB: init run (project, entity, run_name)
    Logger->>TB: create writer (log_dir)
    CLI->>Logger: emit metrics/hparams LogRecord
    Logger->>MLflow: emit -> log_metrics / log_params
    Logger->>WandB: emit -> log metrics/params
    Logger->>TB: emit -> write scalars
    CLI->>Logger: shutdown
    Logger->>MLflow: close -> end_run()
    Logger->>WandB: finish run
    Logger->>TB: close writer
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.
## New Parameters
- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration
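
For illustration, a hedged sketch of how these flat kwargs might be passed to `sft()`. The import path and the non-logging arguments (model/data paths, output dir) are placeholders, not the real API:

```python
from training_hub import sft  # hypothetical import path

sft(
    model_path="./model",             # placeholder
    data_path="./data/train.jsonl",   # placeholder
    ckpt_output_dir="./checkpoints",  # placeholder
    # logging configuration exposed as flat kwargs
    run_name="sft_{time}_rank{rank}",
    log_level="INFO",
    logging_steps=10,
    wandb_project="my-project",
    wandb_entity="my-team",
    tensorboard_log_dir="./logs/tensorboard",
    mlflow_tracking_uri="http://localhost:5000",
    mlflow_experiment_name="sft-experiments",
)
```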
## Backend Support
| Logger | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb | Yes | Yes | Yes |
| tensorboard | Yes | No | Yes |
| mlflow | Yes | No | Yes |
| jsonl | Yes | Yes | No |
OSFT emits warnings for unsupported loggers/params and continues.
Depends on: instructlab/training#680
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/instructlab/training/main_ds.py`:
- Around line 275-283: The call to setup_metric_logger in main() uses
unnecessary defensive getattr() for mlflow/wandb fields; replace getattr(args,
"mlflow_tracking_uri", None), getattr(args, "mlflow_experiment_name", None),
getattr(args, "wandb_project", None), and getattr(args, "wandb_entity", None)
with direct attribute access args.mlflow_tracking_uri,
args.mlflow_experiment_name, args.wandb_project, and args.wandb_entity
respectively so it matches the pattern used in run_training() and with
train_args.
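
As a minimal before/after sketch of that change (only the keyword arguments come from the prompt; the surrounding call shape in `main()` is assumed):

```python
# Before: defensive lookups, even though argparse always defines these attributes
setup_metric_logger(
    args.output_dir,
    mlflow_tracking_uri=getattr(args, "mlflow_tracking_uri", None),
    mlflow_experiment_name=getattr(args, "mlflow_experiment_name", None),
    wandb_project=getattr(args, "wandb_project", None),
    wandb_entity=getattr(args, "wandb_entity", None),
)

# After: direct attribute access, matching run_training() and train_args usage
setup_metric_logger(
    args.output_dir,
    mlflow_tracking_uri=args.mlflow_tracking_uri,
    mlflow_experiment_name=args.mlflow_experiment_name,
    wandb_project=args.wandb_project,
    wandb_entity=args.wandb_entity,
)
```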
🧹 Nitpick comments (2)
src/instructlab/training/logger.py (2)
638-665: Unused `log_dir` parameter.

The `log_dir` parameter is stored as `self.log_dir` but never used in `_setup()` or elsewhere. The docstring mentions it's "used as artifact location" but the implementation doesn't pass it to MLflow. Either use it to set the artifact location or remove it to avoid confusion.

♻️ Option 1: Use log_dir as artifact location

```diff
 def _setup(self):
     """Initialize the MLflow run with the configured settings."""
     if mlflow is None:
         msg = (
             "Could not initialize MLflowHandler because package mlflow could not be imported.\n"
             "Please ensure it is installed by running 'pip install mlflow'"
         )
         raise RuntimeError(msg)
     if self.tracking_uri:
         mlflow.set_tracking_uri(self.tracking_uri)
     if self.experiment_name:
-        mlflow.set_experiment(self.experiment_name)
+        mlflow.set_experiment(
+            self.experiment_name,
+            artifact_location=str(self.log_dir),
+        )
     self._mlflow_run = mlflow.start_run(
         run_name=self.run_name, **self.mlflow_init_kwargs
     )
```

♻️ Option 2: Remove unused parameter

```diff
 def __init__(
     self,
     level: int = logging.INFO,
     run_name: str | None = None,
-    log_dir: str | os.PathLike = "logs",
     tracking_uri: str | None = None,
     experiment_name: str | None = None,
     **mlflow_init_kwargs: Any,
 ):
     """Initialize the MLflow logger and check for required dependencies.

     Args:
         level: The logging level for this handler
         run_name: Name of the run, can contain placeholders
-        log_dir: Directory where MLflow artifacts should be stored (used as artifact location)
         tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000")
         experiment_name: Name of the MLflow experiment
         **mlflow_init_kwargs: Additional keyword arguments passed to mlflow.start_run()
     """
     super().__init__(level)
     self.run_name = _substitute_placeholders(run_name)
-    self.log_dir = Path(log_dir)
     self.tracking_uri = tracking_uri
     self.experiment_name = experiment_name
     self.mlflow_init_kwargs = mlflow_init_kwargs.copy()
     self._mlflow_run = None
```

Note: If removing `log_dir`, also update `setup_metric_logger` to not pass it to the MLflow handler config.
711-721: Consider adding a debug log for skipped non-numeric metrics.

Non-numeric values are silently skipped. For consistency with `TensorBoardHandler` (which warns on type errors), consider adding a debug-level message to help users understand why certain values aren't appearing in MLflow metrics.

♻️ Proposed change

```diff
 # Filter to only numeric values for metrics
 metrics_dict = {}
 for k, v in flat_dict.items():
     try:
         metrics_dict[k] = float(v)
     except (ValueError, TypeError):
         # Skip non-numeric values for metrics
-        pass
+        warnings.warn(
+            f"MLflowHandler skipping non-numeric metric '{k}' with value {type(v).__name__}",
+            stacklevel=2,
+        )
```
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/instructlab/training/logger.py (1)
862-979: ⚠️ Potential issue | 🟡 Minor

Enable MLflow when experiment/run name is provided (not only tracking URI).

Currently, MLflow logging won't activate if users only set `mlflow_experiment_name` or `mlflow_run_name`, even though MLflow can operate with local filesystem storage when no tracking URI is specified. The detection logic should include these fields and the `MLFLOW_EXPERIMENT_NAME` environment variable to align with the design pattern used for other backends (e.g., wandb is enabled when `wandb_project` is set).

Proposed fix

```diff
-    if mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"):
+    if (
+        mlflow_tracking_uri
+        or mlflow_experiment_name
+        or mlflow_run_name
+        or os.environ.get("MLFLOW_TRACKING_URI")
+        or os.environ.get("MLFLOW_EXPERIMENT_NAME")
+    ):
         detected_loggers.append("mlflow")
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/instructlab/training/logger.py (2)
1-46: ⚠️ Potential issue | 🟡 Minor

Update the module docstring example to match the new setup API.

The example still uses `loggers=[...]` and `run_name=...`, but `setup_metric_logger` now relies on auto-detection and backend-specific kwargs. This will mislead users.

📝 Suggested docstring fix

```diff
-    # Setup logging with TensorBoard and wandb
-    setup_metric_logger(
-        loggers=["tensorboard", "wandb"],
-        run_name="my_training_run",
-        output_dir="logs"
-    )
+    # Setup logging with TensorBoard and wandb (auto-detected)
+    setup_metric_logger(
+        output_dir="logs",
+        tensorboard_log_dir="logs/tensorboard",
+        wandb_project="my_project",
+        wandb_run_name="my_training_run",
+    )
```
862-985: ⚠️ Potential issue | 🟠 Major

Run name no longer wired to async/tensorboard (likely regression).

`setup_metric_logger` used to accept `run_name`, but it's now missing and both async/tensorboard handlers hardcode `run_name=None`. This means the new `TrainingArgs.run_name` won't affect those backends. If the intent is a global run name, please reintroduce it and apply it as a fallback for backend-specific names.

✅ Suggested wiring

```diff
 def setup_metric_logger(
     output_dir,
     *,
+    run_name: str | None = None,
     mlflow_tracking_uri: str | None = None,
     mlflow_experiment_name: str | None = None,
     mlflow_run_name: str | None = None,
     wandb_project: str | None = None,
     wandb_entity: str | None = None,
     wandb_run_name: str | None = None,
     tensorboard_log_dir: str | None = None,
 ):
@@
     "handlers": {
         "async": {
             "()": AsyncStructuredHandler,
             "log_dir": output_dir,
-            "run_name": None,  # Uses default template
+            "run_name": run_name,  # Uses default template if None
             "filters": async_filters,
         },
         "tensorboard": {
             "()": TensorBoardHandler,
             "log_dir": tensorboard_log_dir or output_dir,
-            "run_name": None,  # Uses default template
+            "run_name": run_name,  # Uses default template if None
             "filters": ["is_mapping", "is_rank0"],
         },
         "wandb": {
             "()": WandbHandler,
             "log_dir": output_dir,
-            "run_name": wandb_run_name,
+            "run_name": wandb_run_name or run_name,
             "project": wandb_project,
             "entity": wandb_entity,
             "filters": ["is_mapping", "is_rank0"],
         },
         "mlflow": {
             "()": MLflowHandler,
-            "run_name": mlflow_run_name,
+            "run_name": mlflow_run_name or run_name,
             "tracking_uri": mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"),
             "experiment_name": mlflow_experiment_name or os.environ.get("MLFLOW_EXPERIMENT_NAME"),
             "filters": ["is_mapping", "is_rank0"],
         },
```
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 664-681: The _setup method currently calls mlflow.start_run()
unconditionally which raises if an MLflow run is already active; update _setup
to check mlflow.active_run() first and handle both cases: if mlflow.active_run()
exists, set self._mlflow_run = mlflow.active_run() to reuse it, otherwise call
mlflow.start_run(run_name=self.run_name, **self.mlflow_init_kwargs); optionally,
if you prefer nesting instead of reusing, call mlflow.start_run(...,
nested=True) when an active run exists—make this change around the existing
self._mlflow_run assignment and mlflow.start_run call in _setup.
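
As a sketch of the reuse-or-start behaviour described above (assuming the handler's existing `run_name` / `mlflow_init_kwargs` attributes and `mlflow` imported at module level; not the merged implementation):

```python
def _setup(self):
    """Reuse an already-active MLflow run instead of letting start_run() raise."""
    active = mlflow.active_run()
    if active is not None:
        # Adopt the caller's run; alternatively pass nested=True to start_run()
        # if a child run is preferred.
        self._mlflow_run = active
    else:
        self._mlflow_run = mlflow.start_run(
            run_name=self.run_name, **self.mlflow_init_kwargs
        )
```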
Pass tensorboard_log_dir through to TrainingArgs
Update SFT backend to pass tensorboard_log_dir to TrainingArgs now that
instructlab-training supports configurable TensorBoard log directories.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable MLflow support for OSFT backend
- Add "mlflow" to SUPPORTED_LOGGERS
- Remove mlflow params from UNSUPPORTED_PARAMS
- Add mlflow_run_name parameter throughout
- Update docstrings to reflect MLflow is now supported
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address PR review comments
- Replace hardcoded data path with required=True in sft_mlflow_example.py
- Add stacklevel=2 to warnings.warn calls in osft.py
- Rename ambiguous loop variable 'l' to 'lg' in sft.py
- Add warnings for unsupported logging params in sft.py backend
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add OSFT MLflow example script
Similar to sft_mlflow_example.py but for OSFT training with
unfreeze_rank_ratio and other OSFT-specific parameters.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add logging and experiment tracking documentation
Create comprehensive documentation for the unified logging API:
- MLflow, Weights & Biases, TensorBoard, and JSONL logging
- Configuration precedence (kwarg > env var > defaults)
- Backend-specific notes and supported loggers
- Example usage for all algorithms (SFT, OSFT, LoRA)
- Run naming with placeholder support
Update sidebar to include the new logging guide.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove duplicate warnings import in sft.py
The warnings module is already imported at module level (line 2).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address review nitpicks in docs and examples
- Add language specifier to fenced code block in logging.md
- Remove unused result variable in example scripts
- Remove unnecessary wandb_run_name from MLflow-focused OSFT example
- Add comment clarifying /dev/shm usage for Linux shared memory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Auto-detect loggers from config parameters
Loggers are now automatically enabled based on their configuration:
- mlflow_tracking_uri or MLFLOW_TRACKING_URI env → enables MLflow
- wandb_project or WANDB_PROJECT env → enables W&B
- tensorboard_log_dir → enables TensorBoard (SFT only)
The 'loggers' parameter is deprecated and emits a DeprecationWarning
when used. Updated API documentation with new logging configuration
sections and removed outdated logging guide.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
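
Roughly, the detection this commit describes amounts to something like the following sketch (the helper name is hypothetical; only the rules come from the commit message):

```python
import os

def _detect_loggers(mlflow_tracking_uri=None, wandb_project=None, tensorboard_log_dir=None):
    detected = []
    if mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"):
        detected.append("mlflow")
    if wandb_project or os.environ.get("WANDB_PROJECT"):
        detected.append("wandb")
    if tensorboard_log_dir:  # SFT only, per the commit message
        detected.append("tensorboard")
    return detected
```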
Address PR review feedback
- Fix markdown table alignment in docs for markdownlint compliance
- Change osft_mlflow_example.py default checkpoint dir from /tmp to ./
- Document that run_name is not supported by mini-trainer backend
- Add stacklevel=2 to block_size warning in osft.py
- Update loggers docstring to indicate deprecated status
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
remove extraneous api kwargs
removes more extraneous api additions
reintroduce mlflow_run_name into lora
docstrings
drop formatting changes
read envs when present
review comments
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 732-737: The handler currently always calls mlflow.end_run() in
close(), which can end runs it didn't start; add a boolean flag (e.g.,
self._owns_mlflow_run) that is set to True only when this handler creates/starts
an MLflow run (where mlflow.start_run() is invoked or when self._mlflow_run is
assigned) and False when reusing an existing active run, then modify close() to
call mlflow.end_run() only if self._owns_mlflow_run is True and clear both
self._mlflow_run and self._owns_mlflow_run during cleanup; ensure the flag is
updated wherever the run is started or adopted so ownership is tracked
correctly.
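
A sketch of the ownership tracking this prompt describes, building on the reuse logic above; the method names and the base-class `close()` call are assumptions:

```python
def _setup(self):
    active = mlflow.active_run()
    if active is not None:
        self._mlflow_run = active
        self._owns_mlflow_run = False  # adopted an existing run; don't end it later
    else:
        self._mlflow_run = mlflow.start_run(
            run_name=self.run_name, **self.mlflow_init_kwargs
        )
        self._owns_mlflow_run = True

def close(self):
    # Only end runs this handler started itself.
    if getattr(self, "_owns_mlflow_run", False):
        mlflow.end_run()
    self._mlflow_run = None
    self._owns_mlflow_run = False
    super().close()
```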
🧹 Nitpick comments (1)
src/instructlab/training/logger.py (1)
942-946: Simplify the async logger inclusion logic.

The current logic is a bit convoluted. When `detected_loggers` is empty, line 943 sets `loggers = ["async"]`, then lines 945-946 check if async is not in loggers (which it already is). When `detected_loggers` is non-empty, it duplicates the list reference then appends.

♻️ Proposed simplification

```diff
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging
+    loggers = detected_loggers + ["async"] if detected_loggers else ["async"]
```

Or even simpler:

```diff
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging alongside other loggers
+    loggers = [*detected_loggers, "async"]
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 873-953: setup_metric_logger currently auto-enables backends based
only on backend-specific args and unconditionally ensures "async" is present,
and it doesn't propagate a global run_name to all backends; update it to accept
and honor an explicit logger selection (e.g., logger_type or logger_list
argument) so callers can request "tensorboard" or "async" even when
backend-specific fields are empty, only append "async" when explicitly requested
(or present in the explicit list), and ensure a global run_name parameter (or
run_name fallback) is forwarded into backend-specific settings (mlflow_run_name,
wandb_run_name, tensorboard handler config and any async file logger) so every
backend receives the same run name; change logic around detected_loggers/loggers
and the wiring to backend initializers (functions that configure
mlflow/wandb/tensorboard and the async handler) accordingly and apply the same
fix to the analogous block referenced around 968-997.
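
For illustration, a sketch of the behaviour this prompt asks for. The parameter name `logger_list` and the returned config shape are hypothetical; only the selection and run-name fallback rules come from the prompt:

```python
def setup_metric_logger(
    output_dir,
    *,
    logger_list=None,   # hypothetical explicit selection, e.g. ["tensorboard", "async"]
    run_name=None,      # global run name, forwarded to every backend as a fallback
    mlflow_tracking_uri=None,
    mlflow_run_name=None,
    wandb_project=None,
    wandb_run_name=None,
    tensorboard_log_dir=None,
):
    """Sketch only; not the merged signature."""
    if logger_list:
        # Honor the caller's explicit choice, including "async"/"tensorboard",
        # even when no backend-specific fields are set.
        loggers = list(logger_list)
    else:
        # Auto-detect from backend-specific settings; "async" is only included
        # when explicitly requested, per the review prompt.
        loggers = []
        if mlflow_tracking_uri:
            loggers.append("mlflow")
        if wandb_project:
            loggers.append("wandb")
        if tensorboard_log_dir:
            loggers.append("tensorboard")

    # Forward the global run name into each backend's config as a fallback.
    handler_config = {
        "async": {"log_dir": output_dir, "run_name": run_name},
        "tensorboard": {"log_dir": tensorboard_log_dir or output_dir, "run_name": run_name},
        "wandb": {"project": wandb_project, "run_name": wandb_run_name or run_name},
        "mlflow": {"tracking_uri": mlflow_tracking_uri, "run_name": mlflow_run_name or run_name},
    }
    return {name: handler_config[name] for name in loggers}
```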
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Around line 51-53: The pyproject defines optional extras
optional-dependencies.mlflow, optional-dependencies.wandb, and
optional-dependencies.tensorboard but the referenced files
requirements-mlflow.txt, requirements-wandb.txt, and
requirements-tensorboard.txt are missing; add those three files to the repo and
populate each with the correct dependency pins (e.g., mlflow and its compatible
extras in requirements-mlflow.txt, wandb and its compatible version in
requirements-wandb.txt, and tensorboard or tensorboard-related packages in
requirements-tensorboard.txt) following the same pinning style as the existing
optional requirements files (cuda/rocm/hpu/deepspeed) so pip install
'instructlab-training[mlflow]' / '[wandb]' / '[tensorboard]' succeeds.
…logging

- Add tensorboard_log_dir field to TrainingArgs in config.py
- Update setup_metric_logger to use tensorboard_log_dir when provided
- Add CLI argument for tensorboard_log_dir
- Wire tensorboard_log_dir through run_training() to subprocess command

This allows users to specify a custom directory for TensorBoard logs, defaulting to output_dir if not specified.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace defensive getattr() with direct attribute access in main_ds.py since args are guaranteed to exist from argparse defaults
- Remove unused log_dir parameter from MLflowHandler
- Add debug logging for non-numeric metrics skipped by MLflowHandler

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 891-896: The doc note incorrectly states the async logger default
template; either update the docstring to state that AsyncStructuredHandler uses
"training_params_and_metrics_global{rank}" when no run name is provided, or
change the AsyncStructuredHandler default to match "{time}_rank{rank}". Locate
the AsyncStructuredHandler class and its default run-name logic (and the
surrounding note in logger.py) and make them consistent: modify the note text to
the correct template string or update the AsyncStructuredHandler default
run-name value so both the implementation and the comment match.
- Around line 674-689: The _setup() logic currently sets mlflow.set_tracking_uri
and mlflow.set_experiment before checking mlflow.active_run(), causing explicit
tracking_uri and experiment_name in the handler to be silently ignored when an
active run exists; modify _setup() to check mlflow.active_run() first, and if
active is not None and (self.tracking_uri or self.experiment_name) emit a clear
warning (e.g., self.logger.warning or warnings.warn) that those settings will be
ignored, and only call mlflow.set_tracking_uri and mlflow.set_experiment when
starting a new run (i.e., in the branch where you assign self._mlflow_run =
mlflow.start_run and set self._owns_mlflow_run).
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/instructlab/training/logger.py (1)
883-1012: ⚠️ Potential issue | 🟡 Minor

Add `WANDB_ENTITY` to the auto-detection condition for wandb logging.

The environment variable names are correct (`MLFLOW_TRACKING_URI`, `MLFLOW_EXPERIMENT_NAME`, `WANDB_PROJECT` all verified), but the auto-detection logic for wandb is incomplete. It checks `wandb_project` and `os.environ.get("WANDB_PROJECT")` but omits `os.environ.get("WANDB_ENTITY")`. Since the WandbHandler accepts the entity parameter and WANDB_ENTITY is an official W&B environment variable, the detection condition should also check for it:

```python
if wandb_project or os.environ.get("WANDB_PROJECT") or os.environ.get("WANDB_ENTITY"):
    detected_loggers.append("wandb")
```

This ensures wandb logging is auto-enabled when the WANDB_ENTITY environment variable is set, consistent with how MLflow auto-detection includes multiple environment variables.
Maxusmusti left a comment
LGTM, all my review comments have been addressed
Summary

- Expose logging configuration in `TrainingArgs` for programmatic API usage
- Add `wandb_project` and `wandb_entity` fields to `TrainingArgs` for consistency

Changes
New `TrainingArgs` fields

| Field | Type | Default | Notes |
|-------------------------|-------------|---------|-----------------------------------------------|
| logger_type | str | "async" | tensorboard, wandb, mlflow, async |
| run_name | str \| None | None | Supports placeholders ({time}, {rank}, etc.) |
| mlflow_tracking_uri | str \| None | None | |
| mlflow_experiment_name | str \| None | None | |
| wandb_project | str \| None | None | |
| wandb_entity | str \| None | None | |
New `MLflowHandler` class

Implements the same interface as `TensorBoardHandler` and `WandbHandler`:

- Metrics via `mlflow.log_metrics()`
- Parameters via `mlflow.log_params()`
- `tracking_uri` and `experiment_name` configuration
- Falls back to the `MLFLOW_TRACKING_URI` and `MLFLOW_EXPERIMENT_NAME` env vars
Updated `run_training()` API

Previously, `run_training()` hardcoded the logger to "async". Now it reads from `TrainingArgs`.

Example Usage
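
The original example block did not survive; below is a hedged reconstruction based on the field table above. The non-logging `TrainingArgs` fields are elided placeholders and may not match the library's required arguments exactly.

```python
from instructlab.training import TorchrunArgs, TrainingArgs, run_training

train_args = TrainingArgs(
    # model/data/optimization fields elided; placeholders only
    ckpt_output_dir="./checkpoints",
    # new logging fields from this PR
    logger_type="mlflow",               # default is "async"
    run_name="granite_sft_{time}",      # supports {time}, {rank} placeholders
    mlflow_tracking_uri="http://localhost:5000",
    mlflow_experiment_name="granite-sft",
)

run_training(torch_args=TorchrunArgs(nproc_per_node=8), train_args=train_args)
```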
Test plan

- wandb_project / wandb_entity fields
- async logger_type
- … enables multiple backends simultaneously

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Chores