Conversation
Contributor
Co-authored-by: leofang <5534781+leofang@users.noreply.github.com>
Copilot
AI
changed the title
[WIP] CI: Consolidate the test matrix in a json file
CI: Consolidate test matrix configurations into ci/test-matrix.json
Aug 22, 2025
leofang
reviewed
Aug 22, 2025
Hard-code all GPU and ARCH values in test-matrix.json with 6 fields per entry
Co-authored-by: leofang <5534781+leofang@users.noreply.github.com>
Copilot
AI
changed the title
CI: Consolidate test matrix configurations into ci/test-matrix.json
CI: Consolidate test matrix configurations into ci/test-matrix.json with hard-coded values
Aug 22, 2025
leofang
reviewed
Aug 24, 2025
Update Windows test matrix with a100 GPU and latest-1 driver, configure self-hosted runners
Co-authored-by: leofang <5534781+leofang@users.noreply.github.com>
Copilot
AI
changed the title
CI: Consolidate test matrix configurations into ci/test-matrix.json with hard-coded values
CI: Consolidate test matrix configurations into ci/test-matrix.json with hard-coded values and Windows self-hosted runners
Aug 24, 2025
leofang
reviewed
Aug 24, 2025
Member
/ok to test eed0b71
leofang
reviewed
Aug 24, 2025
Co-authored-by: leofang <5534781+leofang@users.noreply.github.com>
Revert Windows workflow to GitHub-hosted runners with TODO comments for future self-hosted migration
Co-authored-by: leofang <5534781+leofang@users.noreply.github.com>
Copilot
AI
changed the title
CI: Consolidate test matrix configurations into ci/test-matrix.json with hard-coded values, Windows self-hosted runners, proxy cache support, and performance optimizations
CI: Consolidate test matrix configurations into ci/test-matrix.json with hard-coded values, optimized checkout, and prepared Windows self-hosted runner migration
Sep 5, 2025
leofang
approved these changes
Sep 5, 2025
Member
leofang
left a comment
Now that we confirmed the self-hosted Win runners work, I have temporarily reverted back to GH-hosted runners to make this PR mergeable. I still want the JSON refactoring to be merged.
Member
pre-commit.ci autofix
Member
/ok to test 905ad53
Member
This is ready.
cryos
previously approved these changes
Sep 5, 2025
Collaborator
cryos
left a comment
LGTM, one small nit inline, but I think it is fine.
leofang
reviewed
Sep 5, 2025
Member
Backport failed for 12.9.x. Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 12.9.x
git worktree add -d .worktree/backport-889-to-12.9.x origin/12.9.x
cd .worktree/backport-889-to-12.9.x
git switch --create backport-889-to-12.9.x
git cherry-pick -x b4644f2870afe1ed43e8094c8dfac5a99870702a
kkraus14
pushed a commit
to kkraus14/cuda-python
that referenced
this pull request
Sep 10, 2025
CI: Consolidate test matrix configurations into ci/test-matrix.json with hard-coded values, optimized checkout, and prepared Windows self-hosted runner migration (NVIDIA#889)

* Initial plan
* Consolidate test matrices from workflows into ci/test-matrix.json
* Hard-code all GPU and ARCH values in test-matrix.json with 6 fields per entry
* Update Windows test matrix with a100 GPU and latest-1 driver, configure self-hosted runners
* fix
* Revert eed0b71 and change Windows DRIVER from latest-1 to latest
* Add proxy cache setup to Windows workflow for self-hosted runners
* Remove Git for Windows and gh CLI installation steps, add T4 GPU support to Windows matrix
* Set fetch-depth: 1 for checkout steps and favor L4/T4 over A100 GPUs for Windows testing
* Revert Windows workflow to GitHub-hosted runners with TODO comments for future self-hosted migration
* [pre-commit.ci] auto code formatting
* Revert Win runner name change for now

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: leofang <5534781+leofang@users.noreply.github.com>
Co-authored-by: Leo Fang <leof@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
mdboom
pushed a commit
to mdboom/cuda-python
that referenced
this pull request
Sep 10, 2025
CI: Consolidate test matrix configurations into ci/test-matrix.json with hard-coded values, optimized checkout, and prepared Windows self-hosted runner migration (NVIDIA#889)

* Initial plan
* Consolidate test matrices from workflows into ci/test-matrix.json
* Hard-code all GPU and ARCH values in test-matrix.json with 6 fields per entry
* Update Windows test matrix with a100 GPU and latest-1 driver, configure self-hosted runners
* fix
* Revert eed0b71 and change Windows DRIVER from latest-1 to latest
* Add proxy cache setup to Windows workflow for self-hosted runners
* Remove Git for Windows and gh CLI installation steps, add T4 GPU support to Windows matrix
* Set fetch-depth: 1 for checkout steps and favor L4/T4 over A100 GPUs for Windows testing
* Revert Windows workflow to GitHub-hosted runners with TODO comments for future self-hosted migration
* [pre-commit.ci] auto code formatting
* Revert Win runner name change for now

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: leofang <5534781+leofang@users.noreply.github.com>
Co-authored-by: Leo Fang <leof@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
leofang
added a commit
that referenced
this pull request
Sep 11, 2025
* bump all CI jobs to CUDA 12.9.1
* CI: Consolidate test matrix configurations into ci/test-matrix.json with hard-coded values, optimized checkout, and prepared Windows self-hosted runner migration (#889)
  * Initial plan
  * Consolidate test matrices from workflows into ci/test-matrix.json
  * Hard-code all GPU and ARCH values in test-matrix.json with 6 fields per entry
  * Update Windows test matrix with a100 GPU and latest-1 driver, configure self-hosted runners
  * fix
  * Revert eed0b71 and change Windows DRIVER from latest-1 to latest
  * Add proxy cache setup to Windows workflow for self-hosted runners
  * Remove Git for Windows and gh CLI installation steps, add T4 GPU support to Windows matrix
  * Set fetch-depth: 1 for checkout steps and favor L4/T4 over A100 GPUs for Windows testing
  * Revert Windows workflow to GitHub-hosted runners with TODO comments for future self-hosted migration
  * [pre-commit.ci] auto code formatting
  * Revert Win runner name change for now
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: leofang <5534781+leofang@users.noreply.github.com>
  Co-authored-by: Leo Fang <leof@nvidia.com>
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* forgot to add windows
* rerun codegen with 12.9.1 and update result/error explanations
* First stab at the filter for CUDA < 13 in CI
* Get data from the top-level array
* Use the map function on select output
* CI: Move to self-hosted Windows GPU runners
  Migrate the Windows testing to use the new NV GHA runners. Cherry-pick #958.

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: leofang <5534781+leofang@users.noreply.github.com>
Co-authored-by: Leo Fang <leof@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Marcus D. Hanwell <mhanwell@nvidia.com>
This PR consolidates the hardcoded test matrices from GitHub workflow files into a centralized JSON configuration file, addressing the issue where test matrix data was scattered and embedded within execution logic.
Changes Made
New file:
- ci/test-matrix.json: the consolidated test matrix, with six hard-coded fields per entry (ARCH, PY_VER, CUDA_VER, LOCAL_CTK, GPU, DRIVER)

Updated workflows:
- .github/workflows/test-wheel-linux.yml: replaced the 58-line hardcoded matrix with JSON-based logic, removed runtime GPU substitution, optimized checkout with fetch-depth: 1, and configured self-hosted runners with proxy cache support
- .github/workflows/test-wheel-windows.yml: replaced the 12-line hardcoded matrix with JSON-based logic, added the missing GPU/DRIVER fields, prepared for a future self-hosted runner migration with TODO comments, and optimized checkout with fetch-depth: 1 for the compute-matrix job

Matrix Structure
The JSON now contains explicit entries for each architecture combination:
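As a minimal sketch of that structure (the grouping keys, version numbers, GPU models, and driver values below are illustrative assumptions, not the file's actual entries), each entry carries the same six hard-coded fields:

```json
{
  "linux-amd64": [
    { "ARCH": "amd64", "PY_VER": "3.12", "CUDA_VER": "12.9.0", "LOCAL_CTK": "1", "GPU": "l4", "DRIVER": "latest" },
    { "ARCH": "amd64", "PY_VER": "3.13", "CUDA_VER": "12.9.0", "LOCAL_CTK": "0", "GPU": "l4", "DRIVER": "latest" }
  ],
  "linux-arm64": [
    { "ARCH": "arm64", "PY_VER": "3.12", "CUDA_VER": "12.9.0", "LOCAL_CTK": "1", "GPU": "a100", "DRIVER": "latest" }
  ],
  "windows-amd64": [
    { "ARCH": "amd64", "PY_VER": "3.12", "CUDA_VER": "12.9.0", "LOCAL_CTK": "1", "GPU": "t4", "DRIVER": "latest" }
  ]
}
```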
All entries follow the consistent 6-field structure with hard-coded values, eliminating runtime evaluation and improving maintainability.
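For orientation, the "JSON-based logic" in a compute-matrix job can be as small as a single jq step that publishes the selected entries as a job output. This is only a sketch of that pattern under the grouping assumed above, not the exact step in the workflows:

```yaml
jobs:
  compute-matrix:
    runs-on: ubuntu-latest
    outputs:
      MATRIX: ${{ steps.compute.outputs.MATRIX }}
    steps:
      - uses: actions/checkout@v4
      - name: Read the test matrix for this platform
        id: compute
        run: |
          # Select this platform's entries from the consolidated file and
          # publish them as a compact JSON array for the test job to consume.
          MATRIX="$(jq -c '."linux-amd64"' ci/test-matrix.json)"
          echo "MATRIX=${MATRIX}" >> "$GITHUB_OUTPUT"
```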
Self-Hosted Runner Configuration
Linux workflow uses configurable self-hosted runners (sketched below):
- Runner label: "linux-${{ matrix.ARCH }}-gpu-${{ matrix.GPU }}-${{ matrix.DRIVER }}-1"
- Proxy cache configured via nv-gha-runners/setup-proxy-cache@main

Windows workflow remains on GitHub-hosted runners, with migration preparation:
- Current runner: 'cuda-python-windows-gpu-github'
- Prepared label: "windows-${{ matrix.ARCH }}-gpu-${{ matrix.GPU }}-${{ matrix.DRIVER }}-1" (commented out with a TODO)
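A sketch of how the Linux test job could tie these pieces together (the job layout and step names are illustrative; only the runner-label pattern and the proxy-cache action come from this PR):

```yaml
test:
  needs: compute-matrix
  strategy:
    fail-fast: false
    matrix:
      include: ${{ fromJSON(needs.compute-matrix.outputs.MATRIX) }}
  # Each matrix entry selects its own self-hosted runner,
  # e.g. linux-amd64-gpu-l4-latest-1
  runs-on: "linux-${{ matrix.ARCH }}-gpu-${{ matrix.GPU }}-${{ matrix.DRIVER }}-1"
  steps:
    - name: Set up proxy cache
      uses: nv-gha-runners/setup-proxy-cache@main
    - name: Checkout
      uses: actions/checkout@v4
```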
Performance Optimizations
- Checkout uses fetch-depth: 1 for the compute-matrix jobs, since git history and tags are not needed there (see below)
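The corresponding checkout step would look roughly like this (the action version is an assumption):

```yaml
- name: Checkout (shallow)
  uses: actions/checkout@v4
  with:
    fetch-depth: 1  # only the matrix file is needed, not git history or tags
```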
Windows GPU Configuration

The Windows test matrix uses both T4 and L4 GPUs to maintain compatibility while leveraging newer capabilities.
This balanced approach ensures testing coverage across both GPU types while favoring L4/T4 over A100 as recommended.
Migration Strategy
The Windows workflow includes TODO comments and a prepared configuration for a seamless migration to self-hosted runners.
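A sketch of what that prepared configuration could look like (the comment wording is an assumption; the two runner labels are the ones listed above):

```yaml
test:
  # TODO: switch to self-hosted runners once they are available:
  # runs-on: "windows-${{ matrix.ARCH }}-gpu-${{ matrix.GPU }}-${{ matrix.DRIVER }}-1"
  runs-on: 'cuda-python-windows-gpu-github'
```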
Preserved Functionality
All existing behavior is maintained:
- matrix_filter input (one possible application is sketched below)
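Whether the workflows apply the filter exactly this way is not shown here, but a common pattern is to treat matrix_filter as a jq expression applied after selecting the platform's entries (inputs.matrix_filter and the platform key are assumptions):

```yaml
- name: Compute and filter the test matrix
  id: compute
  run: |
    # inputs.matrix_filter defaults to the identity filter "." and can be
    # overridden by callers, e.g. '[ .[] | select(.PY_VER == "3.12") ]'.
    MATRIX="$(jq -c '."windows-amd64" | ${{ inputs.matrix_filter }}' ci/test-matrix.json)"
    echo "MATRIX=${MATRIX}" >> "$GITHUB_OUTPUT"
```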
Benefits

The implementation maintains complete backward compatibility while significantly improving maintainability and performance, and it follows the principle of separating data from execution logic. The Windows workflow is fully prepared for a future migration to self-hosted runners when they become available.
Fixes #888.