[TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures #6836

chzblych · 2025-08-12T15:45:15Z

Summary by CodeRabbit

New Features
- Fetch UCXX from a configurable GitHub mirror when an environment variable is set.
- Docker builds can accept and propagate TRITON_IMAGE and TRITON_BASE_TAG build arguments.
Chores
- Switched base image registry references to internal mirrors for builds.
- Unified retry wrapper and standardized environment output to improve build robustness.
Tests
- Enabled auto-trigger for specific Triton post-merge tests on H100.
- Removed implicit CUDA_HOME defaulting in a pip install unit test.

coderabbitai · 2025-08-12T15:45:22Z

📝 Walkthrough

Adds an env-driven UCXX fetch override in CMake, exposes TRITON build args and switches base registry in docker Makefile, refactors Jenkins build to use a retry wrapper and mirror rewrite, updates L0 test image/mirror logic, adds auto_trigger in an integration YAML, and removes CUDA_HOME auto-default in a unit test.

Changes

Cohort / File(s)	Summary
CMake UCXX mirror override `cpp/CMakeLists.txt`	If UCX is found and `GITHUB_MIRROR` is set and `${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake` exists, read and replace `https://raw.githubusercontent.com/rapidsai/rapids-cmake` with `$ENV{GITHUB_MIRROR}/rapidsai/rapids-cmake/raw/refs/heads`, write back and emit a warning. No API/signature changes.
Docker build variables and registry `docker/Makefile`	Add `TRITON_IMAGE` and `TRITON_BASE_TAG` by extracting `ARG` values from `Dockerfile.multi`; pass them as `--build-arg` when set; change specific `BASE_IMAGE` values from `nvidia/cuda` → `nvcr.io/nvidia/cuda`.
Jenkins build pipeline refactor `jenkins/BuildDockerImage.groovy`	Print `env
Jenkins L0 image registry and guard `jenkins/L0_Test.groovy`	Change DLFW PyTorch image from `nvcr.io/nvidia/pytorch:25.06-py3` to `urm.nvidia.com/docker/nvidia/pytorch:25.06-py3`; add guard to skip pip install sanity check on AArch64 when DLFW image differs.
Tests: integration config `tests/integration/test_lists/test-db/l0_dgx_h100.yml`	Add `auto_trigger: others` to post_merge Triton backend blocks (two locations).
Tests: unit test env handling `tests/unittest/test_pip_install.py`	Remove code that defaulted `CUDA_HOME` to `/usr/local/cuda` when unset.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Jenkins
  participant Groovy as BuildDockerImage.groovy
  participant Utils as trtllm_utils.llmExecStepWithRetry
  participant Docker
  participant Make

  Jenkins->>Groovy: Start pipeline
  Groovy->>Groovy: Parse BASE_IMAGE/TRITON_IMAGE/TRITON_BASE_TAG from files
  Groovy->>Groovy: Override BASE_IMAGE for rockylinux8 if present
  Groovy->>Groovy: Rewrite registry nvcr.io -> urm.nvidia.com/docker
  Groovy->>Utils: llmExecStepWithRetry(docker pull/build ...)
  Utils->>Docker: Execute pull/build (with retries)
  Docker-->>Utils: result
  Utils-->>Groovy: status
  Groovy->>Utils: llmExecStepWithRetry(make target ... BASE_IMAGE/TRITON_IMAGE ...)
  Utils->>Make: Execute make with args
  Make-->>Utils: result
  Utils-->>Groovy: status
  Groovy-->>Jenkins: Build complete

sequenceDiagram
  autonumber
  participant CMake
  participant Env
  participant FS as fetch_rapids.cmake

  CMake->>CMake: Find UCX
  CMake->>Env: Check GITHUB_MIRROR
  alt UCX found and GITHUB_MIRROR set and file exists
    CMake->>FS: Read contents
    CMake->>FS: Replace rapids-cmake URL with mirror URL
    CMake->>FS: Write contents
    CMake-->>CMake: Emit warning about replacement
  end
  CMake-->>CMake: Continue UCXX setup
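
The CMake flow in the diagram above (check the variable, check the file, read, replace, write back) can be sketched in Python for readers who want to try the rewrite outside a build. This is a minimal sketch, not the PR's implementation: the function name `rewrite_fetch_script`, the mirror host `https://mirror.example.com`, the ref `some-branch`, and the sample file body are all hypothetical. Only the two URL prefixes come from the change summary.

```python
import tempfile
from pathlib import Path

UPSTREAM = "https://raw.githubusercontent.com/rapidsai/rapids-cmake"


def rewrite_fetch_script(path: Path, mirror: str) -> bool:
    """Point rapids-cmake URLs in `path` at `mirror`; return True if rewritten."""
    if not mirror or not path.exists():
        return False  # same guards as the diagram: variable set, file present
    mirrored = f"{mirror}/rapidsai/rapids-cmake/raw/refs/heads"
    path.write_text(path.read_text().replace(UPSTREAM, mirrored))
    return True


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "fetch_rapids.cmake"
    # Hypothetical file body; the real script's contents are not shown in this thread.
    script.write_text(f'file(DOWNLOAD {UPSTREAM}/some-branch/RAPIDS.cmake out.cmake)\n')
    changed = rewrite_fetch_script(script, "https://mirror.example.com")
    result = script.read_text()

print(changed)
print(result)
```

In a real build the mirror value would come from the environment, for example `os.environ.get("GITHUB_MIRROR", "")`, so an unset variable leaves the file untouched.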

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

[None][infra] Avoid intermittent access broken to nvcr.io #6715: Similar changes to docker/Makefile and Jenkins plumbing for TRITON image args and registry adjustments.
[TRTLLM-6893][infra] fix Build Docker Image tag issue #6555: Modifies jenkins/BuildDockerImage.groovy (pipeline retry / error handling overlap).
[fix] Update jenkins container images #6094: Overlaps on docker/Makefile Jenkins Rocky Linux target and image selection logic.

Suggested reviewers

Shixiaowei02
yuanjingx87
ZhanruiSunCh
niukuo

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4811d4b and 85aea1a.

📒 Files selected for processing (6)

cpp/CMakeLists.txt (1 hunks)
docker/Makefile (3 hunks)
jenkins/BuildDockerImage.groovy (4 hunks)
jenkins/L0_Test.groovy (2 hunks)
tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
tests/unittest/test_pip_install.py (0 hunks)

💤 Files with no reviewable changes (1)

tests/unittest/test_pip_install.py

🚧 Files skipped from review as they are similar to previous changes (4)

docker/Makefile
jenkins/L0_Test.groovy
cpp/CMakeLists.txt
tests/integration/test_lists/test-db/l0_dgx_h100.yml

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (4)

jenkins/BuildDockerImage.groovy (4)

261-261: LGTM! Improved environment visibility for debugging.

Adding env | sort provides sorted environment output which is helpful for debugging build issues.

298-308: LGTM! Excellent retry mechanism with randomized backoff.

The implementation properly addresses intermittent network failures by:
- Using trtllm_utils.llmExecStepWithRetry for both docker pull and make commands
- Implementing randomized sleep (300-600 seconds) to avoid thundering herd problems
- Setting appropriate timeouts (7200 seconds) and retry counts (3)
- Passing the extracted image variables to the build process
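
The retry pattern described above can be illustrated with a short Python sketch. It is not the `llmExecStepWithRetry` helper itself, whose internals are not shown in this thread. The names `run_with_retry` and `flaky_pull` are hypothetical, and treating the retry count of 3 as the total number of attempts is an assumption. The delay is drawn once per stage from the 300-600 second range quoted above, and the sleep function is injectable so the example runs instantly.

```python
import random
import time
from typing import Callable, List


def run_with_retry(step: Callable[[], None], sleep_secs: int, num_retries: int = 3,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `step` up to `num_retries` times; return the number of attempts used."""
    for attempt in range(1, num_retries + 1):
        try:
            step()
            return attempt
        except Exception:
            if attempt == num_retries:
                raise  # out of attempts: surface the last failure
            sleep(sleep_secs)
    raise ValueError("num_retries must be at least 1")


# One random delay per stage, so parallel jobs do not all retry at the same moment.
random_sleep = int(random.random() * 300 + 300)

calls: List[int] = []
waits: List[float] = []


def flaky_pull() -> None:
    """Stand-in for a docker pull that fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("simulated network failure")


# Record the delays instead of actually sleeping.
attempts = run_with_retry(flaky_pull, random_sleep, sleep=waits.append)
print(attempts, waits)
```

Because each job picks its own delay, jobs that fail together are unlikely to hit the registry again at the same instant, which is the thundering-herd point made in the review.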

326-337: LGTM! Consistent retry pattern implementation.

The main build stage correctly implements the same retry pattern as the dependent builds with proper environment variable passing and randomized backoff.

347-348: LGTM! Consistent image variable passing.

The custom tag stage correctly passes the BASE_IMAGE and TRITON_IMAGE variables to maintain consistency with the other build stages.

ZhanruiSunCh · 2025-08-12T15:48:21Z

/bot run --skip-test --extra-stage "Build-Docker-Images"

tensorrt-cicd · 2025-08-12T15:54:14Z

PR_Github #14995 [ run ] triggered by Bot

chzblych · 2025-08-12T15:58:08Z

/bot run --stage-list "Build-Docker-Images"

chzblych · 2025-08-12T16:06:08Z

/bot run --stage-list "Build-Docker-Images"

tensorrt-cicd · 2025-08-12T16:11:16Z

PR_Github #14999 [ run ] triggered by Bot

tensorrt-cicd · 2025-08-12T16:11:18Z

PR_Github #14995 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-08-12T23:52:41Z

PR_Github #14999 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11326 (Partly Tested) completed with status: 'FAILURE'

chzblych · 2025-08-14T04:51:30Z

/bot run --stage-list "Build-Docker-Images"

tensorrt-cicd · 2025-08-14T04:58:27Z

PR_Github #15224 [ run ] triggered by Bot

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (1)

jenkins/BuildDockerImage.groovy (1)

347-353: Wrap custom tag stage with the same retry strategy

The custom tag flow still uses a plain sh call. For consistency and resiliency, wrap it with llmExecStepWithRetry and pre-pull the TRITON base.

Apply this diff:

-            stage ("custom tag: ${customTag} (${arch})") {
-                sh """
-                cd ${LLM_ROOT} && make -C docker ${target}_${action} \
-                BASE_IMAGE=${BASE_IMAGE} \
-                TRITON_IMAGE=${TRITON_IMAGE} \
-                TORCH_INSTALL_TYPE=${torchInstallType} \
-                IMAGE_WITH_TAG=${customImageWithTag} \
-                STAGE=${dockerfileStage} \
-                BUILD_WHEEL_OPTS='-j ${build_jobs}' ${args}
-                """
-            }
+            stage ("custom tag: ${customTag} (${arch})") {
+                def randomSleep = (Math.random() * 300 + 300).toInteger()
+                trtllm_utils.llmExecStepWithRetry(this, script: "docker pull ${TRITON_IMAGE}:${TRITON_BASE_TAG}", sleepInSecs: randomSleep, shortCommondRunTimeMax: 7200)
+                trtllm_utils.llmExecStepWithRetry(this, script: """
+                cd ${LLM_ROOT} && make -C docker ${target}_${action} \
+                BASE_IMAGE=${BASE_IMAGE} \
+                TRITON_IMAGE=${TRITON_IMAGE} \
+                TORCH_INSTALL_TYPE=${torchInstallType} \
+                IMAGE_WITH_TAG=${customImageWithTag} \
+                STAGE=${dockerfileStage} \
+                BUILD_WHEEL_OPTS='-j ${build_jobs}' ${args}
+                """, sleepInSecs: randomSleep, numRetries: 3, shortCommondRunTimeMax: 7200)
+            }

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d8acca4 and 4811d4b.

📒 Files selected for processing (6)

cpp/CMakeLists.txt (1 hunks)
docker/Makefile (3 hunks)
jenkins/BuildDockerImage.groovy (4 hunks)
jenkins/L0_Test.groovy (1 hunks)
tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
tests/unittest/test_pip_install.py (0 hunks)

💤 Files with no reviewable changes (1)

tests/unittest/test_pip_install.py

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-07-28T17:06:08.621Z

Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

tests/integration/test_lists/test-db/l0_dgx_h100.yml

🔇 Additional comments (9)

jenkins/L0_Test.groovy (1)

42-42: Mirror switch looks fine; ensure cluster can pull from URM

The DLFW image now points to URM. Verify that K8s nodes running these stages have imagePullSecrets or registry access to urm.nvidia.com; otherwise pulls will fail at runtime.

tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)

218-221: Aligns with mako auto_trigger flow

Adding auto_trigger: others to the Triton post_merge block matches how getMakoArgsFromStageName emits auto_trigger and will correctly filter these tests.

cpp/CMakeLists.txt (1)

498-508: Action needed — cannot verify: fetch_rapids.cmake not found in repo; please confirm or apply safer rewrite

I searched the repository and did not find fetch_rapids.cmake, so I cannot confirm whether it always uses branch refs (tags/SHAs may be used). Please confirm the file path (or whether it's generated at build time). If the file is present/managed here, apply the safer rewrite below to avoid breaking non-branch refs and to reduce noisy logs.

Files/locations to check:

cpp/CMakeLists.txt (the snippet at lines ~498–508 that rewrites rapids-cmake URLs)

Expected target file: ${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake (may be absent or generated at build time)

Suggested patch:
-    if(DEFINED ENV{GITHUB_MIRROR} AND NOT "$ENV{GITHUB_MIRROR}" STREQUAL "")
-      if(EXISTS "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake")
-        file(READ "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" FILE_CONTENTS)
-        string(
-          REPLACE "https://raw.githubusercontent.com/rapidsai/rapids-cmake"
-                  "$ENV{GITHUB_MIRROR}/rapidsai/rapids-cmake/raw/refs/heads"
-                  FILE_CONTENTS "${FILE_CONTENTS}")
-        file(WRITE "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" "${FILE_CONTENTS}")
-        message(WARNING "Replace UCXX fetch_rapids.cmake with ${FILE_CONTENTS}")
-      endif()
-    endif()
+    if(DEFINED ENV{GITHUB_MIRROR} AND NOT "$ENV{GITHUB_MIRROR}" STREQUAL "")
+      set(_rapids_fetch "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake")
+      if(EXISTS "${_rapids_fetch}")
+        file(READ "${_rapids_fetch}" _FILE_CONTENTS)
+        set(_OLD "https://raw.githubusercontent.com/rapidsai/rapids-cmake")
+        # Use generic /raw so tags and SHAs also work
+        set(_NEW "$ENV{GITHUB_MIRROR}/rapidsai/rapids-cmake/raw")
+        string(REPLACE "${_OLD}" "${_NEW}" _NEW_CONTENTS "${_FILE_CONTENTS}")
+        if(NOT "${_NEW_CONTENTS}" STREQUAL "${_FILE_CONTENTS}")
+          file(WRITE "${_rapids_fetch}" "${_NEW_CONTENTS}")
+          message(STATUS "Rewrote rapids-cmake URLs in ${_rapids_fetch} to ${_NEW}/*")
+        else()
+          message(STATUS "No rapids-cmake URL to rewrite in ${_rapids_fetch}")
+        endif()
+      endif()
+    endif()
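
To make the branch-versus-tag concern concrete, here is a small string-level Python sketch of what each prefix produces. The mirror host and both refs (`some-branch`, `v0.0.0`) are hypothetical. The sketch only shows the resulting URLs; whether the mirror actually serves either layout is not established in this thread.

```python
UPSTREAM = "https://raw.githubusercontent.com/rapidsai/rapids-cmake"
MIRROR = "https://mirror.example.com"  # hypothetical stand-in for $GITHUB_MIRROR


def rewrite(url: str, suffix: str) -> str:
    """Swap the upstream prefix for the mirror prefix plus `suffix`."""
    return url.replace(UPSTREAM, f"{MIRROR}/rapidsai/rapids-cmake/{suffix}")


# Hypothetical refs, for illustration only.
branch_url = f"{UPSTREAM}/some-branch/RAPIDS.cmake"
tag_url = f"{UPSTREAM}/v0.0.0/RAPIDS.cmake"

for url in (branch_url, tag_url):
    print(rewrite(url, "raw/refs/heads"))  # code in the PR: ref is placed under refs/heads
    print(rewrite(url, "raw"))             # suggested patch: ref kept generic
```

With the `raw/refs/heads` prefix, a tag such as `v0.0.0` ends up in a path that names a branch, which is the case the suggested patch is trying to avoid.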

docker/Makefile (3)

4-5: Good: plumbs TRITON metadata from Dockerfile.multi

Deriving TRITON_IMAGE and TRITON_BASE_TAG here keeps Makefile self-sufficient when Jenkins doesn’t pass them explicitly.

85-87: Build args: pass TRITON metadata through to Dockerfile

Plumbing TRITON_IMAGE and TRITON_BASE_TAG into docker build ensures consistent base pull. Matches Jenkins’ pre-pull behavior.
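
The two Makefile comments above describe reading `ARG` defaults from `Dockerfile.multi` and forwarding them as `--build-arg` only when set. A rough Python sketch of that idea follows. The helper name `dockerfile_arg` and every value in the sample Dockerfile text are hypothetical placeholders, and the Makefile's actual extraction command is not shown in this thread.

```python
import re
from typing import List, Optional


def dockerfile_arg(dockerfile_text: str, name: str) -> Optional[str]:
    """Return the default of the first `ARG name=value` line, if any."""
    match = re.search(rf"^ARG\s+{re.escape(name)}=(\S+)", dockerfile_text, re.MULTILINE)
    return match.group(1) if match else None


# Hypothetical Dockerfile excerpt; the values are placeholders, not the real ones.
sample = """\
ARG BASE_IMAGE=example.com/base
ARG TRITON_IMAGE=example.com/triton
ARG TRITON_BASE_TAG=0.0-py3
FROM ${BASE_IMAGE} AS base
"""

build_args: List[str] = []
for arg in ("TRITON_IMAGE", "TRITON_BASE_TAG"):
    value = dockerfile_arg(sample, arg)
    if value:  # pass --build-arg only when a value is set
        build_args += ["--build-arg", f"{arg}={value}"]
print(build_args)
```

Keeping the Dockerfile as the single source of these defaults means a caller that passes nothing still builds against the same base that Jenkins pre-pulls.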

194-204: Base image source switched to nvcr.io/nvidia/cuda for RL8/Ubuntu22 targets

Looks correct and aligns with the Jenkins rewrite to URM at runtime. No functional issues.
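
The Jenkins rewrite to URM mentioned above amounts to a prefix substitution on the image reference. A minimal Python sketch, assuming the mapping is a plain `nvcr.io/` to `urm.nvidia.com/docker/` swap, as the L0 PyTorch image change in this PR suggests. The function name `to_mirror` is hypothetical and is not part of the Groovy code.

```python
# Hypothetical helper; the real rewrite lives in jenkins/BuildDockerImage.groovy.
UPSTREAM_PREFIX = "nvcr.io/"
MIRROR_PREFIX = "urm.nvidia.com/docker/"


def to_mirror(image: str) -> str:
    """Return the mirrored image reference, or the input unchanged."""
    if image.startswith(UPSTREAM_PREFIX):
        return MIRROR_PREFIX + image[len(UPSTREAM_PREFIX):]
    return image


print(to_mirror("nvcr.io/nvidia/pytorch:25.06-py3"))
# urm.nvidia.com/docker/nvidia/pytorch:25.06-py3
```

Images that do not start with the upstream prefix pass through unchanged, so the rewrite is safe to apply to every base image a stage uses.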

jenkins/BuildDockerImage.groovy (3)

261-261: Minor: env | sort is useful for debugging

This helps reproducibility and auditing of build context. Keep it.

298-309: LGTM: pre-pull and retry wrapper mitigate external flakiness

Pre-pulling the TRITON base with randomized backoff is a solid way to reduce thundering-herd and registry spikes.

326-337: LGTM: main build also benefits from pre-pull + retry

Consistent use of the retry helper and staggered start times should improve success rates for multi-job builds.

chzblych · 2025-08-14T10:11:19Z

/bot run --stage-list "Build-Docker-Images"

tensorrt-cicd · 2025-08-14T10:16:32Z

PR_Github #15286 [ run ] triggered by Bot

tensorrt-cicd · 2025-08-14T20:01:32Z

PR_Github #15286 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11539 (Partly Tested) completed with status: 'FAILURE'

ZhanruiSunCh

LGTM

ZhanruiSunCh · 2025-08-15T02:21:17Z

Is it expected that the mirror will fix the ucxx issue? Judging by the failed pipeline, the ucxx problem will still occur.

chzblych · 2025-08-15T02:44:57Z

/bot skip --comment "LLM/main/L0_MergeRequest_PR pipeline #11539 + LLM/main/L0_MergeRequest pipeline #29153"

tensorrt-cicd · 2025-08-15T02:50:17Z

PR_Github #15378 [ skip ] triggered by Bot

…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

chzblych requested review from a team as code owners August 12, 2025 15:45

chzblych requested review from Shixiaowei02, niukuo and yuanjingx87 August 12, 2025 15:45

chzblych force-pushed the TRTLLM-7141 branch from 028b3e7 to ecc7cb2 Compare August 12, 2025 15:56

chzblych force-pushed the TRTLLM-7141 branch from ecc7cb2 to 4811d4b Compare August 14, 2025 04:50

coderabbitai bot reviewed Aug 14, 2025

chzblych and others added 3 commits August 15, 2025 00:09

[TRTLLM-7141][infra] Use GitHub mirrors to avoid intermittent network…

553efce

… failures Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>

[None][infra] Avoid intermittent access broken to nvcr.io (NVIDIA#6715)

ccd212a

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>

[https://nvbugs/5453827][war] Add a WAR for pip install sanity check …

85aea1a

…first Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>

chzblych force-pushed the TRTLLM-7141 branch from 4811d4b to 85aea1a Compare August 14, 2025 16:22

ZhanruiSunCh approved these changes Aug 15, 2025

coderabbitai bot mentioned this pull request Aug 25, 2025

[None][chore] Some improvements for CI stability #7199

Merged

coderabbitai bot mentioned this pull request Sep 2, 2025

[None][feat] Use a shell context to install dependancies #7383

Merged
