-
Notifications
You must be signed in to change notification settings - Fork 2k
[TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures #6836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
📝 WalkthroughWalkthroughAdds an env-driven UCXX fetch override in CMake, exposes TRITON build args and switches base registry in docker Makefile, refactors Jenkins build to use a retry wrapper and mirror rewrite, updates L0 test image/mirror logic, adds auto_trigger in an integration YAML, and removes CUDA_HOME auto-default in a unit test. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Jenkins
participant Groovy as BuildDockerImage.groovy
participant Utils as trtllm_utils.llmExecStepWithRetry
participant Docker
participant Make
Jenkins->>Groovy: Start pipeline
Groovy->>Groovy: Parse BASE_IMAGE/TRITON_IMAGE/TRITON_BASE_TAG from files
Groovy->>Groovy: Override BASE_IMAGE for rockylinux8 if present
Groovy->>Groovy: Rewrite registry nvcr.io -> urm.nvidia.com/docker
Groovy->>Utils: llmExecStepWithRetry(docker pull/build ...)
Utils->>Docker: Execute pull/build (with retries)
Docker-->>Utils: result
Utils-->>Groovy: status
Groovy->>Utils: llmExecStepWithRetry(make target ... BASE_IMAGE/TRITON_IMAGE ...)
Utils->>Make: Execute make with args
Make-->>Utils: result
Utils-->>Groovy: status
Groovy-->>Jenkins: Build complete
sequenceDiagram
autonumber
participant CMake
participant Env
participant FS as fetch_rapids.cmake
CMake->>CMake: Find UCX
CMake->>Env: Check GITHUB_MIRROR
alt UCX found and GITHUB_MIRROR set and file exists
CMake->>FS: Read contents
CMake->>FS: Replace rapids-cmake URL with mirror URL
CMake->>FS: Write contents
CMake-->>CMake: Emit warning about replacement
end
CMake-->>CMake: Continue UCXX setup
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. 📜 Recent review detailsConfiguration used: .coderabbit.yaml 💡 Knowledge Base configuration:
You can enable these settings in your CodeRabbit configuration. 📒 Files selected for processing (6)
💤 Files with no reviewable changes (1)
🚧 Files skipped from review as they are similar to previous changes (4)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🔇 Additional comments (4)
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
|
/bot run --skip-test --extra-stage "Build-Docker-Images" |
|
PR_Github #14995 [ run ] triggered by Bot |
|
/bot run --stage-list "Build-Docker-Images" |
1 similar comment
|
/bot run --stage-list "Build-Docker-Images" |
|
PR_Github #14999 [ run ] triggered by Bot |
|
PR_Github #14995 [ run ] completed with state |
|
PR_Github #14999 [ run ] completed with state |
|
/bot run --stage-list "Build-Docker-Images" |
|
PR_Github #15224 [ run ] triggered by Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
jenkins/BuildDockerImage.groovy (1)
347-353: Wrap custom tag stage with the same retry strategyThe custom tag flow still uses a plain sh call. For consistency and resiliency, wrap it with llmExecStepWithRetry and pre-pull the TRITON base.
Apply this diff:
- stage ("custom tag: ${customTag} (${arch})") { - sh """ - cd ${LLM_ROOT} && make -C docker ${target}_${action} - BASE_IMAGE=${BASE_IMAGE} - TRITON_IMAGE=${TRITON_IMAGE} - TORCH_INSTALL_TYPE=${torchInstallType} - IMAGE_WITH_TAG=${customImageWithTag} - STAGE=${dockerfileStage} - BUILD_WHEEL_OPTS='-j ${build_jobs}' ${args} - """ - } + stage ("custom tag: ${customTag} (${arch})") { + def randomSleep = (Math.random() * 300 + 300).toInteger() + trtllm_utils.llmExecStepWithRetry(this, script: "docker pull ${TRITON_IMAGE}:${TRITON_BASE_TAG}", sleepInSecs: randomSleep, shortCommondRunTimeMax: 7200) + trtllm_utils.llmExecStepWithRetry(this, script: """ + cd ${LLM_ROOT} && make -C docker ${target}_${action} \ + BASE_IMAGE=${BASE_IMAGE} \ + TRITON_IMAGE=${TRITON_IMAGE} \ + TORCH_INSTALL_TYPE=${torchInstallType} \ + IMAGE_WITH_TAG=${customImageWithTag} \ + STAGE=${dockerfileStage} \ + BUILD_WHEEL_OPTS='-j ${build_jobs}' ${args} + """, sleepInSecs: randomSleep, numRetries: 3, shortCommondRunTimeMax: 7200) + }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
cpp/CMakeLists.txt(1 hunks)docker/Makefile(3 hunks)jenkins/BuildDockerImage.groovy(4 hunks)jenkins/L0_Test.groovy(1 hunks)tests/integration/test_lists/test-db/l0_dgx_h100.yml(1 hunks)tests/unittest/test_pip_install.py(0 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/test_pip_install.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_dgx_h100.yml
🔇 Additional comments (9)
jenkins/L0_Test.groovy (1)
42-42: Mirror switch looks fine; ensure cluster can pull from URMThe DLFW image now points to URM. Verify that K8s nodes running these stages have imagePullSecrets or registry access to urm.nvidia.com; otherwise pulls will fail at runtime.
tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)
218-221: Aligns with mako auto_trigger flowAdding auto_trigger: others to the Triton post_merge block matches how getMakoArgsFromStageName emits auto_trigger and will correctly filter these tests.
cpp/CMakeLists.txt (1)
498-508: Action needed — cannot verify: fetch_rapids.cmake not found in repo; please confirm or apply safer rewriteI searched the repository and did not find fetch_rapids.cmake, so I cannot confirm whether it always uses branch refs (tags/SHAs may be used). Please confirm the file path (or whether it's generated at build time). If the file is present/managed here, apply the safer rewrite below to avoid breaking non-branch refs and to reduce noisy logs.
Files/locations to check:
- cpp/CMakeLists.txt (the snippet at lines ~498–508 that rewrites rapids-cmake URLs)
- Expected target file: ${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake (may be absent or generated at build time)
Suggested patch:
- if(DEFINED ENV{GITHUB_MIRROR} AND NOT "$ENV{GITHUB_MIRROR}" STREQUAL "") - if(EXISTS "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake") - file(READ "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" FILE_CONTENTS) - string( - REPLACE "https://raw.githubusercontent.com/rapidsai/rapids-cmake" - "$ENV{GITHUB_MIRROR}/rapidsai/rapids-cmake/raw/refs/heads" - FILE_CONTENTS "${FILE_CONTENTS}") - file(WRITE "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" "${FILE_CONTENTS}") - message(WARNING "Replace UCXX fetch_rapids.cmake with ${FILE_CONTENTS}") - endif() - endif() + if(DEFINED ENV{GITHUB_MIRROR} AND NOT "$ENV{GITHUB_MIRROR}" STREQUAL "") + set(_rapids_fetch "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake") + if(EXISTS "${_rapids_fetch}") + file(READ "${_rapids_fetch}" _FILE_CONTENTS) + set(_OLD "https://raw.githubusercontent.com/rapidsai/rapids-cmake") + # Use generic /raw so tags and SHAs also work + set(_NEW "$ENV{GITHUB_MIRROR}/rapidsai/rapids-cmake/raw") + string(REPLACE "${_OLD}" "${_NEW}" _NEW_CONTENTS "${_FILE_CONTENTS}") + if(NOT "${_NEW_CONTENTS}" STREQUAL "${_FILE_CONTENTS}") + file(WRITE "${_rapids_fetch}" "${_NEW_CONTENTS}") + message(STATUS "Rewrote rapids-cmake URLs in ${_rapids_fetch} to ${_NEW}/*") + else() + message(STATUS "No rapids-cmake URL to rewrite in ${_rapids_fetch}") + endif() + endif() + endif()docker/Makefile (3)
4-5: Good: plumbs TRITON metadata from Dockerfile.multiDeriving TRITON_IMAGE and TRITON_BASE_TAG here keeps Makefile self-sufficient when Jenkins doesn’t pass them explicitly.
85-87: Build args: pass TRITON metadata through to DockerfilePlumbing TRITON_IMAGE and TRITON_BASE_TAG into docker build ensures consistent base pull. Matches Jenkins’ pre-pull behavior.
194-204: Base image source switched to nvcr.io/nvidia/cuda for RL8/Ubuntu22 targetsLooks correct and aligns with the Jenkins rewrite to URM at runtime. No functional issues.
jenkins/BuildDockerImage.groovy (3)
261-261: Minor: env | sort is useful for debuggingThis helps reproducibility and auditing of build context. Keep it.
298-309: LGTM: pre-pull and retry wrapper mitigate external flakinessPre-pulling the TRITON base with randomized backoff is a solid way to reduce thundering-herd and registry spikes.
326-337: LGTM: main build also benefits from pre-pull + retryConsistent use of the retry helper and staggered start times should improve success rates for multi-job builds.
|
/bot run --stage-list "Build-Docker-Images" |
|
PR_Github #15286 [ run ] triggered by Bot |
… failures Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
…first Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
|
PR_Github #15286 [ run ] completed with state |
ZhanruiSunCh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Is it expected that the mirror will fix the ucxx issue? Judging by the failed pipeline, the ucxx problem will still occur. |
|
/bot skip --comment "LLM/main/L0_MergeRequest_PR pipeline #11539 + LLM/main/L0_MergeRequest pipeline #29153" |
|
PR_Github #15378 [ skip ] triggered by Bot |
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Summary by CodeRabbit
New Features
Chores
Tests