Commit 54896da

clee2000 authored and facebook-github-bot committed
download test times during build to avoid race conditions (#81915)
Summary:
After #81116, we started pulling test times straight from the source instead of first downloading them in the build job and then having the test jobs take the build job's version. This can cause an issue where different shards pull different versions of the file, leading to incorrect sharding (e.g. two shards running the same test file by accident). This generally happens if the test jobs run while the test times file is being updated (unlikely, but not impossible) or if someone reruns a test job the next day.

In this PR, I return to the old method of downloading the test times file during the build job and having the test jobs pull from the build job's uploaded artifacts. If there is no test times file in the build job's artifacts, we fall back to the default sharding plan.

Notes:
* The script moved to a new file to avoid needing to import torch, which would require torch to be built, which can cause issues with ASAN.
* I got errors with ASAN (`ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.`), so I put the script at the beginning of the build.

### Test Plan
Verified that the number of tests run in the pull and trunk workflows is similar to workflows run on master. Checked logs to confirm that artifacts were being used for sharding. Spot-checked a few test configs to confirm that their lists of selected tests didn't overlap.

Pull Request resolved: #81915
Approved by: https://github.com/huydhn

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/86f038dd56dab9ecec5893b60efc74e46ca19e36

Reviewed By: osalpekar

Differential Revision: D38252585

Pulled By: clee2000

fbshipit-source-id: 912b5fa0977647a79785e24613355ff0879bcacf
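The fallback behavior described above can be sketched as a small standalone helper. This is an illustrative sketch, not the code in run_test.py: the function name and arguments are made up, and `calculate_shards` is replaced by a simple greedy stand-in.

```python
import json
import os
from typing import Dict, List


def pick_shard(selected_tests: List[str], which_shard: int, num_shards: int,
               times_path: str, test_config: str) -> List[str]:
    """Illustrative sketch of the sharding fallback (not the real code).

    If the build job uploaded a test times file with stats for this
    TEST_CONFIG, build a time-balanced plan from it; otherwise fall back
    to the default round-robin sharding plan.
    """
    test_file_times: Dict[str, Dict[str, float]] = {}
    if os.path.exists(times_path):
        with open(times_path) as f:
            test_file_times = json.load(f)
    if test_config not in test_file_times:
        # default sharding plan: shard k takes every num_shards-th test
        return selected_tests[which_shard - 1 :: num_shards]
    times = test_file_times[test_config]
    # greedy stand-in for calculate_shards: give the longest remaining
    # test to the currently lightest shard
    loads = [0.0] * num_shards
    plan: List[List[str]] = [[] for _ in range(num_shards)]
    for test in sorted(selected_tests, key=lambda t: times.get(t, 0.0), reverse=True):
        i = loads.index(min(loads))
        loads[i] += times.get(test, 0.0)
        plan[i].append(test)
    return plan[which_shard - 1]


# with no artifact on disk, both shards fall back to round-robin slices
print(pick_shard(["a", "b", "c", "d"], 1, 2, "no-such-file.json", "default"))  # ['a', 'c']
```

Because every shard slices the same downloaded file, two reruns of a test job can no longer disagree about which shard owns which test.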
1 parent a7e3840 commit 54896da

File tree: 11 files changed (+56 −15 lines)


.github/workflows/_linux-build.yml

Lines changed: 1 addition & 1 deletion
@@ -137,7 +137,7 @@ jobs:
       - name: Archive artifacts into zip
         if: inputs.build-generates-artifacts
         run: |
-          zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin
+          zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json

      - name: Store PyTorch Build Artifacts on S3
        uses: seemethere/upload-artifact-s3@v5

.github/workflows/_mac-build.yml

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ jobs:
       - name: Archive artifacts into zip
         if: inputs.build-generates-artifacts
         run: |
-          zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json
+          zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json

      - name: Store PyTorch Build Artifacts on GHA
        uses: actions/upload-artifact@v2

.jenkins/pytorch/build-asan.sh

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@ source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
echo "Clang version:"
clang --version

+python tools/stats/export_test_times.py
+
# detect_leaks=0: Python is very leaky, so we need suppress it
# symbolize=1: Gives us much better errors when things go wrong
export ASAN_OPTIONS=detect_leaks=0:detect_stack_use_after_return=1:symbolize=1:detect_odr_violation=0

.jenkins/pytorch/build.sh

Lines changed: 6 additions & 0 deletions
@@ -296,4 +296,10 @@ else
  fi
fi

+if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]]; then
+  # export test times so that potential sharded tests that'll branch off this build will use consistent data
+  # don't do this for libtorch as libtorch is C++ only and thus won't have python tests run on its build
+  python tools/stats/export_test_times.py
+fi
+
print_sccache_stats

.jenkins/pytorch/macos-build.sh

Lines changed: 2 additions & 0 deletions
@@ -73,4 +73,6 @@ if which sccache > /dev/null; then
  print_sccache_stats
fi

+python tools/stats/export_test_times.py
+
assert_git_not_dirty

.jenkins/pytorch/win-test-helpers/build_pytorch.bat

Lines changed: 4 additions & 0 deletions
@@ -146,6 +146,10 @@ python setup.py install --cmake && sccache --show-stats && (
  if errorlevel 1 exit /b
  if not errorlevel 0 exit /b

+  :: export test times so that potential sharded tests that'll branch off this build will use consistent data
+  python tools/stats/export_test_times.py
+  copy /Y ".pytorch-test-times.json" "%PYTORCH_FINAL_PACKAGE_DIR%"
+
  :: Also save build/.ninja_log as an artifact
  copy /Y "build\.ninja_log" "%PYTORCH_FINAL_PACKAGE_DIR%\"
)

.jenkins/pytorch/win-test-helpers/test_python_jit_legacy.bat

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
call %SCRIPT_HELPERS_DIR%\setup_pytorch_env.bat

echo Copying over test times file
+copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-times.json" "%PROJECT_DIR_WIN%"

pushd test

.jenkins/pytorch/win-test-helpers/test_python_shard.bat

Lines changed: 3 additions & 0 deletions
@@ -21,6 +21,9 @@ if "%SHARD_NUMBER%" == "1" (
  )
)

+echo Copying over test times file
+copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-times.json" "%PROJECT_DIR_WIN%"
+
echo Run nn tests
python run_test.py --exclude-jit-executor --exclude-distributed-tests --shard "%SHARD_NUMBER%" "%NUM_TEST_SHARDS%" --verbose
if ERRORLEVEL 1 goto fail

test/run_test.py

Lines changed: 17 additions & 10 deletions
@@ -13,6 +13,8 @@
import subprocess
import sys
import tempfile
+import json
+from typing import Dict, Optional, List, cast, Any

import torch
from torch.utils import cpp_extension
@@ -25,14 +27,13 @@
    parser as common_parser,
)
import torch.distributed as dist
-from typing import Optional, List

REPO_ROOT = pathlib.Path(__file__).resolve().parent.parent

try:
    # using tools/ to optimize test run.
    sys.path.append(str(REPO_ROOT))
-    from tools.stats.import_test_stats import get_test_times
+    from tools.stats.export_test_times import TEST_TIMES_FILE
    from tools.testing.test_selections import (
        get_reordered_tests,
        get_test_case_configs,
@@ -72,6 +73,7 @@ def skip_test_p(name: str) -> bool:
    rc += extra_tests
    return sorted(rc)

+
TESTS = discover_tests(
    blocklisted_patterns=[
        'ao',
@@ -268,9 +270,6 @@ def skip_test_p(name: str) -> bool:
    "test_torch"
]

-# the JSON file to store the S3 test stats
-TEST_TIMES_FILE = ".pytorch-test-times.json"
-
# if a test file takes longer than 5 min, we add it to TARGET_DET_LIST
SLOW_TEST_THRESHOLD = 300

@@ -395,14 +394,14 @@ def test_cuda_primary_ctx(test_module, test_directory, options):
        test_module, test_directory, options, extra_unittest_args=["--subprocess"]
    )

+
run_test_with_subprocess = functools.partial(run_test, extra_unittest_args=["--subprocess"])


def get_run_test_with_subprocess_fn():
    return lambda test_module, test_directory, options: run_test_with_subprocess(test_module, test_directory, options)


-
def _test_cpp_extensions_aot(test_directory, options, use_ninja):
    if use_ninja:
        try:
@@ -570,6 +569,7 @@ def test_distributed(test_module, test_directory, options):
    "distributed/rpc/cuda/test_tensorpipe_agent": get_run_test_with_subprocess_fn(),
}

+
def parse_test_module(test):
    return test.split(".")[0]

@@ -862,14 +862,21 @@ def get_selected_tests(options):
        return selected_tests

    # Download previous test times to make sharding decisions
-    test_file_times = get_test_times(str(REPO_ROOT), filename=TEST_TIMES_FILE)
-    if len(test_file_times) == 0:
+    path = os.path.join(str(REPO_ROOT), TEST_TIMES_FILE)
+    if os.path.exists(path):
+        with open(path, "r") as f:
+            test_file_times = cast(Dict[str, Any], json.load(f))
+    else:
+        test_file_times = {}
+    if os.environ["TEST_CONFIG"] not in test_file_times:
        print(
-            "::warning:: Gathered no stats from S3. Proceeding with default sharding plan."
+            "::warning:: Gathered no stats from artifacts. Proceeding with default sharding plan."
        )
        selected_tests = selected_tests[which_shard - 1 :: num_shards]
    else:
-        shards = calculate_shards(num_shards, selected_tests, test_file_times)
+        print("Found test time stats from artifacts")
+        test_file_times_config = test_file_times[os.environ["TEST_CONFIG"]]
+        shards = calculate_shards(num_shards, selected_tests, test_file_times_config)
        _, tests_from_shard = shards[which_shard - 1]
        selected_tests = tests_from_shard

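The new lookup in get_selected_tests implies a test times file keyed first by TEST_CONFIG and then by test file. A minimal sketch of that lookup, with made-up config names and timings (the real file is produced by tools/stats/export_test_times.py):

```python
import json
from io import StringIO

# hypothetical .pytorch-test-times.json contents; the keys and timings are
# invented for illustration, but the two-level shape (config -> file -> seconds)
# matches what the new run_test.py code indexes into
artifact = """{
  "default": {"test_nn": 812.4, "test_torch": 640.9},
  "distributed": {"distributed/test_store": 95.1}
}"""

test_file_times = json.load(StringIO(artifact))
test_config = "default"  # run_test.py reads os.environ["TEST_CONFIG"]
if test_config in test_file_times:
    # these per-config times are what calculate_shards balances over
    test_file_times_config = test_file_times[test_config]
    print(sorted(test_file_times_config))  # ['test_nn', 'test_torch']
else:
    print("::warning:: Gathered no stats from artifacts. Proceeding with default sharding plan.")
```

Keying by TEST_CONFIG lets one artifact serve every test job branched off the same build, which is what removes the version-skew race.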
tools/stats/export_test_times.py

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+import pathlib
+import sys
+
+REPO_ROOT = pathlib.Path(__file__).resolve().parent.parent.parent
+sys.path.append(str(REPO_ROOT))
+from tools.stats.import_test_stats import get_test_times
+
+TEST_TIMES_FILE = ".pytorch-test-times.json"
+
+
+def main() -> None:
+    print(f"Exporting test times from test-infra to {TEST_TIMES_FILE}")
+    get_test_times(str(REPO_ROOT), filename=TEST_TIMES_FILE)
+
+
+if __name__ == "__main__":
+    main()
