
Commit 75c190b

Update base for Update on "Improved perfs for vectorized bilinear interpolate cpu uint8 RGB-case (channels last)"
## Description

- Based on #96651
- Improved perfs for vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
- Unified RGB and RGBA processing code such that the RGB input is not copied into RGBA
- Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
- RGBA-case perfs are unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[--------------------------------------------------------------------------------- Resize ---------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: -----------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True       |   38.674 (+-0.323)   |   57.591 (+-0.244)   |   131.033 (+-1.448)   |  2.275 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False      |                      |   39.471 (+-0.166)   |   113.911 (+-1.736)   |  2.886 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True     |  128.512 (+-1.916)   |  161.592 (+-1.242)   |   299.679 (+-2.099)   |  1.855 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False    |                      |  150.994 (+-1.180)   |   285.331 (+-1.919)   |  1.890 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True     |  180.045 (+-2.223)   |  220.581 (+-1.363)   |   431.057 (+-3.536)   |  1.954 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False    |                      |  219.391 (+-1.409)   |   429.410 (+-3.620)   |  1.957 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True       |  113.911 (+-1.024)   |  129.457 (+-1.295)   |   459.610 (+-13.322)  |  3.550 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False      |                      |   59.800 (+-0.199)   |   400.015 (+-11.815)  |  6.689 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True     |  283.050 (+-2.664)   |  339.143 (+-1.209)   |   683.555 (+-4.466)   |  2.016 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False    |                      |  250.601 (+-1.236)   |   603.545 (+-2.644)   |  2.408 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True       |  186.723 (+-2.213)   |  199.960 (+-1.343)   |   860.867 (+-21.763)  |  4.305 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False      |                      |   79.188 (+-0.261)   |   703.019 (+-25.805)  |  8.878 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True     |  412.353 (+-4.476)   |  462.230 (+-1.983)   |  1101.673 (+-49.299)  |  2.383 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False    |                      |  327.973 (+-1.852)   |   941.062 (+-5.549)   |  2.869 (+-0.000)
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True       |   61.191 (+-0.926)   |   80.795 (+-0.518)   |   160.853 (+-1.506)   |  1.991 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True     |  134.488 (+-2.129)   |  169.147 (+-1.324)   |   327.343 (+-2.846)   |  1.935 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True   | 1037.045 (+-24.982)  |  938.623 (+-9.010)   |  2603.360 (+-20.530)  |  2.774 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True       |   52.792 (+-0.613)   |   73.692 (+-0.264)   |   131.829 (+-1.333)   |  1.789 (+-0.000)
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True     |  139.596 (+-1.944)   |  173.778 (+-1.039)   |   320.063 (+-2.562)   |  1.842 (+-0.000)
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True   |  690.132 (+-10.946)  |  772.758 (+-2.864)   |  2036.860 (+-36.109)  |  2.636 (+-0.000)
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False      |                      |   78.747 (+-0.799)   |   158.479 (+-1.702)   |  2.013 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False    |                      |  167.046 (+-1.077)   |   322.104 (+-2.764)   |  1.928 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False  |                      |  918.967 (+-5.251)   |  2611.388 (+-29.917)  |  2.842 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False      |                      |   55.336 (+-0.251)   |   113.869 (+-1.243)   |  2.058 (+-0.000)
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False    |                      |  156.505 (+-1.095)   |   299.861 (+-2.710)   |  1.916 (+-0.000)
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False  |                      |  514.344 (+-1.905)   |  1776.796 (+-19.660)  |  3.454 (+-0.000)
```

Note: there is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest, the ratio is roughly 1.0 +/- 0.1, which may be attributed to noisy measurements.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 datumbox pmeier

[ghstack-poisoned]
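For context, below is a minimal sketch of how one of the measured cases could be reproduced with `torch.utils.benchmark`. It is not the benchmark script used for the table above; the sizes, the single-thread setting, and the use of `F.interpolate` with `antialias=True` are assumptions chosen to mirror the `(256, 256) -> (224, 224) aa=True` row.

```python
# Minimal sketch (assumption: not the actual benchmark harness behind the table above)
# of the uint8 / channels_last / bilinear / antialias resize case.
# Requires a torch build with uint8 CPU bilinear interpolation support (e.g. a 2.1 nightly).
import torch
import torch.nn.functional as F
from torch.utils import benchmark

# 1x3 uint8 RGB image, forced into channels_last memory format
x = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8)
x = x.contiguous(memory_format=torch.channels_last)

timer = benchmark.Timer(
    stmt="F.interpolate(x, size=(224, 224), mode='bilinear', antialias=True)",
    globals={"F": F, "x": x},
    num_threads=1,  # the table reports single-threaded timings
)
print(timer.blocked_autorange(min_run_time=2))
```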
2 parents ac5e824 + 53c9bc8 commit 75c190b

File tree

449 files changed: +36568, -7404 lines

Lines changed: 1 addition & 1 deletion
```diff
@@ -1 +1 @@
-e650d3708be4dca12cc3491a2f8ab18ded47c368
+46672772b46b103db7341c9e10fbad7f643557d4
```

.ci/docker/common/install_conda.sh

Lines changed: 3 additions & 3 deletions
```diff
@@ -57,12 +57,12 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
   elif [ "$ANACONDA_PYTHON_VERSION" = "3.10" ]; then
     conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
   elif [ "$ANACONDA_PYTHON_VERSION" = "3.9" ]; then
-    conda_install numpy=1.19.2 ${CONDA_COMMON_DEPS}
+    conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
   elif [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
-    conda_install numpy=1.18.5 ${CONDA_COMMON_DEPS}
+    conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
   else
     # Install `typing-extensions` for 3.7
-    conda_install numpy=1.18.5 ${CONDA_COMMON_DEPS} typing-extensions
+    conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS} typing-extensions
   fi

   # This is only supported in 3.8 upward
```

.ci/docker/common/install_onnx.sh

Lines changed: 7 additions & 2 deletions
```diff
@@ -12,8 +12,13 @@ pip_install \
   mock==5.0.1 \
   ninja==1.10.2 \
   networkx==2.0 \
-  numpy==1.22.4 \
-  onnx==1.13.1 \
+  numpy==1.22.4
+
+# TODO: use official onnx package once it's released
+# for now, use the commit from 1.13.1-protobuf4.21 branch
+pip_install "onnx@git+https://github.com/onnx/onnx@389b6bcb05b9479d149d29b2461fbffe8472ed14"
+
+pip_install \
   onnxruntime==1.14.0 \
   parameterized==0.8.1 \
   pytest-cov==4.0.0 \
```

.ci/pytorch/test.sh

Lines changed: 19 additions & 20 deletions
```diff
@@ -264,7 +264,9 @@ test_inductor() {
   # and .github/workflows/inductor.yml
   DYNAMO_BENCHMARK_FLAGS=()

-  if [[ "${TEST_CONFIG}" == *aot_eager* ]]; then
+  if [[ "${TEST_CONFIG}" == *dynamo_eager* ]]; then
+    DYNAMO_BENCHMARK_FLAGS+=(--backend eager)
+  elif [[ "${TEST_CONFIG}" == *aot_eager* ]]; then
     DYNAMO_BENCHMARK_FLAGS+=(--backend aot_eager)
   elif [[ "${TEST_CONFIG}" == *inductor* && "${TEST_CONFIG}" != *perf* ]]; then
     DYNAMO_BENCHMARK_FLAGS+=(--inductor)
@@ -288,14 +290,7 @@ test_perf_for_dashboard() {
   shift

   for dtype in amp float32; do
-    # Run accuracy test
     # All the accuracy tests can be skipped once the CI accuracy checking is stable enough
-    for backend in eager aot_eager; do
-      python "benchmarks/dynamo/$suite.py" \
-        --accuracy --"$dtype" --backend "$backend" "$@" \
-        --output "$TEST_REPORTS_DIR/${backend}_${suite}_${dtype}_training_cuda_accuracy.csv"
-    done
-
     # Run accuracy test for inductor with different configs
     # --disable-cudagraphs is the default inductor behavior
     # TODO: update here once cudagraphs is turned on as default
@@ -306,17 +301,23 @@ test_perf_for_dashboard() {
     python "benchmarks/dynamo/$suite.py" \
       --accuracy --"$dtype" --backend "$backend" "$@" \
       --output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_training_cuda_accuracy.csv"
+    python "benchmarks/dynamo/$suite.py" \
+      --accuracy --"$dtype" --backend "$backend" --dynamic-shapes --disable-cudagraphs "$@" \
+      --output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_training_cuda_accuracy.csv"

     # Run performance test
     # Skip dynamo-eager and aot-eager for performance test
     # Run performance test for inductor with different configs
-    # TODO: add more configs here, e.g. dynamic-shapes, max-autotune, etc.
+    # TODO: add more configs here, e.g. max-autotune, etc.
     python "benchmarks/dynamo/$suite.py" \
       --performance --cold-start-latency --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
       --output "$TEST_REPORTS_DIR/${backend}_no_cudagraphs_${suite}_${dtype}_training_cuda_performance.csv"
     python "benchmarks/dynamo/$suite.py" \
       --performance --cold-start-latency --"$dtype" --backend "$backend" "$@" \
       --output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_training_cuda_performance.csv"
+    python "benchmarks/dynamo/$suite.py" \
+      --performance --cold-start-latency --"$dtype" --backend "$backend" --dynamic-shapes --disable-cudagraphs "$@" \
+      --output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_training_cuda_performance.csv"
   done
 }

@@ -592,9 +593,7 @@ test_distributed() {
   "$TORCH_BIN_DIR"/TCPStoreTest --gtest_output=xml:$TEST_REPORTS_DIR/TCPStoreTest.xml

   MPIEXEC=$(command -v mpiexec)
-  # TODO: this is disabled on GitHub Actions until this issue is resolved
-  # https://github.com/pytorch/pytorch/issues/60756
-  if [[ -n "$MPIEXEC" ]] && [[ -z "$GITHUB_ACTIONS" ]]; then
+  if [[ -n "$MPIEXEC" ]]; then
     MPICMD="${MPIEXEC} -np 2 $TORCH_BIN_DIR/ProcessGroupMPITest"
     eval "$MPICMD"
   fi
@@ -874,14 +873,6 @@ elif [[ "$TEST_CONFIG" == deploy ]]; then
 elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
   install_huggingface
   test_inductor_distributed
-elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
-  test_without_numpy
-  install_torchvision
-  test_dynamo_shard 1
-  test_aten
-elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then
-  install_torchvision
-  test_dynamo_shard 2
 elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then
   install_torchvision
   install_huggingface
@@ -912,6 +903,14 @@ elif [[ "${TEST_CONFIG}" == *inductor* && "${SHARD_NUMBER}" == 1 ]]; then
   install_torchvision
   test_inductor
   test_inductor_distributed
+elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
+  test_without_numpy
+  install_torchvision
+  test_dynamo_shard 1
+  test_aten
+elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then
+  install_torchvision
+  test_dynamo_shard 2
 elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
   test_without_numpy
   install_torchvision
```

.flake8

Lines changed: 4 additions & 2 deletions
```diff
@@ -6,13 +6,15 @@ max-line-length = 120
 # E501 is not flexible enough, we're using B950 instead
 ignore =
     E203,E305,E402,E501,E721,E741,F405,F821,F841,F999,W503,W504,C408,E302,W291,E303,
+    # fix these lints in the future
+    E275,
     # shebang has extra meaning in fbcode lints, so I think it's not worth trying
     # to line this up with executable bit
     EXE001,
     # these ignores are from flake8-bugbear; please fix!
-    B007,B008,
+    B007,B008,B017,B019,B020,B023,B024,B026,B027,B028,B903,B904,B905,B906,B907
     # these ignores are from flake8-comprehensions; please fix!
-    C407
+    C407,C417
     # these ignores are from flake8-logging-format; please fix!
     G001,G002,G003,G004,G100,G101,G200,G201,G202
 per-file-ignores =
```

.github/auto_request_review.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -15,6 +15,8 @@ reviewers:
      - antoniojkim
      - wconstab
      - SherlockNoMad
+    Chillee:
+      - ezyang

 files:
   # none yet, TODO: migrate CODEOWNERS here
```

.github/ci_commit_pins/vision.txt

Lines changed: 1 addition & 1 deletion
```diff
@@ -1 +1 @@
-18a2e8eb5c6e30e2bc22416379b10f5dfaccc4d4
+0387b8821d67ca62d57e3b228ade45371c0af79d
```

.github/ci_commit_pins/xla.txt

Lines changed: 1 addition & 1 deletion
```diff
@@ -1 +1 @@
-015ebcba441dbd5dd21dc02ef12af2c29791a7f0
+5444e06e5b851211af8a83e024c6703acfc095eb
```

.github/merge_rules.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -85,7 +85,7 @@
   - EasyCLA
   - Lint
   - pull / linux-bionic-py3_8-clang8-xla / build
-  - pull / linux-bionic-py3_8-clang8-xla / test (xla, 1, 1, linux.2xlarge)
+  - pull / linux-bionic-py3_8-clang8-xla / test (xla, 1, 1, linux.4xlarge)

 - name: Documentation
   patterns:
```

.github/requirements/conda-env-macOS-X64

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 mkl=2021.2.0
 mkl-include=2021.2.0
-numpy=1.18.5
+numpy=1.21.2
 pyyaml=5.3
 setuptools=46.0.0
 cmake=3.22.*
```
