Updates | Introduction | Usage | Results&Pretrained Models | Statement |
Image Classification: Please see Usage for a quick start;
Object Detection: Please see ViTAE-Transformer for object detection;
Semantic Segmentation: Please see ViTAE-Transformer for semantic segmentation;
Animal Pose Estimation: Please see ViTAE-Transformer for animal pose estimation;
Matting: Please see ViTAE-Transformer for matting;
Remote Sensing: Please see ViTAE-Transformer for remote sensing.
24/03/2022
- The pretrained models for both ViTAE and ViTAEv2 are released. The code for downstream tasks is also provided for reference.
07/12/2021
- The code is released!
19/10/2021
- The paper is accepted by NeurIPS 2021! The code will be released soon!
06/08/2021
- The paper is posted on arXiv! The code will be made publicly available once cleaned up.
This repository contains the code, models, and test results for the paper ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias. ViTAE contains several reduction cells (RC) and normal cells (NC) to introduce scale-invariance and locality into vision transformers. In ViTAEv2, we explore the use of window attention without shift operations to obtain a better balance between memory footprint, speed, and performance. We also stack the proposed RC and NC in a multi-stage manner to facilitate learning on other vision tasks, including detection, segmentation, and pose estimation.
Fig.1 - The details of RC and NC design in ViTAE.
Fig.2 - The multi-stage design of ViTAEv2.
- Clone this repo:
git clone https://github.com/ViTAE-Transformer/ViTAE-Transformer
cd ViTAE-Transformer/Image-Classification
- Create a conda virtual environment and activate it:
conda create -n vitae python=3.7 -y
conda activate vitae
conda install pytorch==1.8.1 torchvision==0.9.1 cudatoolkit=10.2 -c pytorch -c conda-forge
- Install timm==0.4.12:
pip install timm==0.4.12
- Install Apex:
git clone https://github.com/NVIDIA/apex
cd apex
git reset --hard a651e2c24ecf97cbf367fd3f330df36760e1c597
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
- Install other requirements:
pip install pyyaml ipdb
We use the standard ImageNet dataset, which you can download from http://image-net.org/. The file structure should look like:
$ tree data
imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
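
Before launching evaluation or training, it can help to confirm that a local copy follows the layout shown above. The snippet below is a minimal sketch, not part of this repository: the function name `summarize_imagenet_layout` and the accepted file suffixes are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: image files use one of these suffixes; adjust to your copy.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def summarize_imagenet_layout(root):
    """Count class folders and image files under root/train and root/val.

    Raises FileNotFoundError if either split folder is missing.
    """
    root = Path(root)
    summary = {}
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split folder: {split_dir}")
        # Every sub-directory of a split is treated as one class.
        class_dirs = sorted(p for p in split_dir.iterdir() if p.is_dir())
        num_images = sum(
            1
            for class_dir in class_dirs
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
        summary[split] = {"classes": len(class_dirs), "images": num_images}
    return summary
```

Calling `summarize_imagenet_layout("data/imagenet")` returns a dictionary with the number of classes and images per split, which can be compared against the counts you expect for your download.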
Taking ViTAE_basic_Tiny as an example, to evaluate the pretrained ViTAE model on the ImageNet validation set, run:
python validate.py [ImageNetPath] --model ViTAE_basic_Tiny --eval_checkpoint [Checkpoint Path]
Taking ViTAE_basic_Tiny as an example, to train the ViTAE model on ImageNet with 4 GPUs and a batch size of 128 per GPU (512 in total), run:
python -m torch.distributed.launch --nproc_per_node=4 main.py [ImageNetPath] --model ViTAE_basic_Tiny -b 128 --lr 1e-3 --weight-decay .03 --img-size 224 --amp
Our code supports multi-node distributed training, and the training scripts of the ViTAEv2 variants are given below.
The trained model file will be saved under the output folder.
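
The relation between the per-GPU batch size and the total batch size in the training command can be made explicit with a small helper. This is an illustrative sketch only: `build_train_command` is a hypothetical name, and it uses no flags beyond those in the single-node command shown in this README. It keeps the learning rate at the value given here, since the README does not say whether it should change with the batch size.

```python
def build_train_command(imagenet_path, num_gpus=4, batch_per_gpu=128,
                        model="ViTAE_basic_Tiny"):
    """Return (argv, total_batch_size) for a single-node training launch.

    Hypothetical helper: it only re-assembles the command from the README.
    """
    # Each process handles one GPU, so the total batch is the product.
    total_batch = num_gpus * batch_per_gpu
    argv = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}",
        "main.py", imagenet_path,
        "--model", model,
        "-b", str(batch_per_gpu),
        "--lr", "1e-3",
        "--weight-decay", ".03",
        "--img-size", "224",
        "--amp",
    ]
    return argv, total_batch


argv, total = build_train_command("[ImageNetPath]")
print(" ".join(argv))
print("total batch size:", total)
```

With the defaults, the printed command matches the one above and the total batch size is 4 x 128 = 512.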
| name | resolution | acc@1 | acc@5 | acc@RealTop-1 | Pretrained |
|---|---|---|---|---|---|
| ViTAE-T | 224x224 | 75.3 | 92.7 | 82.9 | Weights&Log |
| ViTAE-6M | 224x224 | 77.9 | 94.1 | 84.9 | Weights&Log |
| ViTAE-S | 224x224 | 82.0 | 95.9 | 87.0 | Weights&Log |
| ViTAE-B | 224x224 | 83.8 | \ | 89.4 | Weights |
| ViTAE-L | 224x224 | 86.0 | \ | 90.3 | Weights |
| ViTAE-H | 224x224 | 86.9 | \ | 90.6 | Weights |
| ViTAEv2-S | 224x224 | 82.6 | 96.2 | 87.6 | Weights&Log |
| ViTAEv2-48M | 224x224 | 83.8 | 96.6 | 88.4 | Weights&Log |
| ViTAEv2-B | 224x224 | 84.6 | 96.9 | 88.7 | Weights&Log |
| name | resolution | acc@1 | acc@5 | acc@RealTop-1 | Pretrained |
|---|---|---|---|---|---|
| ViTAE-B | 224x224 | 84.8 | \ | 89.9 | Weights |
| ViTAE-L | 224x224 | 87.5 | \ | 90.8 | Weights |
| ViTAE-H | 224x224 | 88.0 | \ | 90.7 | Weights |
| ViTAEv2-B | 224x224 | 86.1 | 97.9 | 89.9 | Weights |
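
For readers unfamiliar with the column names, the sketch below shows how top-k accuracy is commonly computed. It assumes that acc@1 and acc@5 in the tables above follow this usual convention and are reported in percent. It is not the evaluation code of this repository, and it does not cover the acc@RealTop-1 column.

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one per sample.
    labels: list of integer class indices, one per sample.
    """
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal in length")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return 100.0 * hits / len(labels)


scores = [[0.1, 0.7, 0.2], [0.5, 0.2, 0.3], [0.2, 0.3, 0.5]]
labels = [1, 2, 0]
print(topk_accuracy(scores, labels, k=1))
print(topk_accuracy(scores, labels, k=2))
```

In this toy example only the first sample is correct at k=1, while the first two are correct at k=2.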
| name | R@224 | R@448 | R@896 | acc@1 |
|---|---|---|---|---|
| ViT-S | 1459 | 318 | 48 | 79.9 |
| ViT-B | 803 | 167 | 25 | 81.8 |
| T2T-ViT-14 | 996 | 220 | 33 | 81.2 |
| T2T-ViT-24 | 575 | 118 | 17 | 82.3 |
| Swin-T | 815 | 246 | 60 | 81.3 |
| ViTAEv2-S | 722 | 205 | 46 | 82.6 |
For resolutions 224, 448, and 896, we use batch sizes of 128, 64, and 16, respectively, during the measurement.
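
A measurement of this kind can be sketched as follows, assuming the R@ columns report running speed in images per second (the text does not state the unit explicitly). The helper `measure_throughput` and the warm-up and iteration counts are hypothetical choices for illustration, not the procedure used to produce the table.

```python
import time

# Batch sizes stated above for each input resolution.
BATCH_SIZE_BY_RESOLUTION = {224: 128, 448: 64, 896: 16}


def measure_throughput(run_batch, batch_size, num_batches=10, warmup=2):
    """Return images processed per second by run_batch(batch_size).

    run_batch is any callable that processes one batch of the given size.
    """
    for _ in range(warmup):
        run_batch(batch_size)  # warm-up iterations are not timed
    start = time.perf_counter()
    for _ in range(num_batches):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    return num_batches * batch_size / elapsed


def dummy_batch(batch_size):
    """Stand-in workload so the sketch runs without a model or GPU."""
    time.sleep(0.001)


for resolution, batch_size in BATCH_SIZE_BY_RESOLUTION.items():
    speed = measure_throughput(dummy_batch, batch_size, num_batches=3, warmup=1)
    print(f"resolution {resolution}: {speed:.0f} images/s (dummy workload)")
```

When timing a real model on a GPU, the callable would need to wait for the device to finish (for example with `torch.cuda.synchronize()`) before returning, otherwise the timer stops too early.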
This project is for research purposes only. For any other questions, please contact yufei.xu at outlook.com or qmzhangzz at hotmail.com.
@article{xu2021vitae,
title={ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias},
author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
journal={Advances in Neural Information Processing Systems},
volume={34},
year={2021}
}
@article{zhang2022vitaev2,
title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
journal={arXiv preprint arXiv:2202.10108},
year={2022}
}
Image Classification: See ViTAE for Image Classification.
Object Detection: See ViTAE for Object Detection.
Semantic Segmentation: See ViTAE for Semantic Segmentation.
Animal Pose Estimation: See ViTAE for Animal Pose Estimation.
Matting: See ViTAE for Matting.
Remote Sensing: See ViTAE for Remote Sensing.


