ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Updates | Introduction | Usage | Results & Pretrained Models | Statement

Current applications

Image Classification: Please see Usage for a quick start;

Object Detection: Please see ViTAE-Transformer for object detection;

Semantic Segmentation: Please see ViTAE-Transformer for semantic segmentation;

Animal Pose Estimation: Please see ViTAE-Transformer for animal pose estimation;

Matting: Please see ViTAE-Transformer for matting;

Remote Sensing: Please see ViTAE-Transformer for Remote Sensing;

Updates

24/03/2022

  • The pretrained models for both ViTAE and ViTAEv2 are released. The code for downstream tasks is also provided for reference.

07/12/2021

  • The code is released!

19/10/2021

  • The paper has been accepted to NeurIPS 2021! The code will be released soon!

06/08/2021

  • The paper is posted on arXiv! The code will be made publicly available once it is cleaned up.

Introduction

This repository contains the code, models, and test results for the paper ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias. ViTAE contains several reduction cells (RC) and normal cells (NC) that introduce scale-invariance and locality into vision transformers. In ViTAEv2, we explore the use of window attention without shift operations to obtain a better balance between memory footprint, speed, and performance. We also stack the proposed RC and NC in a multi-stage manner to facilitate learning on other vision tasks, including detection, segmentation, and pose estimation.

Fig.1 - The details of RC and NC design in ViTAE.

Fig.2 - The multi-stage design of ViTAEv2.
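
Window attention without shift operations limits self-attention to tokens that fall in the same fixed, non-overlapping window. The sketch below illustrates only that partitioning step on a 2D grid of tokens in plain Python. The function name `window_partition` and the list-of-lists layout are assumptions made for illustration; this is not the repository's implementation.

```python
def window_partition(grid, window):
    """Split an H x W grid of tokens into non-overlapping window x window groups.

    Returns a list of windows in row-major order; each window is a flat list
    of the tokens it contains. No shifting is applied between calls.
    """
    height = len(grid)
    width = len(grid[0])
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                grid[row][col]
                for row in range(top, top + window)
                for col in range(left, left + window)
            ])
    return windows


# A 4 x 4 grid of token ids split into four 2 x 2 windows.
tokens = [[row * 4 + col for col in range(4)] for row in range(4)]
for group in window_partition(tokens, 2):
    print(group)
```

Attention would then be computed inside each group independently, so the cost grows with the window size rather than with the full image size.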

Usage

Install

  • Clone this repo:
git clone https://github.com/ViTAE-Transformer/ViTAE-Transformer
cd ViTAE-Transformer/Image-Classification
  • Create a conda virtual environment and activate it:
conda create -n vitae python=3.7 -y
conda activate vitae
conda install pytorch==1.8.1 torchvision==0.9.1 cudatoolkit=10.2 -c pytorch -c conda-forge
  • Install timm==0.4.12:
pip install timm==0.4.12
  • Install Apex:
git clone https://github.com/NVIDIA/apex
cd apex
git reset --hard a651e2c24ecf97cbf367fd3f330df36760e1c597
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  • Install other requirements:
pip install pyyaml ipdb
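
After installation, it can help to confirm that the pinned versions are the ones actually active in the environment. The following is a minimal sketch that imports each package and compares its `__version__` with the versions listed above; the helper name `check_versions` is hypothetical and not part of this repository.

```python
import importlib

# Versions pinned in the install steps above.
EXPECTED = {"torch": "1.8.1", "torchvision": "0.9.1", "timm": "0.4.12"}


def check_versions(expected=EXPECTED):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            module = importlib.import_module(name)
            installed = getattr(module, "__version__", None)
        except ImportError:
            installed = None  # package is not installed in this environment
        if installed is not None:
            installed = str(installed)
        # A local build tag such as "1.8.1+cu102" still counts as a match.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions().items():
        print(f"{name}: installed={installed} expected={EXPECTED[name]} ok={ok}")
```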

Data Preparation

We use the standard ImageNet dataset, which you can download from http://image-net.org/. The file structure should look like:

$ tree data
imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
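
Before starting a long run, the layout above can be checked with a short script. This is a sketch using only the standard library: it counts class folders and images in `train` and `val` and checks that both splits contain the same class folders. The helper names and the list of accepted file extensions are assumptions, not part of this repository.

```python
import os

IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png")  # assumption: extensions to count


def count_images(class_dir):
    """Count files in one class folder that look like images."""
    return sum(
        1 for name in os.listdir(class_dir)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )


def check_layout(root):
    """Return {split: (num_classes, num_images)} for root/train and root/val."""
    summary = {}
    class_names = {}
    for split in ("train", "val"):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        # Only sub-directories count as classes; stray files are ignored.
        classes = sorted(
            entry for entry in os.listdir(split_dir)
            if os.path.isdir(os.path.join(split_dir, entry))
        )
        images = sum(count_images(os.path.join(split_dir, c)) for c in classes)
        summary[split] = (len(classes), images)
        class_names[split] = classes
    if class_names["train"] != class_names["val"]:
        raise ValueError("train and val contain different class folders")
    return summary
```

For ImageNet-1K, `check_layout("imagenet")` should report 1000 classes in each split.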

Evaluation

Taking ViTAE_basic_Tiny as an example, to evaluate the pretrained ViTAE model on the ImageNet validation set, run

python validate.py [ImageNetPath] --model ViTAE_basic_Tiny --eval_checkpoint [Checkpoint Path]
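
The acc@1 and acc@5 columns in the Results section are top-k accuracies. As a reference for how such a number is defined, here is a plain-Python sketch; `topk_accuracy` is a hypothetical helper written for illustration and is not the code used by `validate.py`.

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one list per sample.
    labels: list of integer class indices, one per sample.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equally long")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


scores = [
    [0.1, 0.7, 0.2],    # predicted class 1
    [0.5, 0.3, 0.2],    # predicted class 0, class 1 is second
    [0.2, 0.3, 0.5],    # predicted class 2, class 1 is second
    [0.9, 0.05, 0.05],  # predicted class 0
]
labels = [1, 1, 0, 0]
print(topk_accuracy(scores, labels, k=1))  # 50.0
print(topk_accuracy(scores, labels, k=2))  # 75.0
```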

Training

Taking ViTAE_basic_Tiny as an example, to train the ViTAE model on ImageNet with 4 GPUs and a batch size of 128 per GPU (512 in total), run

python -m torch.distributed.launch --nproc_per_node=4 main.py [ImageNetPath] --model ViTAE_basic_Tiny -b 128 --lr 1e-3 --weight-decay .03 --img-size 224 --amp
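
The total batch size follows from the per-GPU batch size passed with `-b` and the number of launched processes. The arithmetic is sketched below for single-node and multi-node runs; `global_batch_size` is a hypothetical helper, and it assumes one process per GPU and no gradient accumulation.

```python
def global_batch_size(per_gpu_batch, gpus_per_node, num_nodes=1):
    """Samples processed per step across all GPUs (one process per GPU)."""
    if min(per_gpu_batch, gpus_per_node, num_nodes) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_gpu_batch * gpus_per_node * num_nodes


# The command above: -b 128 with --nproc_per_node=4 on a single node.
print(global_batch_size(128, 4))  # 512
```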

Our code supports multi-node distributed training, and the training scripts of the ViTAEv2 variants are given below.

ViTAEv2_S, ViTAEv2_48M, ViTAEv2_B

The trained model file will be saved under the output folder.

Results

Main Results on ImageNet-1K with pretrained models

| name | resolution | acc@1 | acc@5 | acc@RealTop-1 | Pretrained |
| --- | --- | --- | --- | --- | --- |
| ViTAE-T | 224x224 | 75.3 | 92.7 | 82.9 | Weights & Log |
| ViTAE-6M | 224x224 | 77.9 | 94.1 | 84.9 | Weights & Log |
| ViTAE-S | 224x224 | 82.0 | 95.9 | 87.0 | Weights & Log |
| ViTAE-B | 224x224 | 83.8 | - | 89.4 | Weights |
| ViTAE-L | 224x224 | 86.0 | - | 90.3 | Weights |
| ViTAE-H | 224x224 | 86.9 | - | 90.6 | Weights |
| ViTAEv2-S | 224x224 | 82.6 | 96.2 | 87.6 | Weights & Log |
| ViTAEv2-48M | 224x224 | 83.8 | 96.6 | 88.4 | Weights & Log |
| ViTAEv2-B | 224x224 | 84.6 | 96.9 | 88.7 | Weights & Log |

Models with ImageNet-22K pretraining

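| name | resolution | acc@1 | acc@5 | acc@RealTop-1 | Pretrained |
| --- | --- | --- | --- | --- | --- |
| ViTAE-B | 224x224 | 84.8 | - | 89.9 | Weights |
| ViTAE-L | 224x224 | 87.5 | - | 90.8 | Weights |
| ViTAE-H | 224x224 | 88.0 | - | 90.7 | Weights |
| ViTAEv2-B | 224x224 | 86.1 | 97.9 | 89.9 | Weights |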

The performance with few-shot learning

Fig.3 - Use 1%, 10%, and 100% data for finetuning the ViTAE variants.

Inference speed comparison with ViTAEv2 under different resolutions

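| name | R@224 | R@448 | R@896 | acc@1 |
| --- | --- | --- | --- | --- |
| ViT-S | 1459 | 318 | 48 | 79.9 |
| ViT-B | 803 | 167 | 25 | 81.8 |
| T2T-ViT-14 | 996 | 220 | 33 | 81.2 |
| T2T-ViT-24 | 575 | 118 | 17 | 82.3 |
| Swin-T | 815 | 246 | 60 | 81.3 |
| ViTAEv2-S | 722 | 205 | 46 | 82.6 |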

For resolutions 224, 448, and 896, we use batch sizes of 128, 64, and 16, respectively, during the measurement.
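
The README does not state the unit of the R@ columns. Assuming they report throughput in images per second, a measurement of that kind could be sketched as follows. `measure_throughput` and its `run_batch` callable are hypothetical names, and this is not the script used to produce the table.

```python
import time


def measure_throughput(run_batch, batch_size, num_batches=10, warmup=2):
    """Return images per second for a callable that processes one batch.

    run_batch(batch_size) should run a single forward pass. On a GPU it
    must also wait for the device to finish (for example with
    torch.cuda.synchronize()), otherwise the timing is too optimistic.
    """
    for _ in range(warmup):
        run_batch(batch_size)  # warm-up iterations are not timed
    start = time.perf_counter()
    for _ in range(num_batches):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    return batch_size * num_batches / elapsed


def fake_forward(batch_size):
    """Stand-in for a model call so that the sketch runs anywhere."""
    time.sleep(0.001)


# Batch sizes per resolution as listed above: 128, 64, and 16.
for resolution, batch_size in ((224, 128), (448, 64), (896, 16)):
    rate = measure_throughput(fake_forward, batch_size)
    print(f"R@{resolution}: {rate:.0f} images/s (dummy workload)")
```

The smaller batch sizes at higher resolutions keep memory use manageable, so numbers are comparable across models within one column rather than across columns.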

Statement

This project is for research purposes only. For any other questions, please contact yufei.xu at outlook.com or qmzhangzz at hotmail.com.

Citing ViTAE and ViTAEv2

@article{xu2021vitae,
  title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},
  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}
@article{zhang2022vitaev2,
  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2202.10108},
  year={2022}
}

Other Links

Image Classification: See ViTAE for Image Classification.

Object Detection: See ViTAE for Object Detection.

Semantic Segmentation: See ViTAE for Semantic Segmentation.

Animal Pose Estimation: See ViTAE for Animal Pose Estimation.

Matting: See ViTAE for Matting.

Remote Sensing: See ViTAE for Remote Sensing.