ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Current applications

Image Classification: Please see Usage for a quick start;

Object Detection: Please see ViTAE-Transformer for object detection;

Semantic Segmentation: Please see ViTAE-Transformer for semantic segmentation;

Animal Pose Estimation: Please see ViTAE-Transformer for animal pose estimation;

Matting: Please see ViTAE-Transformer for matting;

Remote Sensing: Please see ViTAE-Transformer for remote sensing;

Updates

24/03/2022

The pretrained models for both ViTAE and ViTAEv2 are released. The code for downstream tasks is also provided for reference.

07/12/2021

The code is released!

19/10/2021

The paper is accepted by NeurIPS 2021! The code will be released soon!

06/08/2021

The paper is posted on arXiv! The code will be made publicly available once cleaned up.

Introduction

This repository contains the code, models, and test results for the paper ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias. ViTAE contains several reduction cells (RC) and normal cells (NC) that introduce scale-invariance and locality into vision transformers. In ViTAEv2, we explore the use of window attention without shift operations to obtain a better balance between memory footprint, speed, and performance. We also stack the proposed RC and NC in a multi-stage manner to facilitate learning on other vision tasks, including detection, segmentation, and pose estimation.

Fig.1 - The details of RC and NC design in ViTAE.

Fig.2 - The multi-stage design of ViTAEv2.
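
The introduction mentions window attention without shift operations. As a rough illustration of what non-shifted window partitioning means, the sketch below groups the token indices of a small grid into fixed, non-overlapping windows. It is not the repository's implementation: the function name `window_partition` and the toy 4x4 grid are assumptions made only for this example.

```python
def window_partition(height, width, window):
    """Group token indices of a height x width grid into non-overlapping windows.

    Tokens are numbered row by row. Each returned list holds the indices that
    would attend to each other inside one window. No shift is applied, so the
    window boundaries are the same every time the function is called.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                (top + dy) * width + (left + dx)
                for dy in range(window)
                for dx in range(window)
            ])
    return windows


# A 4x4 token grid split into 2x2 windows.
for tokens in window_partition(4, 4, 2):
    print(tokens)
```

For the 4x4 grid this prints four windows: `[0, 1, 4, 5]`, `[2, 3, 6, 7]`, `[8, 9, 12, 13]`, and `[10, 11, 14, 15]`. Each token only shares a window with its spatial neighbours, which is why the attention cost grows with the number of windows rather than with the square of the number of tokens.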

Usage

Install

Clone this repo:

git clone https://github.com/ViTAE-Transformer/ViTAE-Transformer
cd ViTAE-Transformer/Image-Classification

Create a conda virtual environment and activate it:

conda create -n vitae python=3.7 -y
conda activate vitae

conda install pytorch==1.8.1 torchvision==0.9.1 cudatoolkit=10.2 -c pytorch -c conda-forge

Install timm==0.4.12:

pip install timm==0.4.12

Install Apex:

git clone https://github.com/NVIDIA/apex
cd apex
git reset --hard a651e2c24ecf97cbf367fd3f330df36760e1c597
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Install other requirements:

pip install pyyaml ipdb

Data Preparation

We use the standard ImageNet dataset, which you can download from http://image-net.org/. The file structure should look like:

$ tree data
imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
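
Before launching a long run, it can help to confirm that the folder really follows this layout. The following is a minimal sketch using only the Python standard library. The helper name `summarize_imagenet` and the set of accepted image extensions are assumptions for illustration and are not part of this repository.

```python
from pathlib import Path

# Assumed set of image file extensions; adjust to match your copy of the data.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def summarize_imagenet(root):
    """Return {split: (num_classes, num_images)} for the train/val folders under root."""
    root = Path(root)
    summary = {}
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split folder: {split_dir}")
        # Every sub-folder of a split is treated as one class.
        class_dirs = [d for d in split_dir.iterdir() if d.is_dir()]
        num_images = sum(
            1
            for d in class_dirs
            for f in d.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
        summary[split] = (len(class_dirs), num_images)
    return summary
```

Calling `summarize_imagenet("data/imagenet")` returns the class and image counts per split. For ImageNet-1K, both splits should report 1000 classes.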

Evaluation

Taking ViTAE_basic_Tiny as an example, to evaluate the pretrained ViTAE model on the ImageNet validation set, run:

python validate.py [ImageNetPath] --model ViTAE_basic_Tiny --eval_checkpoint [Checkpoint Path]
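
The Results section reports acc@1 and acc@5, that is, top-1 and top-5 accuracy. As a reminder of what these metrics measure, here is a minimal pure-Python sketch. It is not the code used by `validate.py`, and the function name `topk_accuracy` and the toy scores are assumptions for illustration only.

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        correct += label in topk
    return 100.0 * correct / len(labels)


# Three samples, four classes (toy numbers, not model output).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # best class 1
    [0.5, 0.1, 0.3, 0.1],  # best class 0, runner-up class 2
    [0.3, 0.1, 0.5, 0.1],  # best class 2, runner-up class 0
]
labels = [1, 2, 1]
print(topk_accuracy(scores, labels, k=1))  # one of three samples is correct
print(topk_accuracy(scores, labels, k=2))  # two of three samples are correct
```

With these toy scores, one sample in three is correct at k=1 and two in three at k=2, so the top-k accuracy can only stay the same or increase as k grows.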

Training

Taking ViTAE_basic_Tiny as an example, to train the ViTAE model on ImageNet with 4 GPUs and a batch size of 128 per GPU (512 in total), run:

python -m torch.distributed.launch --nproc_per_node=4 main.py [ImageNetPath] --model ViTAE_basic_Tiny -b 128 --lr 1e-3 --weight-decay .03 --img-size 224 --amp
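
If you adapt the command to a different number of GPUs, the total batch size changes with it. The sketch below assembles the same single-node command from its parameters and computes the total batch size. The helper names `build_train_command` and `total_batch_size` are hypothetical and not part of this repository. Only flags that appear in the command above are used, and the sketch does not address whether `--lr` should change with the total batch size.

```python
def total_batch_size(gpus, batch_per_gpu, nodes=1):
    """Total number of samples processed per optimization step."""
    return gpus * batch_per_gpu * nodes


def build_train_command(data_path, model="ViTAE_basic_Tiny", gpus=4, batch_per_gpu=128):
    """Assemble the single-node launch command shown above as an argument list."""
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={gpus}",
        "main.py", data_path,
        "--model", model,
        "-b", str(batch_per_gpu),
        "--lr", "1e-3",
        "--weight-decay", ".03",
        "--img-size", "224",
        "--amp",
    ]


print(" ".join(build_train_command("[ImageNetPath]")))
print(total_batch_size(gpus=4, batch_per_gpu=128))  # 4 * 128 = 512
```

The argument list can be passed to `subprocess.run` directly, which avoids shell quoting issues when the dataset path contains spaces.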

Our code supports multi-node distributed training, and the training scripts of the ViTAEv2 variants are given below.

ViTAEv2_S, ViTAEv2_48M, ViTAEv2_B

The trained model file will be saved under the output folder.

Results

Main Results on ImageNet-1K with pretrained models

name	resolution	acc@1	acc@5	acc@RealTop-1	Pretrained
ViTAE-T	224x224	75.3	92.7	82.9	Weights&Log
ViTAE-6M	224x224	77.9	94.1	84.9	Weights&Log
ViTAE-S	224x224	82.0	95.9	87.0	Weights&Log
ViTAE-B	224x224	83.8	\	89.4	Weights
ViTAE-L	224x224	86.0	\	90.3	Weights
ViTAE-H	224x224	86.9	\	90.6	Weights
ViTAEv2-S	224x224	82.6	96.2	87.6	Weights&Log
ViTAEv2-48M	224x224	83.8	96.6	88.4	Weights&Log
ViTAEv2-B	224x224	84.6	96.9	88.7	Weights&Log

Models with ImageNet-22K pretraining

name	resolution	acc@1	acc@5	acc@RealTop-1	Pretrained
ViTAE-B	224x224	84.8	\	89.9	Weights
ViTAE-L	224x224	87.5	\	90.8	Weights
ViTAE-H	224x224	88.0	\	90.7	Weights
ViTAEv2-B	224x224	86.1	97.9	89.9	Weights

The performance with few-shot learning

Fig.3 - Use 1%, 10%, and 100% data for finetuning the ViTAE variants.

Inference speed comparison with ViTAEv2 under different resolutions

name	R@224	R@448	R@896	acc@1
ViT-S	1459	318	48	79.9
ViT-B	803	167	25	81.8
T2T-ViT-14	996	220	33	81.2
T2T-ViT-24	575	118	17	82.3
Swin-T	815	246	60	81.3
ViTAEv2-S	722	205	46	82.6

For resolutions 224, 448, and 896, we use batch sizes of 128, 64, and 16, respectively, during the measurement.
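
One way to read the table is to ask how much each model slows down when the resolution goes from 224 to 896, a 16x increase in pixel count. The short sketch below computes that ratio from the R@224 and R@896 columns. The numbers are copied from the table, while the dictionary and function names are assumptions for this example. Because the batch size also differs between resolutions, the ratios are only a rough indication.

```python
# Inference speed values copied from the table above: (R@224, R@896).
SPEED = {
    "ViT-S": (1459, 48),
    "ViT-B": (803, 25),
    "T2T-ViT-14": (996, 33),
    "T2T-ViT-24": (575, 17),
    "Swin-T": (815, 60),
    "ViTAEv2-S": (722, 46),
}


def slowdown(speed_low_res, speed_high_res):
    """How many times slower the model runs at the higher resolution."""
    return speed_low_res / speed_high_res


pixel_ratio = (896 / 224) ** 2  # 16x more pixels at 896 than at 224
print(f"pixel ratio: {pixel_ratio:.0f}x")
for name, (r224, r896) in SPEED.items():
    print(f"{name:<11} {slowdown(r224, r896):5.1f}x")
```

Swin-T and ViTAEv2-S slow down by roughly 14x to 16x, close to the growth in pixel count, while the ViT and T2T-ViT models slow down by roughly 30x to 34x.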

Statement

This project is for research purposes only. For any other questions, please contact yufei.xu at outlook.com or qmzhangzz at hotmail.com.

Citing ViTAE and ViTAEv2

@article{xu2021vitae,
  title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},
  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}
@article{zhang2022vitaev2,
  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2202.10108},
  year={2022}
}
