Reinforcement Learning meets Masked Video Modeling : Trajectory-Guided Adaptive Token Selection (ICCVW 2025)

This is the official PyTorch implementation of Reinforcement Learning meets Masked Video Modeling : Trajectory-Guided Adaptive Token Selection, accepted at the ICCV 2025 Long-Video Foundations Workshop.

In this work, we introduce a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS), which models the motion dynamics of tokens and can be seamlessly integrated into the masked autoencoder (MAE) framework to select motion-centric tokens in videos. Additionally, we propose a unified training strategy that enables joint optimization of both MAE and TATS from scratch using Proximal Policy Optimization (PPO).
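
As a rough illustration of what selecting tokens from a learned distribution can look like, the sketch below samples a set of visible tokens from per-token scores using a temperature-scaled softmax. It is not the TATS implementation: the function names, the hand-written scores, and the sequential sampling-without-replacement scheme are assumptions made for illustration. In the actual method the scores would come from the trajectory-aware sampler network.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_visible_tokens(scores, mask_ratio, temperature=1.0, rng=None):
    """Sample indices of tokens to keep visible (hypothetical helper).

    Tokens with higher scores are more likely to be kept. The number of
    visible tokens is round(N * (1 - mask_ratio)), an assumed convention.
    """
    rng = rng or random.Random()
    probs = softmax(scores, temperature)
    num_visible = max(1, round(len(scores) * (1.0 - mask_ratio)))
    candidates = list(range(len(scores)))
    # Small floor keeps every weight strictly positive for rng.choices.
    weights = [max(p, 1e-12) for p in probs]
    visible = []
    for _ in range(num_visible):
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        visible.append(candidates.pop(pick))
        weights.pop(pick)
    return sorted(visible)


if __name__ == "__main__":
    toy_scores = [0.1 * i for i in range(20)]  # made-up motion scores
    kept = sample_visible_tokens(toy_scores, mask_ratio=0.9,
                                 rng=random.Random(0))
    print("visible token indices:", kept)
```

A lower temperature concentrates the probability mass on the highest-scoring tokens, while a higher one moves the selection closer to uniform random masking.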

Method

Installation

The code has been tested on PyTorch 1.11.0. For details on the other prerequisite libraries, see INSTALL.md.

To install the Trajectory Attention dependencies, refer to Motionformer.

Data Preparation

For dataset preparation, refer to DATASET.md.

The dataset list file (UCF101 shown as an example) should look like the following. The first column is the file path and the second column is the label_id, separated by a tab:

UCF101/videos/MoppingFloor/v_MoppingFloor_g20_c04.avi	55
UCF101/videos/HammerThrow/v_HammerThrow_g16_c02.avi	36
UCF101/videos/FrontCrawl/v_FrontCrawl_g23_c04.avi	32
UCF101/videos/CricketBowling/v_CricketBowling_g17_c02.avi	23
UCF101/videos/SkyDiving/v_SkyDiving_g13_c02.avi	83
UCF101/videos/Diving/v_Diving_g10_c01.avi	26
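
A list file in this format can be read with a few lines of Python. The sketch below is illustrative only: `read_annotations` is a hypothetical helper, not a function from this repository, and it assumes every non-empty line holds exactly a path and an integer label separated by a tab.

```python
import csv
import io


def read_annotations(lines):
    """Parse tab-separated 'path<TAB>label_id' lines into (path, label) pairs.

    `lines` can be an open text file or any iterable of strings.
    """
    samples = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated fields, got {row!r}")
        path, label = row
        samples.append((path, int(label)))
    return samples


if __name__ == "__main__":
    # In-memory stand-in for a list file, using two rows from the README.
    example = io.StringIO(
        "UCF101/videos/MoppingFloor/v_MoppingFloor_g20_c04.avi\t55\n"
        "UCF101/videos/HammerThrow/v_HammerThrow_g16_c02.avi\t36\n"
    )
    for path, label in read_annotations(example):
        print(label, path)
```

With a real list file, pass an open file handle (opened with `newline=""`, as the `csv` module recommends) instead of the in-memory example.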

Download the list of video names used from here.

Scripts

All the scripts to run the code are in the scripts folder. It contains 4 folders corresponding to the 4 datasets: UCF101, HMDB51, Kinetics-400, and Something-Something-V2.

For Pretraining

To pretrain on UCF101 with mask ratio = 0.9:

bash scripts/ucf101/adaptive_vidmae_ppo_vit_base_patch16_224_learnable_masking_ratio_0.9_epoch_200/pretrain.sh

pretrain.sh looks like:

# Set the path to save checkpoints
OUTPUT_DIR='pretrain/adaptive_vidmae_ppo_ucf101_sampled_mask_ratio_0.9'
# Set the path to UCF101 train set. 
DATA_PATH='/home/ayushrai/UCF101/train.csv'

# batch_size can be adjusted according to the number of GPUs
# this script launches 2 processes (1 node x 2 GPUs); change --nproc_per_node to match your GPU count

### mask_type = learnable and mask_ratio = 0.9 ###
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=2 \
        --master_port 12320 \
        pretrain_mae_vit_ppo.py \
        --data_path ${DATA_PATH} \
        --mask_type learnable \
        --mask_ratio 0.9 \
        --model adaptive_pretrain_videomae_base_patch16_224 \
        --decoder_depth 4 \
        --batch_size 32 \
        --num_frames 16 \
        --sampling_rate 2 \
        --opt adamw \
        --opt_betas 0.9 0.95 \
        --warmup_epochs 40 \
        --save_ckpt_freq 5 \
        --epochs 201 \
        --log_dir ${OUTPUT_DIR} \
        --output_dir ${OUTPUT_DIR} \
        --traj_attention True \
        --num_traj_attn_blocks 1 \
        --num_epochs_train_mae_only 10 \
        --policy_loss_coefficient 1e-4 \
        --value_loss_coefficient 1e-4 \
        --entropy_coefficient 1e-4 \
        --update_steps_mae 1 \
        --update_steps_ppo 1 \
        --softmax_temp 1.0

The following arguments have been added:

--traj_attention : True or False (whether to use Trajectory Attention).
--num_traj_attn_blocks : number of trajectory attention blocks.
--num_epochs_train_mae_only : number of initial epochs during which the standard VidMAE is trained with the reconstruction loss only.
--policy_loss_coefficient : coefficient for the policy loss term in the PPO objective (c1).
--value_loss_coefficient : coefficient for the value loss term in the PPO objective (c2).
--entropy_coefficient : coefficient for the entropy term in the PPO objective (c3).
--update_steps_mae : number of MAE update steps, during which {state, action, reward, value} tuples are collected in the memory buffer. This is also the size of the memory buffer.
--update_steps_ppo : number of PPO update steps, each of which samples from the memory buffer and computes the PPO objective.
--softmax_temp : softmax temperature.
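
To make the roles of these arguments concrete, here is a minimal sketch of how they could fit together: a step schedule driven by --num_epochs_train_mae_only, --update_steps_mae and --update_steps_ppo, and a clipped PPO loss weighted by c1, c2 and c3. Everything in it is an assumption for illustration rather than the repository's code: the function names, the clip range of 0.2 (a commonly used value), the sign convention, and the choice to empty the buffer after each PPO phase.

```python
def training_plan(num_epochs, steps_per_epoch, mae_only_epochs,
                  update_steps_mae, update_steps_ppo):
    """Return a list of (epoch, phase) entries, phase being "mae" or "ppo"."""
    plan = []
    buffered = 0  # transitions collected since the last PPO phase
    for epoch in range(num_epochs):
        for _ in range(steps_per_epoch):
            plan.append((epoch, "mae"))
            if epoch < mae_only_epochs:
                continue  # warm-up: reconstruction loss only, no PPO
            buffered += 1
            if buffered == update_steps_mae:
                plan.extend((epoch, "ppo") for _ in range(update_steps_ppo))
                buffered = 0  # assumed: buffer is emptied after PPO
    return plan


def ppo_loss(ratios, advantages, values, returns, entropies,
             c1=1e-4, c2=1e-4, c3=1e-4, clip_eps=0.2):
    """Weighted PPO loss over a batch given as plain Python lists.

    ratios:     new_prob / old_prob for each sampled action
    advantages: advantage estimate for each action
    values:     value predictions; returns: their regression targets
    entropies:  policy entropy per sample
    """
    n = len(ratios)
    surrogate = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate += min(ratio * adv, clipped * adv)
    policy_loss = -surrogate / n  # maximize the clipped surrogate
    value_loss = sum((v - g) ** 2 for v, g in zip(values, returns)) / n
    entropy = sum(entropies) / n
    # The entropy term is subtracted so that higher entropy lowers the loss.
    total = c1 * policy_loss + c2 * value_loss - c3 * entropy
    return total, policy_loss, value_loss, entropy


if __name__ == "__main__":
    for epoch, phase in training_plan(3, 2, 1, 2, 1):
        print(epoch, phase)
    print(ppo_loss([1.5], [1.0], [0.5], [1.0], [0.3]))
```

In this sketch the total would be added to the MAE reconstruction loss or optimized in its own step; which of the two the repository does is not stated here.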

For Finetuning

To finetune on UCF101, run finetune.sh:

bash scripts/ucf101/adaptive_vidmae_ppo_vit_base_patch16_224_learnable_masking_ratio_0.9_epoch_200/finetune.sh

For Linear Evaluation

To run linear evaluation on UCF101, run linear_probing.sh:

bash scripts/ucf101/adaptive_vidmae_ppo_vit_base_patch16_224_learnable_masking_ratio_0.9_epoch_200/linear_probing.sh

Mask Visualization

To visualise the mask, run the following command:

python visualise_mask.py \
        --img_path <Video_File_Path> \
        --save_path <Save_Output_Path> \
        --model_path <Model_Checkpoint_Path> \
        --mask_type learnable \
        --mask_ratio 0.95 \
        --sampling_rate 2 \
        --traj_attention False \
        --softmax_temp 1.0 \
        --num_epochs_train_mae_only 10

Citation

Please consider citing this work if you find it useful.

@article{rai2025reinforcement,
  title={Reinforcement Learning meets Masked Video Modeling: Trajectory-Guided Adaptive Token Selection},
  author={Rai, Ayush K and Min, Kyle and Krishna, Tarun and Hu, Feiyan and Smeaton, Alan F and O'Connor, Noel E},
  journal={arXiv preprint arXiv:2505.08561},
  year={2025}
}

📚 References

[1] Tong, Z., et al. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS 2022. Paper.

[2] Bandara, W., et al. AdaMAE: Adaptive Masking for Efficient Video Masked Autoencoding. CVPR 2023. Paper.

[3] Patrick, M., et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021. Paper.

[4] Schulman, J., et al. Proximal Policy Optimization Algorithms. arXiv 2017. Paper.

