Reinforcement Learning meets Masked Video Modeling : Trajectory-Guided Adaptive Token Selection (ICCVW 2025)
This is the official PyTorch implementation of Reinforcement Learning meets Masked Video Modeling : Trajectory-Guided Adaptive Token Selection, accepted at the ICCV 2025 Long-Video Foundations Workshop.
In this work, we introduce a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS), which models the motion dynamics of tokens and can be seamlessly integrated into the masked autoencoder (MAE) framework to select motion-centric tokens in videos. Additionally, we propose a unified training strategy that enables joint optimization of both MAE and TATS from scratch using Proximal Policy Optimization (PPO).
The code has been tested on PyTorch 1.11.0. For details on the other prerequisite libraries, see INSTALL.md.
To install the Trajectory Attention dependencies, refer to Motionformer.
For dataset preparation, refer to DATASET.md.
The dataset annotation file (UCF101 is shown as an example) should look like the following. The first column is the file path and the second column is the label_id, separated by a tab:
UCF101/videos/MoppingFloor/v_MoppingFloor_g20_c04.avi 55
UCF101/videos/HammerThrow/v_HammerThrow_g16_c02.avi 36
UCF101/videos/FrontCrawl/v_FrontCrawl_g23_c04.avi 32
UCF101/videos/CricketBowling/v_CricketBowling_g17_c02.avi 23
UCF101/videos/SkyDiving/v_SkyDiving_g13_c02.avi 83
UCF101/videos/Diving/v_Diving_g10_c01.avi 26
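As a minimal sketch of how such a listing can be read, the snippet below splits each line on its last run of whitespace, so it works whether the separator is a tab or a space. The function name `parse_annotation_lines` is hypothetical and is not part of this repository; the repository's own data loader may parse the file differently.

```python
from typing import Iterable, List, Tuple


def parse_annotation_lines(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Turn 'filepath<TAB>label_id' lines into (filepath, label_id) pairs.

    Hypothetical helper for illustration only. Splitting from the right keeps
    any whitespace inside the file path intact.
    """
    samples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        path, label = line.rsplit(None, 1)
        samples.append((path, int(label)))
    return samples


example = [
    "UCF101/videos/MoppingFloor/v_MoppingFloor_g20_c04.avi\t55",
    "UCF101/videos/HammerThrow/v_HammerThrow_g16_c02.avi\t36",
]
for path, label_id in parse_annotation_lines(example):
    print(label_id, path)
```

To check a real annotation file, pass an open file handle instead of the example list, for instance `parse_annotation_lines(open(data_path))`, where `data_path` is whatever you set as DATA_PATH.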
Download the list of video names used from here.
All the scripts to run the code are in the scripts folder. It contains 4 folders, one for each of the 4 datasets: UCF101, HMDB51, Kinetics-400, and Something-Something-V2.
To pretrain on UCF101 with mask ratio = 0.9:
bash scripts/ucf101/adaptive_vidmae_ppo_vit_base_patch16_224_learnable_masking_ratio_0.9_epoch_200/pretrain.sh
pretrain.sh looks like:
# Set the path to save checkpoints
OUTPUT_DIR='pretrain/adaptive_vidmae_ppo_ucf101_sampled_mask_ratio_0.9'
# Set the path to UCF101 train set.
DATA_PATH='/home/ayushrai/UCF101/train.csv'
# batch_size can be adjusted according to the number of GPUs
# this command launches 2 processes on 1 node (--nproc_per_node=2); set --nproc_per_node to your GPU count
### mask_type = learnable and mask_ratio = 0.9 ###
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=2 \
--master_port 12320 \
pretrain_mae_vit_ppo.py \
--data_path ${DATA_PATH} \
--mask_type learnable \
--mask_ratio 0.9 \
--model adaptive_pretrain_videomae_base_patch16_224 \
--decoder_depth 4 \
--batch_size 32 \
--num_frames 16 \
--sampling_rate 2 \
--opt adamw \
--opt_betas 0.9 0.95 \
--warmup_epochs 40 \
--save_ckpt_freq 5 \
--epochs 201 \
--log_dir ${OUTPUT_DIR} \
--output_dir ${OUTPUT_DIR} \
--traj_attention True \
--num_traj_attn_blocks 1 \
--num_epochs_train_mae_only 10 \
--policy_loss_coefficient 1e-4 \
--value_loss_coefficient 1e-4 \
--entropy_coefficient 1e-4 \
--update_steps_mae 1 \
--update_steps_ppo 1 \
--softmax_temp 1.0
The following arguments have been added:
--traj_attention : True or False (whether to use Trajectory Attention).
--num_traj_attn_blocks : number of trajectory attention blocks.
--num_epochs_train_mae_only : number of initial epochs during which only the standard VidMAE is trained, using the reconstruction loss.
--policy_loss_coefficient : coefficient of the policy loss term in the PPO objective (c1).
--value_loss_coefficient : coefficient of the value loss term in the PPO objective (c2).
--entropy_coefficient : coefficient of the entropy term in the PPO objective (c3).
--update_steps_mae : number of MAE update steps, during which {state, action, reward, value} tuples are collected in the memory buffer. This is also the size of the memory buffer.
--update_steps_ppo : number of PPO update steps, each of which samples from the memory buffer and computes the PPO objective.
--softmax_temp : softmax temperature.
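To make the role of these flags more concrete, here is a rough sketch of how the three coefficients could weight a clipped PPO objective, and how the MAE-only warm-up and the two update-step counts could interleave. It is written in plain Python for readability and is not the code in pretrain_mae_vit_ppo.py. The function names, the clipping range `clip_eps`, the sign conventions, the 0-indexed epoch, and the exact interleaving order are all assumptions taken from common PPO implementations.

```python
import math
from typing import List, Sequence


def ppo_objective(new_log_probs: Sequence[float], old_log_probs: Sequence[float],
                  advantages: Sequence[float], values: Sequence[float],
                  returns: Sequence[float], entropies: Sequence[float],
                  c1: float = 1e-4,  # --policy_loss_coefficient
                  c2: float = 1e-4,  # --value_loss_coefficient
                  c3: float = 1e-4,  # --entropy_coefficient
                  clip_eps: float = 0.2) -> float:  # assumed; not a flag in pretrain.sh
    """Weighted PPO loss to minimise (assumed form, for illustration)."""
    n = len(advantages)
    policy_loss = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        policy_loss += -min(ratio * adv, clipped * adv)  # clipped surrogate
    policy_loss /= n
    value_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n
    entropy = sum(entropies) / n
    # The entropy term is subtracted so that higher entropy lowers the loss.
    return c1 * policy_loss + c2 * value_loss - c3 * entropy


def update_plan(epoch: int, steps_in_epoch: int, mae_only_epochs: int = 10,
                update_steps_mae: int = 1, update_steps_ppo: int = 1) -> List[str]:
    """Assumed order of updates within one (0-indexed) epoch."""
    if epoch < mae_only_epochs:
        return ["mae"] * steps_in_epoch  # reconstruction loss only
    plan: List[str] = []
    for step in range(1, steps_in_epoch + 1):
        plan.append("mae")  # MAE update; the rollout is stored in the buffer
        if step % update_steps_mae == 0:
            plan.extend(["ppo"] * update_steps_ppo)  # PPO updates on the buffer
    return plan


print(update_plan(epoch=0, steps_in_epoch=3))
print(update_plan(epoch=10, steps_in_epoch=2))
```

With the defaults in pretrain.sh (both step counts set to 1), this sketch alternates one MAE update with one PPO update once the first 10 epochs are over.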
To finetune on UCF101, run finetune.sh:
bash scripts/ucf101/adaptive_vidmae_ppo_vit_base_patch16_224_learnable_masking_ratio_0.9_epoch_200/finetune.sh
To run linear probing on UCF101, run linear_probing.sh:
bash scripts/ucf101/adaptive_vidmae_ppo_vit_base_patch16_224_learnable_masking_ratio_0.9_epoch_200/linear_probing.sh
To visualise the mask, run the following command:
python visualise_mask.py \
--img_path <Video_File_Path> \
--save_path <Save_Output_Path> \
--model_path <Model_Checkpoint_Path> \
--mask_type learnable \
--mask_ratio 0.95 \
--sampling_rate 2 \
--traj_attention False \
--softmax_temp 1.0 \
--num_epochs_train_mae_only 10
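For intuition about what --mask_ratio and --softmax_temp control when a mask is sampled, the sketch below turns per-token scores into probabilities with a temperature-scaled softmax and then draws the visible tokens without replacement. It is an illustration only, not the sampler used by visualise_mask.py: the function names, the use of `int(mask_ratio * num_tokens)` to count masked tokens, and the sequential sampling scheme are assumptions.

```python
import math
import random
from typing import List, Sequence


def softmax_with_temperature(scores: Sequence[float], temperature: float) -> List[float]:
    """Softmax of scores / temperature; lower temperatures sharpen the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_visible_tokens(scores: Sequence[float], mask_ratio: float,
                          temperature: float = 1.0, seed: int = 0) -> List[int]:
    """Sample the indices of the tokens that stay visible (hypothetical helper)."""
    num_tokens = len(scores)
    num_visible = num_tokens - int(mask_ratio * num_tokens)  # assumed rounding
    weights = softmax_with_temperature(scores, temperature)
    rng = random.Random(seed)
    remaining = list(range(num_tokens))
    visible = []
    for _ in range(num_visible):
        # Draw one remaining token in proportion to its probability, then remove it.
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        visible.append(remaining.pop(pick))
        weights.pop(pick)
    return sorted(visible)


scores = [0.1 * i for i in range(8)]  # made-up token scores
print(sample_visible_tokens(scores, mask_ratio=0.75, temperature=1.0))
```

In this sketch a higher mask ratio leaves fewer visible tokens, while a lower temperature concentrates the sampling on the highest-scoring tokens.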
Please consider citing this work if you find it useful.
@article{rai2025reinforcement,
title={Reinforcement Learning meets Masked Video Modeling: Trajectory-Guided Adaptive Token Selection},
author={Rai, Ayush K and Min, Kyle and Krishna, Tarun and Hu, Feiyan and Smeaton, Alan F and O'Connor, Noel E},
journal={arXiv preprint arXiv:2505.08561},
year={2025}
}
[1] Tong, Z., et al. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS 2022.
[2] Bandara, W., et al. AdaMAE: Adaptive Masking for Efficient Video Masked Autoencoding. CVPR 2023.
[3] Patrick, M., et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. NeurIPS 2021.
[4] Schulman, J., et al. Proximal Policy Optimization Algorithms. arXiv 2017.


