
Commit 064675d

first commit
1 parent 0f67647 commit 064675d

17 files changed: +3266 −2 lines

.gitignore

Lines changed: 8 additions & 0 deletions

```
/.idea/
/experiments/
/wandb/
/__pycache__/
scores/

*.pt
*.sh
```

LICENSE

Lines changed: 443 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 158 additions & 2 deletions

This repository implements the model proposed in the paper:

Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen, **With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition**, BMVC, 2021

[Project webpage](https://ekazakos.github.io/MTCN-project/)

[arXiv paper](https://arxiv.org/abs/2111.01024)

## Citing

If you use this code, please cite:

```
@INPROCEEDINGS{kazakos2021MTCN,
  author={Kazakos, Evangelos and Huh, Jaesung and Nagrani, Arsha and Zisserman, Andrew and Damen, Dima},
  booktitle={British Machine Vision Conference (BMVC)},
  title={With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition},
  year={2021}}
```
## NOTE

Although we train MTCN using visual SlowFast features extracted from a model trained with 2s video clips, in Table 3 of our paper and Table 1 of the Appendix (Table 6 in the arXiv version), where we compare MTCN with the SOTA, the SlowFast results are taken from [1], where the model is trained with 1s video clips. In the following table, we provide the results of SlowFast trained with 2s clips, for a direct comparison, since this is the model we use to extract the visual features.

![Results of SlowFast trained with 2s clips](https://ekazakos.github.io/files/slowfast.jpeg)
## Requirements

The project's requirements can be installed in a separate conda environment by running the following command in your terminal: ```$ conda env create -f environment.yml```.
## Features

The extracted features for each dataset can be downloaded using the following links:

### EPIC-KITCHENS-100

* [Train](https://www.dropbox.com/s/yb9jtzq24cd2hnl/audiovisual_slowfast_features_train.hdf5?dl=0)
* [Val](https://www.dropbox.com/s/8yeb84ewd2meib8/audiovisual_slowfast_features_val.hdf5?dl=0)
* [Test](https://www.dropbox.com/s/6vifpn3qurkyf96/audiovisual_slowfast_features_test.hdf5?dl=0)

### EGTEA

* [Train-split1](https://www.dropbox.com/s/6hr994w3kkvbtv0/visual_slowfast_features_train_split1.hdf5?dl=0)
* [Test-split1](https://www.dropbox.com/s/03aa8hmflv7depe/visual_slowfast_features_test_split1.hdf5?dl=0)
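
After downloading, you can quickly inspect a feature file with `h5py`. This is a minimal sketch, not part of the repository; the local file name is a placeholder, and the printed dataset names, shapes and dtypes are simply whatever the file contains:

```
# Minimal inspection sketch (not part of this repo). The file name below is
# a placeholder; keys, shapes and dtypes are read from the file itself.
import h5py

path = "audiovisual_slowfast_features_val.hdf5"  # placeholder local path

with h5py.File(path, "r") as f:
    def show(name, obj):
        # Print every dataset's name, shape and dtype.
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)
```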

## Pretrained models

We provide pretrained models for EPIC-KITCHENS-100:

* Audio-visual transformer: [link](https://www.dropbox.com/s/vqe7esmqqwsebo6/mtcn_av_sf_epic-kitchens-100.pyth?dl=0)
* Language model: [link](https://www.dropbox.com/s/80lcnvsoq4y7tux/mtcn_lm_epic-kitchens-100.pyth?dl=0)
## Ground-truth

* The ground-truth of EPIC-KITCHENS-100 can be found in [this repository](https://github.com/epic-kitchens/epic-kitchens-100-annotations).
* The ground-truth of EGTEA, processed by us into a cleaner format, can be downloaded from the following links: [[Train-split1]](https://www.dropbox.com/s/8zxdsi13v7oy106/train_split1.pkl?dl=0) [[Test-split1]](https://www.dropbox.com/s/50bkljl71njyj46/test_split1.pkl?dl=0) [[Action mapping]](https://www.dropbox.com/s/cg0pagu2px0f6k0/actions_egtea.csv?dl=0)
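
The EGTEA pickles load as pandas DataFrames. Below is a minimal sanity-check sketch (not part of the repository); the file name is a placeholder, and the columns shown are the ones `corpus.py` in this commit reads (`video_name`, `start_frame`, `action_idx`):

```
# Minimal sanity-check sketch (not part of this repo). The file name is a
# placeholder; the columns are the ones corpus.py reads for EGTEA.
import pandas as pd

df = pd.read_pickle("train_split1.pkl")  # placeholder local path
print(len(df), "action segments")
print(df[["video_name", "start_frame", "action_idx"]].head())
```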

## Train

### EPIC-KITCHENS-100

To train the audio-visual transformer on EPIC-KITCHENS-100, run:

```
python train_av.py --dataset epic-100 --train_hdf5_path /path/to/epic-kitchens-100/features/audiovisual_slowfast_features_train.hdf5
--val_hdf5_path /path/to/epic-kitchens-100/features/audiovisual_slowfast_features_val.hdf5
--train_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_train.pkl
--val_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_validation.pkl
--batch-size 32 --lr 0.005 --optimizer sgd --epochs 100 --lr_steps 50 75 --output_dir /path/to/output_dir
--num_layers 4 -j 8 --classification_mode all --seq_len 9
```

To train the language model on EPIC-KITCHENS-100, run:

```
python train_lm.py --dataset epic-100 --train_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_train.pkl
--val_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_validation.pkl
--verb_csv /path/to/epic-kitchens-100-annotations/EPIC_100_verb_classes.csv
--noun_csv /path/to/epic-kitchens-100-annotations/EPIC_100_noun_classes.csv
--batch-size 64 --lr 0.001 --optimizer adam --epochs 100 --lr_steps 50 75 --output_dir /path/to/output_dir
--num_layers 4 -j 8 --num_gram 9 --dropout 0.1
```

### EGTEA

To train the visual-only transformer on EGTEA (EGTEA does not have audio), run:

```
python train_av.py --dataset egtea --train_hdf5_path /path/to/egtea/features/visual_slowfast_features_train_split1.hdf5
--val_hdf5_path /path/to/egtea/features/visual_slowfast_features_test_split1.hdf5
--train_pickle /path/to/EGTEA_annotations/train_split1.pkl --val_pickle /path/to/EGTEA_annotations/test_split1.pkl
--batch-size 32 --lr 0.001 --optimizer sgd --epochs 50 --lr_steps 25 38 --output_dir /path/to/output_dir
--num_layers 4 -j 8 --classification_mode all --seq_len 9
```

To train the language model on EGTEA, run:

```
python train_lm.py --dataset egtea --train_pickle /path/to/EGTEA_annotations/train_split1.pkl
--val_pickle /path/to/EGTEA_annotations/test_split1.pkl
--action_csv /path/to/EGTEA_annotations/actions_egtea.csv
--batch-size 64 --lr 0.001 --optimizer adam --epochs 50 --lr_steps 25 38 --output_dir /path/to/output_dir
--num_layers 4 -j 8 --num_gram 9 --dropout 0.1
```

## Test

### EPIC-KITCHENS-100

To test the audio-visual transformer on EPIC-KITCHENS-100, run:

```
python test_av.py --dataset epic-100 --test_hdf5_path /path/to/epic-kitchens-100/features/audiovisual_slowfast_features_val.hdf5
--test_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_validation.pkl
--checkpoint /path/to/av_model/av_checkpoint.pyth --seq_len 9 --num_layers 4 --output_dir /path/to/output_dir
--split validation
```

To obtain scores of the model on the test set, simply use ```--test_hdf5_path /path/to/epic-kitchens-100/features/audiovisual_slowfast_features_test.hdf5```,
```--test_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_test_timestamps.pkl```
and ```--split test``` instead. Since the labels for the test set are not available, the script will simply save the scores
without computing the accuracy of the model.

To evaluate your model on the validation set, follow the instructions in [this link](https://github.com/epic-kitchens/C1-Action-Recognition).
In the same link, you can find instructions for preparing the model's scores for submission to the evaluation server, in order to obtain results
on the test set.
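
If you want to peek at the saved scores before preparing a submission, the following is a minimal sketch, not part of the repository: the file name is a placeholder and the internal structure is whatever `test_av.py` actually wrote.

```
# Minimal sketch (not part of this repo): inspect a saved scores pickle.
# The path is a placeholder; the structure is whatever test_av.py produced.
import pandas as pd

scores = pd.read_pickle("audio-visual-results.pkl")  # placeholder path
print(type(scores))
if isinstance(scores, dict):
    print(list(scores.keys())[:5])
```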

Finally, to filter out improbable sequences using the LM, run:

```
python test_av_lm.py --dataset epic-100
--test_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_validation.pkl
--test_scores /path/to/audio-visual-results.pkl
--checkpoint /path/to/lm_model/lm_checkpoint.pyth
--num_gram 9 --split validation
```

Note that ```--test_scores /path/to/audio-visual-results.pkl``` points to the scores predicted by the audio-visual transformer. To obtain scores on the test set, use ```--test_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_test_timestamps.pkl```
and ```--split test``` instead.

Since we provide trained models for EPIC-KITCHENS-100, `av_checkpoint.pyth` and `lm_checkpoint.pyth` in the test scripts above can be either the provided pretrained models or `model_best.pyth`, i.e. your own trained model.

### EGTEA

To test the visual-only transformer on EGTEA, run:

```
python test_av.py --dataset egtea --test_hdf5_path /path/to/egtea/features/visual_slowfast_features_test_split1.hdf5
--test_pickle /path/to/EGTEA_annotations/test_split1.pkl
--checkpoint /path/to/v_model/model_best.pyth --seq_len 9 --num_layers 4 --output_dir /path/to/output_dir
--split test_split1
```

To filter out improbable sequences using the LM, run:

```
python test_av_lm.py --dataset egtea
--test_pickle /path/to/EGTEA_annotations/test_split1.pkl
--test_scores /path/to/visual-results.pkl
--checkpoint /path/to/lm_model/model_best.pyth
--num_gram 9 --split test_split1
```

In each case, you can extract attention weights by simply including ```--extract_attn_weights``` in the input arguments of the test script.

## References

[1] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray, **Rescaling Egocentric Vision: Collection Pipeline and Challenges for EPIC-KITCHENS-100**, IJCV, 2021

## License

The code is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, found [here](https://creativecommons.org/licenses/by-nc-sa/4.0/).

corpus.py

Lines changed: 173 additions & 0 deletions

```
import pandas as pd
import numpy as np
import torch
import random

from torch.utils.data import Dataset


class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []
        self.idx2count = {}

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
            self.idx2count[len(self.idx2word) - 1] = 0
        return self.word2idx[word]

    def add_count(self, idx):
        self.idx2count[idx] += 1

    def __len__(self):
        return len(self.idx2word)


class EpicCorpus(Dataset):
    def __init__(self, pickle_file, csvfiles, num_class, num_gram, train=True):
        self.verb_dict, self.noun_dict = Dictionary(), Dictionary()
        verb_csv, noun_csv = csvfiles[0], csvfiles[1]
        self.num_class = num_class
        self.num_gram = num_gram
        self.train = train

        assert num_gram >= 2

        # Update verb & noun dictionaries; note that the last token is the '<mask>' token
        with open(verb_csv, 'r') as f:
            lines = f.readlines()
            for line in lines[1:]:
                idx, word = int(line.split(',')[0]), line.split(',')[1]
                self.verb_dict.add_word(word)
            self.verb_dict.add_word('<mask>')

        with open(noun_csv, 'r') as f:
            lines = f.readlines()
            for line in lines[1:]:
                idx, word = int(line.split(',')[0]), line.split(',')[1]
                self.noun_dict.add_word(word)
            self.noun_dict.add_word('<mask>')
        self.verbs, self.nouns = self.tokenize(pd.read_pickle(pickle_file))

    def tokenize(self, df_labels):
        """Tokenizes an EPIC-KITCHENS annotations file."""
        # Parse the pandas file
        video_ids = sorted(list(set(df_labels['video_id'])))
        verb_idss, noun_idss = [], []

        for video_id in video_ids:
            df_video = df_labels[df_labels['video_id'] == video_id]
            df_video = df_video.sort_values(by='start_frame')
            verb_class = list(df_video['verb_class'])
            noun_class = list(df_video['noun_class'])

            for verbidx in verb_class:
                self.verb_dict.add_count(verbidx)
            for nounidx in noun_class:
                self.noun_dict.add_count(nounidx)

            # Build overlapping n-grams of consecutive actions within each video
            assert len(verb_class) == len(noun_class)
            for ii in range(len(verb_class) - self.num_gram + 1):
                verb_temp = []
                noun_temp = []
                for j in range(self.num_gram):
                    verb_temp.append(verb_class[ii + j])
                    noun_temp.append(noun_class[ii + j])
                verb_idss.append(torch.tensor(verb_temp).type(torch.int64))
                noun_idss.append(torch.tensor(noun_temp).type(torch.int64))

        verb_ids = torch.stack(verb_idss, dim=0)
        noun_ids = torch.stack(noun_idss, dim=0)

        assert verb_ids.shape[0] == noun_ids.shape[0]
        return verb_ids, noun_ids

    def __len__(self):
        return len(self.verbs)

    def __getitem__(self, index):
        verb, noun = self.verbs[index], self.nouns[index]
        verb_input, noun_input = verb.clone().detach(), noun.clone().detach()

        if self.train:
            # For training, mask a single random position (same position for verb and noun)
            verb_mask_pos = np.random.choice(list(range(self.num_gram)))
            noun_mask_pos = verb_mask_pos

            verb_input[verb_mask_pos] = self.verb_dict.word2idx['<mask>']
            noun_input[noun_mask_pos] = self.noun_dict.word2idx['<mask>']

        else:
            # For evaluation, test only the centre action
            mask_pos = self.num_gram // 2
            verb_input[mask_pos] = self.verb_dict.word2idx['<mask>']
            noun_input[mask_pos] = self.noun_dict.word2idx['<mask>']

        data = {'verb_input': verb_input, 'verb_target': verb, 'noun_input': noun_input, 'noun_target': noun}
        return data


class EgteaCorpus(Dataset):
    def __init__(self, pickle_file, csvfiles, num_class, num_gram, train=True):
        self.action_dict = Dictionary()
        self.num_class = int(num_class)
        self.num_gram = num_gram
        self.train = train
        action_csv = csvfiles[0]

        assert num_gram >= 2

        # Update the action dictionary; note that the last token is the '<mask>' token
        with open(action_csv, 'r') as f:
            lines = f.readlines()
            for line in lines:
                idx, word = int(line.split(',')[0]), line.split(',')[1]
                self.action_dict.add_word(word)
            self.action_dict.add_word('<mask>')
        self.actions = self.tokenize(pd.read_pickle(pickle_file))

    def tokenize(self, df_labels):
        """Tokenizes an EGTEA annotations file."""
        # Parse the pandas file
        video_ids = sorted(list(set(df_labels['video_name'])))
        action_idss = []

        for video_id in video_ids:
            df_video = df_labels[df_labels['video_name'] == video_id]
            df_video = df_video.sort_values(by='start_frame')
            action_class = list(df_video['action_idx'])

            for actionidx in action_class:
                self.action_dict.add_count(actionidx)

            # Build overlapping n-grams of consecutive actions within each video
            for ii in range(len(action_class) - self.num_gram + 1):
                action_temp = []
                for j in range(self.num_gram):
                    action_temp.append(action_class[ii + j])
                action_idss.append(torch.tensor(action_temp).type(torch.int64))

        action_ids = torch.stack(action_idss, dim=0)

        return action_ids

    def __len__(self):
        return len(self.actions)

    def __getitem__(self, index):
        action = self.actions[index]
        action_input = action.clone().detach()

        if self.train:
            # For training, mask a single random position
            mask_pos = np.random.choice(list(range(self.num_gram)))
            action_input[mask_pos] = self.action_dict.word2idx['<mask>']

        else:
            # For evaluation, test only the centre action
            mask_pos = self.num_gram // 2
            action_input[mask_pos] = self.action_dict.word2idx['<mask>']

        data = {'input': action_input, 'target': action}
        return data
```
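
For reference, here is a minimal usage sketch of these datasets; it is not part of the repository. The constructor arguments follow the signatures above, the paths and batch size are placeholders, and 97/300 are the standard EPIC-KITCHENS-100 verb/noun class counts.

```
# Minimal usage sketch (not part of this repo). Paths and batch size are
# placeholders; constructor arguments follow the signatures defined above.
from torch.utils.data import DataLoader

from corpus import EpicCorpus

dataset = EpicCorpus(
    pickle_file='EPIC_100_train.pkl',                  # placeholder path
    csvfiles=['EPIC_100_verb_classes.csv',
              'EPIC_100_noun_classes.csv'],            # placeholder paths
    num_class=(97, 300),  # EPIC-KITCHENS-100 verb/noun class counts (only stored here)
    num_gram=9,           # sequence length used in the README commands
    train=True)

loader = DataLoader(dataset, batch_size=64, shuffle=True)
batch = next(iter(loader))
# In training mode one random position per sequence is replaced by '<mask>';
# the *_target tensors keep the original class indices.
print(batch['verb_input'].shape, batch['verb_target'].shape)
```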
