R(2+1)D
We support 3 flavors of R(2+1)D:

- `r2plus1d_18_16_kinetics`: 18-layer R(2+1)D pre-trained on Kinetics 400 (used by default). It is identical to the torchvision implementation (see the sketch below).
- `r2plus1d_34_32_ig65m_ft_kinetics`: 34-layer R(2+1)D pre-trained on IG-65M and fine-tuned on Kinetics 400. The weights are provided by the moabitcoin/ig65m-pytorch repo for stack/step size 32.
- `r2plus1d_34_8_ig65m_ft_kinetics`: the same as the one above, but this one was pre-trained with stack/step size 8.

All models are pre-trained on RGB frames and follow the plain torchvision augmentation sequence.
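For intuition only, here is a minimal sketch of what the default 18-layer flavor corresponds to in torchvision (this is not the code the repo runs internally): the classifier head is replaced so the 512-d features before the fully-connected layer are returned.

```python
import torch
import torchvision

# Sketch only: the default flavor corresponds to torchvision's R(2+1)D-18
# pre-trained on Kinetics 400; dropping the classifier exposes 512-d features.
model = torchvision.models.video.r2plus1d_18(pretrained=True)
model.fc = torch.nn.Identity()  # keep the pooled 512-d representation
model.eval()

# A dummy clip shaped (batch, channels, frames, height, width) = (1, 3, 16, 112, 112);
# real extraction normalizes frames with the torchvision augmentation sequence.
clip = torch.rand(1, 3, 16, 112, 112)
with torch.no_grad():
    feats = model(clip)
print(feats.shape)  # torch.Size([1, 512])
```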
Info
The flavors that were pre-trained on IG-65M and fine-tuned on Kinetics 400 yield significantly better performance than the default model (e.g. the 32-frame model reaches an accuracy of 79.10 vs. 57.50 for the default).
By default (`model_name=r2plus1d_18_16_kinetics`), the model expects a stack of 16 RGB frames (112x112) as input, which spans 0.64 seconds of a video recorded at 25 fps. In the default case, the features will be of size `Tv x 512`, where `Tv = duration / 0.64`. Specify `model_name`, `step_size`, and `stack_size` to change the default behavior.
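As a quick sanity check (an illustration, not repo code), the expected number of feature vectors can be estimated from the video length and the default window settings:

```python
# Assumed defaults for r2plus1d_18_16_kinetics: 16-frame windows on a 25-fps video.
stack_size = 16        # frames per window
fps = 25               # assumed frame rate of the input video
duration_sec = 10.0    # hypothetical video duration

window_sec = stack_size / fps                   # 0.64 seconds per feature
num_features = int(duration_sec // window_sec)  # Tv = duration / 0.64
print(f"expected feature shape: ({num_features}, 512)")  # (15, 512) here
```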
Set up the Environment for R(2+1)D
Set up the `conda` environment. Requirements are listed in `conda_env.yml`.

```bash
# it will create a new conda environment called 'video_features' on your machine
conda env create -f conda_env.yml
```
Quick Start
Activate the environment

```bash
conda activate video_features
```

and extract features from the `./sample/v_GGSY1Qvo990.mp4` video and show the predicted classes:

```bash
python main.py \
    feature_type=r21d \
    video_paths="[./sample/v_GGSY1Qvo990.mp4]" \
    show_pred=true
```
Supported Arguments
| Argument | Default | Description |
| --- | --- | --- |
| `model_name` | `"r2plus1d_18_16_kinetics"` | A variant of R(2+1)D. `"r2plus1d_18_16_kinetics"`, `"r2plus1d_34_32_ig65m_ft_kinetics"`, and `"r2plus1d_34_8_ig65m_ft_kinetics"` are supported. |
| `stack_size` | `null` | The number of frames from which to extract features (or window size). If omitted, the value used during the training of the selected `model_name` is applied. |
| `step_size` | `null` | The number of frames to step before extracting the next features. If omitted, the value used during the training of the selected `model_name` is applied. |
| `extraction_fps` | `null` | If specified (e.g. as `5`), the video will be re-encoded to the `extraction_fps` frame rate. Leave it unspecified or `null` to skip re-encoding. |
| `device` | `"cuda:0"` | The device specification. It follows the PyTorch style. Use `"cuda:3"` for the 4th GPU on the machine or `"cpu"` for CPU-only. |
| `video_paths` | `null` | A list of videos for feature extraction. E.g. `"[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"` or just one path `"./sample/v_GGSY1Qvo990.mp4"`. |
| `file_with_video_paths` | `null` | A path to a text file with video paths (one path per line). Hint: given a folder `./dataset` with `.mp4` files, one could use: `find ./dataset -name "*mp4" > ./video_paths.txt`. |
| `on_extraction` | `print` | If `print`, the features are printed to the terminal. If `save_numpy` or `save_pickle`, the features are saved to a `.npy` or `.pkl` file, respectively (see the loading sketch below the table). |
| `output_path` | `"./output"` | A path to a folder for storing the extracted features (if `on_extraction` is either `save_numpy` or `save_pickle`). |
| `keep_tmp_files` | `false` | If `true`, the re-encoded videos are kept in `tmp_path`. |
| `tmp_path` | `"./tmp"` | A path to a folder for storing temporary files (e.g. re-encoded videos). |
| `show_pred` | `false` | If `true`, the script prints the model's predictions on a downstream task. It is useful for debugging. |
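If `on_extraction` is set to `save_numpy`, the saved features can be inspected with NumPy. A minimal sketch, assuming features were saved to the default `./output` folder; the exact file name depends on the input video and the repo version, so the path below is a placeholder:

```python
import numpy as np

# Hypothetical file name: check the contents of `output_path` for the real one.
feats = np.load("./output/v_GGSY1Qvo990_r21d.npy")
print(feats.shape)  # expected to be (Tv, 512) for the default 18-layer flavor
```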
Example
Start by activating the environment

```bash
conda activate video_features
```

The following command extracts R(2+1)D features for two sample videos using the default parameters:

```bash
python main.py \
    feature_type=r21d \
    device="cuda:0" \
    video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"
```
Here is an example with one of the 34-layer R(2+1)D flavors that were pre-trained on IG-65M and then fine-tuned on Kinetics 400 (`r2plus1d_34_8_ig65m_ft_kinetics`):

```bash
python main.py \
    feature_type=r21d \
    model_name="r2plus1d_34_8_ig65m_ft_kinetics" \
    device="cuda:0" \
    video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"
```
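For reference, the moabitcoin/ig65m-pytorch repo also exposes these 34-layer weights via `torch.hub`. The snippet below is a hedged sketch based on that repo's documented entry points; the entry-point name and `num_classes` value are assumptions to verify against its README, and this is not how this repo loads the weights internally.

```python
import torch

# Assumed torch.hub entry point from moabitcoin/ig65m-pytorch for the
# 32-frame model fine-tuned on Kinetics 400 (400 classes); verify the
# exact names against that repo before relying on this.
model = torch.hub.load(
    "moabitcoin/ig65m-pytorch",
    "r2plus1d_34_32_kinetics",
    num_classes=400,
    pretrained=True,
)
model.eval()
```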
See the config file for other supported parameters. Note that this implementation of R(2+1)D only supports the RGB stream.
Credits
- The TorchVision implementation.
- The R(2+1)D paper: A Closer Look at Spatiotemporal Convolutions for Action Recognition.
- Thanks to @ohjho, we now also support the flavors of the 34-layer model pre-trained on IG-65M and fine-tuned on Kinetics 400.
- A shout-out to the devs of moabitcoin/ig65m-pytorch, who adapted the weights of these flavors from Caffe to PyTorch.
- The paper where these flavors were presented: Large-scale weakly-supervised pre-training for video action recognition.
License
The wrapping code is released under the MIT license; however, it utilizes the `torchvision` library, which is distributed under the BSD 3-Clause "New" or "Revised" License.