R(2+1)D
 
We support 3 flavors of R(2+1)D:
- r2plus1d_18_16_kinetics – an 18-layer R(2+1)D pre-trained on Kinetics 400 (used by default) – it is identical to the torchvision implementation
- r2plus1d_34_32_ig65m_ft_kinetics – a 34-layer R(2+1)D pre-trained on IG-65M and fine-tuned on Kinetics 400 – the weights are provided by the moabitcoin/ig65m-pytorch repo for stack/step size 32
- r2plus1d_34_8_ig65m_ft_kinetics – the same as the one above, but this one was pre-trained with stack/step size 8

The models are pre-trained on RGB frames and follow the plain torchvision augmentation sequence.
Info
The flavors that were pre-trained on IG-65M and fine-tuned on Kinetics 400 yield
significantly better performance than the default model
(e.g. the 32-frame model reaches an accuracy of 79.10 vs. 57.50 for the default model).
By default (model_name=r2plus1d_18_16_kinetics), the model expects a stack of 16 RGB frames (112x112) as input,
which spans 0.64 seconds of a video recorded at 25 fps.
In the default case, the features will be of size Tv x 512, where Tv = duration / 0.64.
Specify model_name, stack_size, and step_size to change this default behavior.
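For example, here is a sketch of a run that overrides the defaults with the 8-frame IG-65M flavor and a smaller window and step (the specific stack_size and step_size values are illustrative, not recommendations):
python main.py \
    feature_type=r21d \
    model_name="r2plus1d_34_8_ig65m_ft_kinetics" \
    video_paths="./sample/v_GGSY1Qvo990.mp4" \
    stack_size=8 \
    step_size=8  # these window/step values are illustrative, not recommendations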
Quick Start
Ensure that the environment is properly set up before proceeding. See Setup Environment for detailed instructions.
Activate the environment, then extract features from the ./sample/v_GGSY1Qvo990.mp4 video and show the predicted classes.
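A minimal sketch of such a run is shown below; the environment name video_features is an assumption based on the Setup Environment instructions, so adjust it if yours differs:
conda activate video_features  # assumed environment name from Setup Environment
python main.py \
    feature_type=r21d \
    device="cuda:0" \
    video_paths="./sample/v_GGSY1Qvo990.mp4" \
    show_pred=true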
Supported Arguments
| Argument | Default | Description | 
|---|---|---|
| model_name | "r2plus1d_18_16_kinetics" | A variant of R(2+1)D. "r2plus1d_18_16_kinetics", "r2plus1d_34_32_ig65m_ft_kinetics", and "r2plus1d_34_8_ig65m_ft_kinetics" are supported. |
| stack_size | null | The number of frames from which to extract features (or window size). If omitted, it will respect the config of model_name during training. |
| step_size | null | The number of frames to step before extracting the next features. If omitted, it will respect the config of model_name during training. |
| extraction_fps | null | If specified (e.g. as 5), the video will be re-encoded to the extraction_fps fps. Leave unspecified or null to skip re-encoding. |
| device | "cuda:0" | The device specification. It follows the PyTorch style. Use "cuda:3" for the 4th GPU on the machine or "cpu" for CPU-only. |
| video_paths | null | A list of videos for feature extraction. E.g. "[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]" or just one path "./sample/v_GGSY1Qvo990.mp4". |
| file_with_video_paths | null | A path to a text file with video paths (one path per line). Hint: given a folder ./dataset with .mp4 files, one could use: find ./dataset -name "*mp4" > ./video_paths.txt. |
| on_extraction | print | If print, the features are printed to the terminal. If save_numpy or save_pickle, the features are saved to a .npy or .pkl file (see the example after this table). |
| output_path | "./output" | A path to a folder for storing the extracted features (if on_extraction is either save_numpy or save_pickle). |
| keep_tmp_files | false | If true, the re-encoded videos will be kept in tmp_path. |
| tmp_path | "./tmp" | A path to a folder for storing temporary files (e.g. re-encoded videos). |
| show_pred | false | If true, the script will print the model's predictions on a downstream task. It is useful for debugging. |
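For instance, to save the features as .npy files for every video listed in a text file, a run could look like the following sketch (./video_paths.txt is a hypothetical file, e.g. produced with the find hint from the table):
python main.py \
    feature_type=r21d \
    on_extraction=save_numpy \
    output_path="./output" \
    file_with_video_paths=./video_paths.txt  # hypothetical file with one video path per line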
Example
Make sure the environment is set up correctly. For instructions, refer to Setup Environment.
Start by activating the environment. The command below will extract R(2+1)D features for two sample videos using the default parameters.
python main.py \
    feature_type=r21d \
    device="cuda:0" \
    video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"
Here is an example with r2plus1d_34_32_ig65m_ft_kinetics, the 34-layer R(2+1)D model
that was pre-trained on IG-65M and then fine-tuned on Kinetics 400
python main.py \
    feature_type=r21d \
    model_name="r2plus1d_34_32_ig65m_ft_kinetics" \
    device="cuda:0" \
    video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"
See the config file for other supported parameters. Note that this implementation of R(2+1)D only supports the RGB stream.
Credits
- The TorchVision implementation.
- The R(2+1)D paper: A Closer Look at Spatiotemporal Convolutions for Action Recognition.
- Thanks to @ohjho, we now also support the flavors of the 34-layer model pre-trained on IG-65M and fine-tuned on Kinetics 400.
- A shout-out to the devs of moabitcoin/ig65m-pytorch who adapted the weights of these flavors from Caffe to PyTorch.
- The paper where these flavors were presented: Large-scale weakly-supervised pre-training for video action recognition.
 
License
The wrapping code is under MIT, but it utilizes the torchvision library, which is under the BSD 3-Clause "New" or "Revised" License.