R(2+1)D
 
We support 3 flavors of R(2+1)D:
- r2plus1d_18_16_kinetics – an 18-layer R(2+1)D pre-trained on Kinetics 400 (used by default) – it is identical to the torchvision implementation
- r2plus1d_34_32_ig65m_ft_kinetics – a 34-layer R(2+1)D pre-trained on IG-65M and fine-tuned on Kinetics 400 – the weights are provided by the moabitcoin/ig65m-pytorch repo for stack/step size 32
- r2plus1d_34_8_ig65m_ft_kinetics – the same as the one above, but this one was pre-trained with stack/step size 8

The models are pre-trained on RGB frames and follow the plain torchvision augmentation sequence.
Info
The flavors that were pre-trained on IG-65M and fine-tuned on Kinetics 400 yield
significantly better performance than the default model
(e.g. the 32-frame model reaches an accuracy of 79.10 vs. 57.50 for the default model).
By default (model_name=r2plus1d_18_16_kinetics), the model expects a stack of 16 RGB frames (112x112) as input,
which spans 0.64 seconds of a video recorded at 25 fps.
In the default case, the features will be of size Tv x 512, where Tv = duration / 0.64.
Specify model_name, stack_size, and step_size to change this default behavior.
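For example, here is a sketch of a run that overrides the defaults with the 8-frame IG-65M flavor and a smaller window and step (the specific stack_size and step_size values are illustrative, not recommendations):
python main.py \
    feature_type=r21d \
    model_name="r2plus1d_34_8_ig65m_ft_kinetics" \
    video_paths="./sample/v_GGSY1Qvo990.mp4" \
    stack_size=8 \
    step_size=8  # these window/step values are illustrative, not recommendations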
Quick Start
Ensure that the environment is properly set up before proceeding. See Setup Environment for detailed instructions.
Activate the environment, then extract features from the ./sample/v_GGSY1Qvo990.mp4 video and show the predicted classes.
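A minimal sketch of such a run is shown below; the environment name video_features is an assumption based on the Setup Environment instructions, so adjust it if yours differs:
conda activate video_features  # assumed environment name from Setup Environment
python main.py \
    feature_type=r21d \
    device="cuda:0" \
    video_paths="./sample/v_GGSY1Qvo990.mp4" \
    show_pred=true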
Supported Arguments
| Argument | Default | Description | 
|---|---|---|
| model_name | "r2plus1d_18_16_kinetics" | A variant of R(2+1)D. "r2plus1d_18_16_kinetics", "r2plus1d_34_32_ig65m_ft_kinetics", and "r2plus1d_34_8_ig65m_ft_kinetics" are supported. |
| stack_size | null | The number of frames from which to extract features (or window size). If omitted, it will respect the config of model_name during training. |
| step_size | null | The number of frames to step before extracting the next features. If omitted, it will respect the config of model_name during training. |
| extraction_fps | null | If specified (e.g. as 5), the video will be re-encoded to the extraction_fps fps. Leave unspecified or null to skip re-encoding. |
| device | "cuda:0" | The device specification. It follows the PyTorch style. Use "cuda:3" for the 4th GPU on the machine or "cpu" for CPU-only. |
| video_paths | null | A list of videos for feature extraction. E.g. "[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]" or just one path "./sample/v_GGSY1Qvo990.mp4". |
| file_with_video_paths | null | A path to a text file with video paths (one path per line). Hint: given a folder ./dataset with .mp4 files, one could use: find ./dataset -name "*mp4" > ./video_paths.txt. |
| on_extraction | print | If print, the features are printed to the terminal. If save_numpy or save_pickle, the features are saved to a .npy or .pkl file (see the example after this table). |
| output_path | "./output" | A path to a folder for storing the extracted features (if on_extraction is either save_numpy or save_pickle). |
| keep_tmp_files | false | If true, the re-encoded videos will be kept in tmp_path. |
| tmp_path | "./tmp" | A path to a folder for storing temporary files (e.g. re-encoded videos). |
| show_pred | false | If true, the script will print the model's predictions on a downstream task. It is useful for debugging. |
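For instance, to save the features as .npy files for every video listed in a text file, a run could look like the following sketch (./video_paths.txt is a hypothetical file, e.g. produced with the find hint from the table):
python main.py \
    feature_type=r21d \
    on_extraction=save_numpy \
    output_path="./output" \
    file_with_video_paths=./video_paths.txt  # hypothetical file with one video path per line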
Example
Make sure the environment is set up correctly. For instructions, refer to Setup Environment.
Start by activating the environment. The command below will extract R(2+1)D features for two sample videos using the default parameters.
python main.py \
    feature_type=r21d \
    device="cuda:0" \
    video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"
Here is an example with r2plus1d_34_32_ig65m_ft_kinetics, the 34-layer R(2+1)D model
that was pre-trained on IG-65M and then fine-tuned on Kinetics 400
python main.py \
    feature_type=r21d \
    model_name="r2plus1d_34_32_ig65m_ft_kinetics" \
    device="cuda:0" \
    video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"
See the config file for other supported parameters. Note that this implementation of R(2+1)D only supports the RGB stream.
Credits
- The TorchVision implementation.
- The R(2+1)D paper: A Closer Look at Spatiotemporal Convolutions for Action Recognition.
- Thanks to @ohjho, we now also support the flavors of the 34-layer model pre-trained on IG-65M and fine-tuned on Kinetics 400.
- A shout-out to the devs of moabitcoin/ig65m-pytorch who adapted the weights of these flavors from Caffe to PyTorch.
- The paper where these flavors were presented: Large-scale weakly-supervised pre-training for video action recognition.
 
License
The wrapping code is under MIT, but it utilizes the torchvision library, which is under the BSD 3-Clause "New" or "Revised" License.