CLIP
The CLIP features are extracted from each frame of the provided video.
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs.
We use CLIP's official augmentations and extract vision features with its image encoder.
The implementation uses the official OpenAI CLIP code.
The extracted features will have size num_frames x 512.
We additionally output timestamps in ms for each feature and the fps of the video.
Set up the Environment for CLIP
Set up the conda environment. Requirements are in the file conda_env.yml
# it will create a new conda environment called 'video_features' on your machine
conda env create -f conda_env.yml
Quick Start
Activate the environment
conda activate video_features
and extract features at 1 fps from the video ./sample/v_GGSY1Qvo990.mp4, printing the results:
python main.py \
feature_type="clip" \
model_name="ViT-B/32" \
extraction_fps=1 \
video_paths="[./sample/v_GGSY1Qvo990.mp4]" \
on_extraction="print"
Supported Arguments
| Argument | Default | Description |
| --- | --- | --- |
| model_name | "ViT-B/32" | A variant of CLIP. "ViT-B/16", "RN50x16", "RN50x4", "RN101", "RN50", and "custom" are supported. |
| batch_size | 1 | You may speed up feature extraction by increasing the batch size as much as your GPU permits. |
| extraction_fps | null | If specified (e.g. as 5), the video will be re-encoded to extraction_fps fps. Leave unspecified or null to skip re-encoding. |
| device | "cuda:0" | The device specification. It follows the PyTorch style. Use "cuda:3" for the 4th GPU on the machine or "cpu" for CPU-only. |
| video_paths | null | A list of videos for feature extraction. E.g. "[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]" or just one path "./sample/v_GGSY1Qvo990.mp4". |
| file_with_video_paths | null | A path to a text file with video paths (one path per line). Hint: given a folder ./dataset with .mp4 files, one could use: find ./dataset -name "*mp4" > ./video_paths.txt (see also the sketch after this table). |
| on_extraction | print | If print, the features are printed to the terminal. If save_numpy or save_pickle, the features are saved to a .npy or a .pkl file, respectively. |
| output_path | "./output" | A path to a folder for storing the extracted features (if on_extraction is either save_numpy or save_pickle). |
| keep_tmp_files | false | If true, the re-encoded videos will be kept in tmp_path. |
| tmp_path | "./tmp" | A path to a folder for storing temporary files (e.g. re-encoded videos). |
| show_pred | false | If true, the script will print the predictions of the model on a downstream task. It is useful for debugging. |
| pred_texts | null | If show_pred=true, the texts specified in pred_texts are used for zero-shot classification (e.g. pred_texts="['a dog smiles', 'a woman is lifting']"). If pred_texts is unspecified, Kinetics 400 classes will be used. |
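As an alternative to the find hint for file_with_video_paths above, the same list can be produced with a few lines of Python. This is only a sketch: the folder ./dataset and the output file ./video_paths.txt are placeholders taken from the hint, not names the script requires.

# a minimal sketch: collect all .mp4 files under ./dataset into a text file,
# one path per line, as expected by file_with_video_paths
from pathlib import Path

video_paths = sorted(Path('./dataset').rglob('*.mp4'))
with open('./video_paths.txt', 'w') as f:
    for p in video_paths:
        f.write(f'{p}\n')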
Examples
Start by activating the environment
conda activate video_features
It is pretty much the same procedure as with other features.
Here we take ViT-B/32 as an example, but we also support ViT-B/16, RN50x16, RN50x4, RN101, RN50, and other models from the OpenAI CLIP implementation.
In addition, if you want to use your own weights, copy them to models/clip/checkpoints, rename the file to CLIP-custom.pth, and specify model_name=custom.
python main.py \
feature_type="clip" \
model_name="ViT-B/32" \
device="cuda:0" \
video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"
If you would like to save the features, use on_extraction=save_numpy (or save_pickle). By default, the features are saved in ./output/ or wherever output_path points.
In the case of frame-wise features, besides the features themselves, the timestamps in ms and the original fps of the video are also saved into the same folder as the features.
python main.py \
feature_type="clip" \
model_name="ViT-B/32" \
device="cuda:0" \
on_extraction=save_numpy \
file_with_video_paths=./sample/sample_video_paths.txt
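Once the extraction has finished, you can sanity-check the saved arrays with NumPy. The exact file names and folder layout inside output_path are an assumption here (the glob below simply picks up any .npy file whose name contains the sample video's stem); adjust it to whatever your run actually produces.

# a minimal sketch: load whatever .npy files were saved for the sample video
from pathlib import Path
import numpy as np

for npy_path in sorted(Path('./output').rglob('*v_GGSY1Qvo990*.npy')):
    arr = np.load(npy_path)
    # expect (num_frames, 512) for the features, (num_frames,) for timestamps_ms,
    # and a scalar for the fps
    print(npy_path.name, arr.shape)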
We may increase the extraction speed with batching. Therefore, frame-wise feature extractors have a batch_size argument, which defaults to 1.
python main.py \
feature_type="clip" \
model_name="ViT-B/32" \
device="cuda:0" \
batch_size=128 \
video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"
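How large batch_size can go depends on your GPU memory. If you are unsure what your card offers, a quick PyTorch check such as the following can help (this is a generic snippet, not part of the extraction script):

# print the name and total memory of the first CUDA device
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f'{props.name}: {props.total_memory / 1024**3:.1f} GiB total')
else:
    print('No CUDA device found; extraction will run on CPU.')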
If you would like to extract features at a certain fps, add the extraction_fps argument:
python main.py \
feature_type="clip" \
model_name="ViT-B/32" \
device="cuda:0" \
extraction_fps=5 \
video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"
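As a rough sanity check, the number of feature vectors and the spacing of the returned timestamps follow directly from extraction_fps and the video length, assuming frames are sampled uniformly after re-encoding. The numbers below are hypothetical:

# back-of-the-envelope check for extraction_fps=5 (values are hypothetical)
extraction_fps = 5
duration_sec = 30.0                                    # length of your video
approx_num_feats = int(duration_sec * extraction_fps)  # ~150 feature vectors
step_ms = 1000 / extraction_fps                        # ~200 ms between consecutive timestamps
print(approx_num_feats, step_ms)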
If you would like to verify the extracted features, you can set show_pred="true"
and provide several sentences via the pred_texts argument.
The value of pred_texts should be a list of strings.
For each frame, the script will output the probability that the frame corresponds to each of the sentences you provide.
python main.py \
feature_type="clip" \
model_name="ViT-B/32" \
device="cuda:0" \
extraction_fps=1 \
show_pred="true" \
pred_texts="['a dog smiles', 'a woman is lifting']" \
video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"
You will get output like the following for each frame:
Logits | Prob. | Label
23.061 | 0.962 | a dog smiles
19.824 | 0.038 | a woman is lifting
Logits | Prob. | Label
22.770 | 0.963 | a dog smiles
19.520 | 0.037 | a woman is lifting
Logits | Prob. | Label
24.619 | 0.929 | a dog smiles
22.048 | 0.071 | a woman is lifting
...
Logits | Prob. | Label
30.966 | 1.000 | a woman is lifting
15.272 | 0.000 | a dog smiles
Logits | Prob. | Label
32.671 | 1.000 | a woman is lifting
15.413 | 0.000 | a dog smiles
Logits | Prob. | Label
32.555 | 1.000 | a woman is lifting
16.151 | 0.000 | a dog smiles
...
You may also leave pred_texts unspecified or null (None) if you wish to apply CLIP for zero-shot prediction on Kinetics 400.
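For reference, the per-frame probabilities above follow the standard CLIP zero-shot recipe: each frame is embedded with the image encoder, the sentences with the text encoder, and a softmax is taken over the scaled image-text logits. Below is a minimal sketch of the same kind of computation using the OpenAI clip package directly (outside of this repository's code path); frame.jpg stands for a hypothetical extracted frame.

# zero-shot scoring of one frame against two sentences with the OpenAI clip package
import torch
import clip
from PIL import Image

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

image = preprocess(Image.open('frame.jpg')).unsqueeze(0).to(device)  # hypothetical frame
text = clip.tokenize(['a dog smiles', 'a woman is lifting']).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # probabilities over the two sentences for this frame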
Credits
- The OpenAI CLIP implementation.
- The CLIP paper
- Thanks to @Kamino666, who adapted this model for video_features
License
The OpenAI CLIP implementation code is released under the MIT license.