# A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

Tampere University
Overview

Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting visual information alone, while completely neglecting the audio track.

To address this, we present the Bi-modal Transformer with Proposal Generator (BMT), which efficiently uses audio and visual input sequences to select events in a video and then uses these clips to generate a textual description.

Audio and visual features are encoded with VGGish and I3D, while caption tokens are encoded with GloVe. First, VGGish and I3D features are passed through a stack of N bi-modal encoder layers, where the audio and visual sequences are encoded to form what we call audio-attended visual and visual-attended audio features. These features are passed to the bi-modal multi-headed proposal generator, which generates a set of proposals using information from both modalities.

Then, the input features are trimmed according to the proposed segments and encoded in the bi-modal encoder again. The stack of N bi-modal decoder layers takes two inputs: a) GloVe embeddings of the previously generated caption sequence and b) the internal representation from the last layer of the encoder for both modalities. The decoder produces its own internal representation, which the generator then uses to model the distribution over the vocabulary for the next caption word.
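The trimming step can be pictured as slicing each feature sequence by time. Below is a minimal sketch, assuming each feature covers a fixed, non-overlapping time span (0.96 seconds for VGGish and 2.56 seconds for I3D). The function name `trim_features` and the rounding rule are illustrative assumptions, not the released implementation.

```python
import math

def trim_features(features, start_sec, end_sec, span_sec):
    """Keep the features whose time span overlaps [start_sec, end_sec].

    Assumes feature i covers [i * span_sec, (i + 1) * span_sec).
    """
    first = max(0, math.floor(start_sec / span_sec))
    last = min(len(features), math.ceil(end_sec / span_sec))
    return features[first:last]

# Dummy placeholders standing in for real feature vectors of a ~30 s video.
audio = list(range(31))   # 31 VGGish features, 0.96 s each
visual = list(range(12))  # 12 I3D features, 2.56 s each

# Trim both streams to a hypothetical proposal spanning 5 s to 12 s.
audio_clip = trim_features(audio, 5.0, 12.0, span_sec=0.96)
visual_clip = trim_features(visual, 5.0, 12.0, span_sec=2.56)
```

Note that the same proposal yields clips of different lengths in the two streams, which the bi-modal encoder is designed to accept.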

Next, I will introduce the essential parts of the architecture, in particular, Bi-modal Encoder, Bi-modal Decoder, and Bi-modal Proposal Generator. I will assume that you are familiar with the concepts of Transformer and YOLO.

Bi-modal Encoder

We use the Bi-modal Encoder in two situations: as the encoder in the Bi-modal Transformer, which serves as the captioning module, and as a feature extractor for the Proposal Generation Module. Consequently, it takes a whole video as input when used for proposal generation and only a trimmed segment when used for captioning.

Given VGGish and I3D features from a video, Bi-modal Encoder outputs visual-attended audio and audio-attended visual features. These features are going to be useful for the proposal generator and decoder.

Our encoder consists of N layers. Specifically, a bi-modal encoder layer has three sub-layers: 1) Self-attention, 2) Bi-modal Attention, and 3) Position-wise Fully-connected network.

Compared to Vanilla Transformer's Encoder, Bi-modal Encoder can fuse features from two modalities. To this end, we propose the Bi-modal Attention block. You may think of it as the encoder-decoder attention from the Vanilla Transformer's Decoder, where keys and values are taken from the other modality. The other two sub-layers are similar to those of the Vanilla Transformer's Encoder.

Recall that all blocks inside the Vanilla Transformer have the same hidden dimension. This choice is convenient for the vanilla transformer because both source and target are text sequences. In contrast, an application may require the source and target to be of different kinds, e.g., visual-to-text or, in our case, audio-visual-to-text. More precisely, the features in our application are 128-d for the audio (VGGish) and 1024-d for the visual (I3D) modality. Therefore, we allow the bi-modal attention to handle inputs of different dimensions.

Besides the distinct dimension of each feature, the number of features in a sequence can also differ between modalities. For instance, audio and visual features might each cover a different time span. In our case, it is 0.96 seconds for VGGish and 2.56 seconds for I3D. By design, the bi-modal attention block accounts for this dissimilarity and allows the input sequences to have different lengths.
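To make the handling of different dimensions and lengths concrete, here is a minimal single-head sketch in NumPy that computes visual-attended audio features. It is an illustration under stated assumptions, not the authors' implementation: the shared attention dimension `d_model`, the random projection matrices, and the choice to map values back to the query dimension are all hypothetical, and multi-head splitting, residual connections, and normalization are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bimodal_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention across two modalities.

    queries: (T_q, d_q) features of the modality being updated
    context: (T_c, d_c) features of the other modality (keys and values)
    """
    q = queries @ w_q  # (T_q, d_model)
    k = context @ w_k  # (T_c, d_model)
    v = context @ w_v  # (T_c, d_q), assumed to map back to the query dimension
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T_q, T_c)
    return weights @ v, weights

rng = np.random.default_rng(0)
audio = rng.normal(size=(10, 128))   # 10 VGGish features, 128-d each
visual = rng.normal(size=(4, 1024))  # 4 I3D features, 1024-d each

d_model = 64  # hypothetical shared attention dimension
w_q = rng.normal(size=(128, d_model)) * 0.01
w_k = rng.normal(size=(1024, d_model)) * 0.01
w_v = rng.normal(size=(1024, 128)) * 0.01

visual_attended_audio, weights = bimodal_attention(audio, visual, w_q, w_k, w_v)
```

The output keeps the length and dimension of the audio sequence, while each audio position mixes information from all four visual positions. Swapping the roles of `audio` and `visual` (with matching projection shapes) would give audio-attended visual features.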

Bi-modal Decoder

The Bi-modal Decoder takes two inputs: a) GloVe embeddings of the previously generated tokens (roughly, words) and b) the two-stream output of the Nth layer of the Bi-modal Encoder. It outputs refined features that account for all positions of both encoder streams and for all previous positions of the caption sequence. The output of the decoder is then used to select the next token (a word).
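The token-by-token generation described above can be sketched as a simple loop. The sketch assumes greedy selection (always taking the top-scoring token), which is one common choice and not necessarily the decoding strategy used in BMT. The callable `next_token_scores` is a hypothetical stand-in for the decoder and generator together, and the toy model exists only to make the example runnable.

```python
def greedy_decode(next_token_scores, start_token, end_token, max_len=20):
    """Generate a caption one token at a time.

    next_token_scores is a hypothetical callable standing in for the
    decoder plus generator: it maps the tokens produced so far to a
    dict {token: score} over the vocabulary.
    """
    tokens = [start_token]
    for _ in range(max_len):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # greedy: pick the top-scoring token
        tokens.append(best)
        if best == end_token:
            break
    return tokens

# Toy stand-in model that always continues a fixed caption.
caption = ["<s>", "a", "man", "plays", "guitar", "</s>"]

def toy_scores(tokens):
    target = caption[len(tokens)]
    return {word: (1.0 if word == target else 0.0) for word in caption}

result = greedy_decode(toy_scores, "<s>", "</s>")
```

In the real model, the scores at each step would depend on both encoder streams as well as on the previously generated tokens.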

The decoder has N layers, and each layer consists of four sub-layers: 1) Self-attention, 2) Bi-modal Attention, 3) Bridge, and 4) Position-wise Fully-connected network.

Compared to Vanilla Transformer's Decoder, Bi-modal Decoder generalizes the encoder-decoder attention to the bi-modal domain. Similar to the encoder, the decoder uses the Bi-modal Attention block to attend to each of the two input streams individually. The outputs of the attention blocks are fused in the Bridge: the bridge layer concatenates both inputs and maps the result to a dimension half the size of the concatenation. The other two sub-layers are similar to those of the original transformer architecture. As in the encoder, the bi-modal attention in the decoder allows the inputs to have distinct dimensions (GloVe embeddings are 300-d each).
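As an illustration of the Bridge, the sketch below concatenates two attention outputs along the feature axis and projects the result back to half of the concatenated size. It assumes both streams have the same shape `(T, d)` at this point and uses a plain linear map with random weights; any activation, dropout, or normalization in the actual layer is left out, and the names are hypothetical.

```python
import numpy as np

def bridge(audio_stream, visual_stream, weight, bias):
    """Fuse two (T, d) streams: concatenate to (T, 2d), then project to (T, d)."""
    fused = np.concatenate([audio_stream, visual_stream], axis=-1)  # (T, 2d)
    return fused @ weight + bias                                    # (T, d)

rng = np.random.default_rng(0)
T, d = 7, 300  # 7 caption positions, 300-d features (GloVe size)

caption_attending_audio = rng.normal(size=(T, d))
caption_attending_visual = rng.normal(size=(T, d))

weight = rng.normal(size=(2 * d, d)) * 0.01  # hypothetical learned projection
bias = np.zeros(d)

fused = bridge(caption_attending_audio, caption_attending_visual, weight, bias)
```

After the bridge, the decoder is back to a single stream, so the remaining position-wise network can operate as in a standard transformer layer.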

Bi-modal Proposal Generator

We apply the Bi-modal Transformer to the dense video captioning task, which requires first localizing the significant events and then producing a caption for each of them. For the event localization stage, we introduce the Bi-modal Multi-headed Proposal Generator, which generates a set of proposals for a video.

The proposal generator takes as input the audio-attended visual and visual-attended audio features generated for a full video by the bi-modal encoder. These features are processed by the corresponding stacks of proposal generation heads. Each head has a distinct receptive field $$k$$ and consists of three 1-D conv layers. At every position $$t_\star$$ and for each anchor, the head predicts the center coordinate shift $$\sigma(c)$$, the anchor length coefficient $$l$$, and the objectness score $$\sigma(o)$$. The predictions from all heads form a common pool of predictions. The proposals are sorted by confidence score and passed back to clip the input for the captioning module (see the figure in Overview).
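The sketch below shows one way the three predicted values could be turned into time segments and pooled. It assumes a YOLO-style decoding rule, in which the center is the grid position plus $$\sigma(c)$$ and the length is the anchor length scaled by $$\exp(l)$$. The exponential scaling, the conversion to seconds via `span_sec`, and all numbers are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_proposal(t, c, l, o, anchor_len, span_sec):
    """Turn raw head outputs at grid position t into (start, end, score).

    Assumed decoding: center = t + sigmoid(c), length = anchor_len * exp(l).
    anchor_len is given in seconds; span_sec is the duration of one feature.
    """
    center = (t + sigmoid(c)) * span_sec
    length = anchor_len * math.exp(l)
    score = sigmoid(o)
    return (max(0.0, center - length / 2), center + length / 2, score)

# Hypothetical raw predictions pooled from several heads:
# (position t, c, l, o, anchor length in seconds)
raw_predictions = [
    (0, 0.0, 0.0, -1.0, 4.0),
    (3, 0.0, 0.0, 2.0, 10.0),
]

proposals = [
    decode_proposal(t, c, l, o, anchor, span_sec=2.56)
    for t, c, l, o, anchor in raw_predictions
]
# Most confident proposals first.
proposals.sort(key=lambda p: p[2], reverse=True)
```

Because $$\sigma(c)$$ lies between 0 and 1, each predicted center stays within the grid cell it was predicted from.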

The design of the proposal generation head is partly inspired by YOLO. Unlike YOLO, we preserve the sequence length across all layers (no pooling is used). Moreover, YOLO combines predictions from three different scales to detect objects of different sizes, so it uses only three receptive-field sizes. Instead, our model makes predictions at a single scale while controlling the receptive field with a kernel size $$k$$, which is distinct in each proposal generation head.
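The effect of the kernel size on the receptive field, with the sequence length left unchanged, can be seen in a toy single-layer example. This is a pure-Python sketch with zero padding and all-ones kernels. The kernel sizes 1, 3, and 5 are hypothetical, and a real head stacks three learned conv layers.

```python
def conv1d_same(sequence, kernel):
    """1-D cross-correlation with zero padding that keeps the sequence length.

    Assumes an odd kernel size so that the padding is symmetric.
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(sequence) + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence))
    ]

features = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # a single spike at position 3

lengths = {}
touched = {}
for k in (1, 3, 5):  # hypothetical kernel sizes, one per head
    out = conv1d_same(features, [1.0] * k)
    lengths[k] = len(out)                            # output length per kernel size
    touched[k] = sum(1 for x in out if x != 0.0)     # positions that see the spike
```

Every output has the same length as the input, while the number of output positions influenced by the spike grows with the kernel size. This is how a head with a larger $$k$$ covers a longer temporal context at a single scale.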

Citation
@InProceedings{BMT_Iashin_2020,
title={A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer},
}