Figure. The proposed Multi-modal Dense Video Captioning (MDVC) framework. Given an input consisting of several modalities, namely audio, speech, and visual, a corresponding feature transformer produces an internal representation for each modality (middle). The features are then fused in the multi-modal generator (right), which outputs a distribution over the vocabulary.
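As a rough illustration of the fusion step, the sketch below concatenates per-modality representations and projects them to a vocabulary distribution. The module name `MultiModalGenerator`, the layer sizes, and the simple concatenate-then-project fusion are assumptions for illustration, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn


class MultiModalGenerator(nn.Module):
    """Fuses per-modality representations and outputs a distribution over the vocabulary.

    A minimal sketch: fusion by concatenation followed by two linear layers is an
    assumption; the paper's generator may fuse the modalities differently.
    """

    def __init__(self, d_model: int, n_modalities: int, vocab_size: int):
        super().__init__()
        self.fuse = nn.Linear(n_modalities * d_model, d_model)
        self.to_vocab = nn.Linear(d_model, vocab_size)

    def forward(self, modality_outputs):
        # modality_outputs: list of (batch, seq, d_model) tensors,
        # one per modality (audio, speech, visual).
        fused = torch.relu(self.fuse(torch.cat(modality_outputs, dim=-1)))
        # Distribution over the vocabulary for each caption position.
        return torch.softmax(self.to_vocab(fused), dim=-1)
```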
Figure. The feature transformer architecture, which consists of an encoder (bottom part) and a decoder (top part). The encoder takes as input pre-processed and position-encoded features (from I3D in the case of the visual modality) and outputs an internal representation. The decoder, in turn, is conditioned on both the position-encoded caption generated so far and the output of the encoder. Finally, the decoder outputs its internal representation.
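The sketch below illustrates one such per-modality feature transformer using standard PyTorch encoder and decoder stacks. The class name `FeatureTransformer`, the sinusoidal positional encoding, and hyper-parameters such as `d_model`, `n_heads`, and `n_layers` are illustrative assumptions rather than the paper's exact configuration.

```python
import math
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Adds sinusoidal position encodings to a (batch, seq, d_model) tensor."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):
        return x + self.pe[: x.size(1)]


class FeatureTransformer(nn.Module):
    """Encoder over modality features (e.g. I3D) plus a decoder over the caption so far."""

    def __init__(self, feat_dim: int, vocab_size: int, d_model: int = 512,
                 n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, d_model)      # pre-processing of raw features
        self.word_emb = nn.Embedding(vocab_size, d_model)  # caption token embeddings
        self.pos_enc = PositionalEncoding(d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)

    def forward(self, feats, caption_so_far):
        # Encoder: position-encoded modality features -> internal representation.
        memory = self.encoder(self.pos_enc(self.feat_proj(feats)))
        # Decoder: conditioned on the position-encoded caption generated so far
        # and on the encoder output; returns its internal representation.
        tgt = self.pos_enc(self.word_emb(caption_so_far))
        seq_len = tgt.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        return self.decoder(tgt, memory, tgt_mask=causal)
```

The decoder output of each modality's feature transformer would then be passed to the multi-modal generator sketched above.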