The generation of visually relevant, high-quality sounds is a longstanding challenge in deep learning. Solving it would allow sound designers to spend less time searching large foley databases for the sound that matches a specific video scene. Despite promising results in [1, 2, 3, 4], generating long (10+ seconds), high-quality audio samples that cover a large variety of visual scenes remains a challenge. The goal of this work is to bridge this gap.
To achieve this, we propose to tame visually guided sound generation by shrinking a training dataset of audio spectrograms to a set of representative vectors, i.e. a codebook. Similar to word tokens in language modeling, these codebook vectors can be used by a transformer to sample a representation that can easily be decoded into a spectrogram. To employ visual information during sampling, we represent video frames as tokens and initialize the target sequence with them. This allows us to sample the next codebook token given the visual information and the previously generated codebook tokens. Once autoregressive sampling is done, we remove the visual tokens from the sequence and reuse the pre-trained decoder of the codebook autoencoder to decode the sampled sequence of codebook tokens into a sound spectrogram.
The natural format of audio is a waveform, a 1D signal sampled at tens of thousands of Hz. One option is to train a codebook directly on such waveforms and then sample the codes from it. Using this approach, OpenAI's Jukebox could generate high-quality musical performances conditioned on lyrics, style, and artist. Despite truly impressive results, the sampling speed is rather slow (1 hour per 10 seconds of generated audio).
Therefore, for efficiency, we operate on spectrograms, a condensed 2D representation of an audio signal that can be easily obtained from a raw waveform and inverted back into it. This also allows us to treat sound like images and draw on architectural elements from conditional image generation. In this work, we train a codebook on spectrograms similar to VQGAN (a variant of VQVAE), which proved to possess strong reconstruction power from small-scale codebook representations when applied to RGB images.
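To make this concrete, here is a minimal sketch of the waveform-to-spectrogram round trip using librosa (the test tone and the STFT/mel parameters are illustrative and do not match our exact preprocessing):

```python
import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
wav = 0.5 * np.sin(2 * np.pi * 440.0 * t).astype(np.float32)   # 1-second 440 Hz test tone

# Waveform -> mel spectrogram (2D: mel bins x time frames); log-compress it for modeling.
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))

# Mel spectrogram -> waveform (approximate inversion, Griffin-Lim under the hood).
wav_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
```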
Once we can reliably reconstruct a spectrogram from a small-scale codebook representation, we can train a model to sample a novel representation from the pre-trained codebook given visual conditioning. This representation can then be decoded into a new spectrogram relevant to the visual input.
Next, we describe both the spectrogram codebook and the codebook sampler in detail.
The Spectrogram Codebook \(\mathcal{Z}\) is trained in an autoencoder with a quantized bottleneck:
The goal of the autoencoder is to reliably reconstruct the input spectrogram \(x\) from a small-scale codebook (quantized) representation \(z_\mathbf{q}.\) The codebook representation \(z_\mathbf{q}\) is obtained from the encoded representation \(\hat{z}\) just by looking up the closest elements from the codebook \(\mathcal{Z}.\) Both the Codebook Encoder (\(E\)) and Codebook Decoder (\(G\)) are generic 2D Conv stacks.
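The lookup itself is a nearest-neighbour search in the space of codebook entries. A minimal PyTorch sketch of this quantization step (tensor shapes and names are illustrative assumptions, not the exact ones used in our models):

```python
import torch

def quantize(z_hat: torch.Tensor, codebook: torch.Tensor):
    """Replace each encoded vector with its closest codebook entry.

    z_hat:    (B, C, H, W) encoder output
    codebook: (K, C) codebook Z with K entries of dimension C
    """
    B, C, H, W = z_hat.shape
    flat = z_hat.permute(0, 2, 3, 1).reshape(-1, C)           # (B*H*W, C)
    dists = torch.cdist(flat, codebook)                       # pairwise distances to Z
    indices = dists.argmin(dim=1)                             # codebook indices s
    z_q = codebook[indices].reshape(B, H, W, C).permute(0, 3, 1, 2)
    return z_q, indices.reshape(B, H, W)
```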
The training of the codebook is guided by four losses, two of which are traditional VQVAE losses: the codebook and reconstruction losses. The other two are inherited from the VQGAN architecture: a patch-based adversarial loss and a perceptual loss (LPIPS). The latter two were shown to enable reconstruction from smaller-scale codebook representations (see VQGAN).
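Schematically, the objective combines these four terms as follows (a sketch of the standard VQVAE/VQGAN-style formulation; the stop-gradient \(\mathrm{sg}[\cdot]\), the commitment weight \(\beta\), and the loss weighting are not spelled out in this section and follow the original papers):

\[
\mathcal{L} \;=\;
\underbrace{\mathcal{L}_{\text{rec}}(x, \hat{x})}_{\text{reconstruction}}
\;+\;
\underbrace{\|\mathrm{sg}[\hat{z}] - z_\mathbf{q}\|_2^2 + \beta\,\|\hat{z} - \mathrm{sg}[z_\mathbf{q}]\|_2^2}_{\text{codebook}}
\;+\;
\underbrace{\mathcal{L}_{\text{adv}}}_{\text{patch-based adversarial}}
\;+\;
\underbrace{\mathcal{L}_{\text{perc}}}_{\text{perceptual}}.
\]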
Since the LPIPS loss relies on features of an ImageNet-pretrained classifier (VGG-16), it is not reasonable to expect it to help guide the generation of spectrograms. The closest relative of VGG-16 in audio classification is VGGish. However, we cannot make use of it because it: a) operates on spectrograms that are 10 times shorter than those in our application, and b) lacks the depth, and thus the downsampling operations, needed to extract the large-scale features that are useful for separating real and fake spectrograms. We therefore train a variant of the VGGish architecture from scratch, referred to as VGGish-ish, on the VGGSound dataset. The new perceptual loss based on VGGish-ish is called LPAPS.
Transformers [1, 2, 3] have shown incredible success in sequence modeling. Once the codebook representation is reformulated as a sequence, sampling from the codebook can benefit from the expressivity of a transformer. Such an approach has been shown to produce high-quality RGB images conditioned on a depth map, a semantic map, or a lower-resolution image in VQGAN. In this work, we propose to train such a transformer to bridge two modalities, namely vision and audio.
We train a variant of GPT-2 to predict the next codebook index given a visual condition in the form of visual tokens (features). The loss (cross-entropy) is calculated by comparing the sampled codebook indices \(\hat{s} = \{\hat{s}_j\}_{j=1}^{K}\) to the indices corresponding to the ground-truth spectrogram codebook representation \(z_\mathbf{q}\) (see the codebook figure ☝️).
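A compressed sketch of this training step (the `transformer` call signature, `visual_tokens`, and `gt_indices` are illustrative assumptions; in particular, we assume the model returns next-token logits for the K audio positions only):

```python
import torch
import torch.nn.functional as F

def training_step(transformer, visual_tokens, gt_indices):
    """One hypothetical training step of the cross-modal sampler.

    visual_tokens: (B, T_v, D) frame-wise video features forming the conditioning prefix.
    gt_indices:    (B, K)      codebook indices of the ground-truth spectrogram.
    """
    # The prefix's last position predicts the first audio token; each ground-truth
    # token predicts the next one, so the logits align with gt_indices: (B, K, |Z|).
    logits = transformer(prefix=visual_tokens, tokens=gt_indices[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), gt_indices.reshape(-1))
```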
At test time, the transformer \(M\) autoregressively samples a sequence token by token, primed with the visual conditioning (frame-wise video features \(\hat{\mathcal{F}}\)).
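Written out as the standard autoregressive factorization (only \(\hat{s}\), \(K\), \(\hat{\mathcal{F}}\), and \(M\) come from the description above; the exact formulation in the paper may differ in detail):

\[
p_M\big(\hat{s} \mid \hat{\mathcal{F}}\big) \;=\; \prod_{j=1}^{K} p_M\big(\hat{s}_j \mid \hat{s}_{<j}, \hat{\mathcal{F}}\big),
\qquad
\hat{s}_j \sim p_M\big(\,\cdot \mid \hat{s}_{<j}, \hat{\mathcal{F}}\big).
\]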
Once the sampling of tokens is done, we cut out the visual tokens from the generated sequence and replace the predicted codebook indices \(\hat{s}\) with vectors from the codebook \(\mathcal{Z}\) to form the codebook representation \(\hat{z}_\mathbf{q}.\) We reshape the sequence \(\hat{s}\) into the 2D representation \(\hat{z}_\mathbf{q}\) in a column-major way. Then, we decode this representation into a spectrogram with the pre-trained codebook decoder \(G.\) The generated spectrogram \(\hat{x}_\mathcal{F}\) can finally be transformed into a waveform \(\hat{w}\) with a pre-trained spectrogram vocoder \(V\), i.e. \(\hat{w} = V(\hat{x}_\mathcal{F}).\)
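Putting the pieces together, a minimal PyTorch sketch of this post-processing (the `codebook`, `decoder_G`, and `vocoder_V` handles are hypothetical placeholders for the pre-trained components):

```python
import torch

def decode_sampled_tokens(sampled_indices, codebook, decoder_G, vocoder_V, H, W):
    """sampled_indices: (K,) codebook indices, visual tokens already removed, K = H * W."""
    z_q = codebook[sampled_indices]                     # (K, C): look up codebook vectors
    # Column-major reshape into the 2D grid: the first H indices fill the first column.
    z_q = z_q.reshape(W, H, -1).permute(1, 0, 2)        # (H, W, C)
    z_q = z_q.permute(2, 0, 1).unsqueeze(0)             # (1, C, H, W) for the conv decoder
    spec = decoder_G(z_q)                               # spectrogram x_hat_F
    return vocoder_V(spec)                              # waveform w_hat
```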
Overall, the Vision-based Conditional Cross-modal Sampler also includes the pretrained decoder from the codebook autoencoder \(G\) and a pretrained Spectrogram Vocoder \(V:\)
The most popular methods for vocoding a spectrogram are the Griffin-Lim algorithm and WaveNet. The Griffin-Lim algorithm is fast and can be applied to an open-domain dataset, but the quality of the reconstructed waveform is unsatisfactory. WaveNet, in contrast, generates high-quality results but at the cost of sampling speed (20+ minutes for a 10-second sample on a GPU). Therefore, we train a MelGAN vocoder from scratch on an open-domain dataset (VGGSound). This allows us to transform spectrograms into waveforms in a fraction of a second on a CPU.
Human evaluation of content generation models is an expensive and tedious procedure. In the image generation field, this problem is bypassed with automatic evaluation of fidelity using a family of metrics based on an ImageNet-pretrained Inception model. The most popular Inception-based metrics are the Inception Score and the Fréchet and Kernel Inception Distances (FID and KID for short). However, automatic evaluation of a sound generation model remains an open question. In addition, our application requires the model to produce not only high-quality but also visually relevant samples. To this end, we propose a family of metrics for fidelity (quality) and relevance evaluation based on a novel architecture called Melception, a variant of Inception, trained as a classifier on a large-scale open-domain dataset (VGGSound).
For Fidelity evaluation of generated spectrograms, Melception can be directly used to calculate the whole range of Inception-based metrics, including the Inception Score as well as FID and KID. For automatic Visual Relevance evaluation, however, there are no well-defined metrics in the existing literature. To design one, we hypothesize that the class distribution of a generated spectrogram given a condition should be close to the class distribution of the original spectrogram for that condition. The class distributions for fake and real spectrograms can be obtained from a pre-trained spectrogram classifier. To this end, we rely on the Melception-based Kullback–Leibler divergence (MKL) as a measure of "distance" between the two class distributions and average it over all samples in the dataset.
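In code, this amounts to a KL divergence between the class posteriors that Melception assigns to a paired fake and real spectrogram, averaged over the dataset. A minimal sketch (the `melception` handle is a placeholder for the pre-trained classifier, and the direction of the divergence is an illustrative choice):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def average_mkl(melception, fake_specs, real_specs):
    """Average KL between the class distributions of paired real and fake spectrograms."""
    log_p_fake = F.log_softmax(melception(fake_specs), dim=1)   # (N, num_classes)
    p_real = F.softmax(melception(real_specs), dim=1)           # (N, num_classes)
    # KL(P_real || P_fake) per sample, then averaged over the dataset.
    kl = F.kl_div(log_p_fake, p_real, reduction="none").sum(dim=1)
    return kl.mean()
```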
We demonstrate the generation capabilities of the proposed approach on two datasets with strong audio-visual correspondence: VAS and VGGSound. VAS is a relatively small-scale but manually curated dataset that consists of ~12.5k clips from YouTube. The clips span 8 classes: Dog, Fireworks, Drum, Baby, Gun, Sneeze, Cough, and Hammer.
VGGSound is a large-scale dataset with >190k video clips from YouTube spanning 300+ classes. The classes can be grouped as people, sports, nature, home, tools, vehicles, music, etc. VGGSound is 15 times larger than VAS but less curated due to its automatic collection procedure. To the best of our knowledge, we are the first to use VGGSound for sound generation.
Reliable reconstruction of input spectrograms is a necessary condition for high-quality spectrogram generation. We evaluate the reconstruction ability of the spectrogram autoencoder with the proposed set of metrics measuring fidelity (FID) and relevance (average MKL) of reconstructions. Moreover, such an experiment might give us a rough upper bound on the performance of the transformer. When compared to ground truth spectrograms, the reconstructions are expected to have high fidelity (low FID) and to be relevant (low mean MKL):
| Trained on | Evaluated on | FID ↓ | MKL ↓ |
|---|---|---|---|
| VGGSound | VGGSound | 1.0 | 0.8 |
| VGGSound | VAS | 3.2 | 0.7 |
| VAS | VAS | 6.0 | 1.0 |

MKL: Melception-based Kullback–Leibler divergence averaged among all samples in the dataset.
The results imply high fidelity and relevance on both the VGGSound (test) and VAS (validation) sets. Notably, the performance of the VGGSound-pretrained codebook is better than that of the VAS-pretrained codebook even when applied to the VAS validation set, thanks to the larger and more diverse data seen during training.
Next, we show qualitative results by drawing random samples from the hold-out sets of VGGSound and VAS and plotting the reconstructions alongside the ground-truth spectrograms.
We benchmark our visually guided generative model in three different settings:

a) the transformer is trained on VGGSound to sample from the VGGSound-pretrained codebook,
b) the transformer is trained on VAS to sample from the VGGSound-pretrained codebook,
c) the transformer is trained on VAS to sample from the VAS-pretrained codebook.

We additionally compare different ImageNet-pretrained features: BN-Inception (RGB + flow) and ResNet-50 (RGB).
As for spectrogram reconstruction evaluation, we rely on FID and average MKL as our metrics:
| Features | Condition | (a) FID ↓ | (a) MKL ↓ | (b) FID ↓ | (b) MKL ↓ | (c) FID ↓ | (c) MKL ↓ | Time ↓ (s) |
|---|---|---|---|---|---|---|---|---|
| — | No Features | 13.5 | 9.7 | 33.7 | 9.6 | 28.7 | 9.2 | 7.7 |
| ResNet | 1 Feature | 11.5 | 7.3 | 26.5 | 6.7 | 25.1 | 6.3 | 7.7 |
| ResNet | 5 Features | 11.3 | 7.0 | 22.3 | 6.5 | 20.9 | 6.1 | 7.9 |
| ResNet | 212 Features | 10.5 | 6.9 | 20.8 | 6.2 | 22.6 | 5.8 | 11.8 |
| Inception | 1 Feature | 8.6 | 7.7 | 38.6 | 7.3 | 25.1 | 6.6 | 7.7 |
| Inception | 5 Features | 9.4 | 7.0 | 29.1 | 6.9 | 24.8 | 6.2 | 7.9 |
| Inception | 212 Features | 9.6 | 6.8 | 20.5 | 6.0 | 25.4 | 5.9 | 11.8 |

Settings: (a) VGGSound-pretrained codebook, sampling for VGGSound; (b) VGGSound-pretrained codebook, sampling for VAS; (c) VAS-pretrained codebook, sampling for VAS.
MKL: Melception-based Kullback–Leibler divergence averaged among all samples in the dataset. Time: the time (in seconds) it takes to sample a ~9.8-second audio.
We observe that:

1) In general, the more features from the corresponding video are used, the better the result in terms of relevance. However, there is a trade-off imposed by the sampling speed, which decreases as the size of the conditioning grows.
2) A large gap (log-scale) in mean MKL between visual and "empty" conditioning suggests the importance of visual conditioning for producing relevant samples.
3) When the sampler and the codebook are trained on the same dataset, i.e. settings (a) and (c), the fidelity remains at a similar level as long as visual conditioning is used. This suggests that it is easier for the model to learn the "features-codebook" (visual-audio) correspondence even from just a few features. However, if they are trained on different datasets (b), the sampler benefits from more visual information.
4) Both BN-Inception and ResNet-50 features achieve comparable performance, with BN-Inception being slightly better on VGGSound and when longer conditioning is used in each setting. Notably, the ResNet-50 features are RGB-only, which significantly eases practical applications. We attribute the small difference between the RGB + flow and RGB-only features to the fact that ResNet-50 is a stronger architecture than BN-Inception on the ImageNet benchmark.
Next, we show samples from all three different settings:
Use our pre-trained model to generate samples for a custom video:
Note: the cough class in the VAS dataset has as few as 314 training videos
Note: the hammer class in the VAS dataset has as few as 318 training videos
We compare our model with RegNet, which is currently the strongest baseline for generating relevant sounds for a visual sequence. Since the RegNet approach requires training one model per class in a dataset, it explicitly shrinks the sampling space. In contrast, our model learns to sample for all classes in a dataset at the same time, which is a harder task. For a fair comparison, we also pass a class label (cls) along with the visual features into the model conditioning:
| Model | Parameters | FID ↓ | MKL ↓ | Time ↓ (s) |
|---|---|---|---|---|
| Ours (b) | 379M | 20.5 | 6.0 | 12 |
| Ours (c) | 377M | 25.4 | 5.9 | 12 |
| RegNet | 8\(\times\) | 78.8 | 5.7 | 1500 |
| Ours (b) + cls | 379M | 20.2 | 5.7 | 12 |
| Ours (c) + cls | 377M | 24.9 | 5.5 | 12 |

(b): the transformer is trained on the VAS dataset to sample from a VGGSound-pretrained codebook; (c): the transformer is trained on the VAS dataset to sample from a VAS-pretrained codebook.
The RegNet approach requires training one model per class; there are 8 classes in VAS, hence the 8\(\times\) parameter count.
MKL: Melception-based Kullback–Leibler divergence averaged among all samples in the dataset. Time: the time (in seconds) it takes to sample a ~9.8-second audio.
According to the results, our model significantly outperforms the baseline in terms of fidelity (FID) while being on par with or better than it at generating relevant samples. Moreover, the models without the cls token remain competitive with RegNet. This suggests that our model is able to learn the mapping between the visual features and the class distribution through the "features-codebook" correspondence. We highlight that our model is trained on all dataset classes at once, which is a much harder setting than that of the baseline.
The following results show the capability of the model to generate samples without a visual condition, which essentially depicts how well the model captures the distribution of the training set. Both the codebook and the transformer are trained on the same dataset, and the samples are not cherry-picked.
The transformer, given the previously generated part of the sequence, samples the next token. Here, we show the ability of the model to seamlessly continue a sequence of codebook indices from the original audio. The samples are drawn randomly from the hold-out sets and are not cherry-picked.
Note: the cough class in the VAS dataset has as few as 314 training videos
Temporal diversity is an important factor in high-fidelity audio. At the same time, it is challenging to generate a diverse yet relevant sample, which imposes a trade-off. The most straightforward approach to sampling the next item in the sequence is to always pick the codebook item predicted with the highest probability. This approach, however, results in relevant but unpleasant, unnatural, and low-diversity samples.
To avoid this, we can use the predicted distribution over the whole vocabulary (codebook) instead of only its top item. These probabilities can serve as weights of a multinomial distribution: the larger the probability of an item, the more likely it is to be picked as the next one.
However, we found that allowing the transformer to sample from all available codebook items indeed improves diversity but, at the same time, deteriorates relevance. We therefore end up with a "relevance-diversity" trade-off. To mitigate this issue, we control the trade-off by clipping the distribution. In particular, instead of either picking only the top item or using all available codebook items, we clip the set of available items to the Top-\(X\) according to their predicted probabilities.
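A minimal sketch of this Top-\(X\) clipping applied to the next-token logits before multinomial sampling (a generic top-k filter rather than the verbatim code from our implementation):

```python
import torch

def sample_next_token(logits: torch.Tensor, top_x: int) -> torch.Tensor:
    """logits: (B, vocab_size) transformer outputs for the next position."""
    # Keep only the Top-X most probable codebook items and mask out the rest.
    topk_vals, _ = torch.topk(logits, k=top_x, dim=-1)
    cutoff = topk_vals[:, -1].unsqueeze(-1)                   # X-th largest logit per row
    clipped = logits.masked_fill(logits < cutoff, float("-inf"))
    # The remaining probabilities act as weights of a multinomial distribution.
    probs = torch.softmax(clipped, dim=-1)
    return torch.multinomial(probs, num_samples=1)            # (B, 1) sampled indices
```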
Next, we show how this observation can be used to control the sample diversity:
or quantitatively relying on Melception-based FID and average MKL (the lower the better):
An ideal conditional generative model is expected to produce relevant examples for every class in a dataset. Here we show how relevant the generated samples are across every class in the VGGSound dataset:
Although the model's performance on a majority of the classes falls into the 7 ± 0.7 MKL interval, there is still room to improve the model's ability to handle multiple classes, which we hope to see in future research.
Since the generation of a relevant sound given a set of visual features is an ill-posed problem, a model is expected to produce a variety of relevant samples for the same condition. We show that our model is capable of generating a variety of relevant samples for the same visual condition:
Given the provided visual sequence ("a person is talking with a crowd in the background"), it is difficult to guess why the person turned back to the crowd, e.g. whether they were "cheering" or "booing". At the same time, we also notice a limitation of the model: it sometimes confuses the gender or age of a person. However, we believe these are reasonable mistakes considering the difficulty of the scene.
A recent (July 2021) arXiv submission showcased a VQVAE with an adversarial loss, called SoundStream, for lossy waveform compression and reported state-of-the-art results at a 3 kbps bitrate on music and speech datasets. Since our approach includes sampling from a pre-trained codebook, we can employ our Spectrogram VQGAN pre-trained on an open-domain dataset as a neural audio codec without any changes. Our approach allows encoding at approximately 0.27 kbps with the VGGSound codebook and 0.19 kbps with the VAS codebook.
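For reference, the bitrate of such a codec follows directly from the number of transmitted codes per clip and the codebook size. As an illustrative back-of-the-envelope calculation (the \(5 \times 53\) code grid and the 1024-entry codebook below are assumptions chosen to be consistent with the 0.27 kbps figure, not values stated in this section):

\[
\text{bitrate} = \frac{K \cdot \log_2 |\mathcal{Z}|}{T}
\approx \frac{(5 \cdot 53) \cdot \log_2 1024}{9.8\ \text{s}}
\approx 270\ \text{bit/s} \approx 0.27\ \text{kbps}.
\]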
We provide a small qualitative comparison of the reconstructions from Lyra (speech only), SoundStream (speech and music only), and Spectrogram VQGAN (open-domain). We use the same 3-second samples as provided on the SoundStream project page, since the source code for SoundStream has not been released to the public (checked on 20 October 2021; the authors promised to release it as part of the Lyra toolbox).
As a result, despite having an order of magnitude smaller bitrate budget, Spectrogram VQGAN achieves reconstruction quality comparable to SoundStream on music data and produces significantly better reconstructions than Lyra. However, as we observed before (Section 4.2 in the paper), Spectrogram VQGAN struggles with the fine details of human speech due to the audio preprocessing (mel-scale spectrogram) and the absence of the narrow-domain pre-training used in Lyra and SoundStream. We highlight that Spectrogram VQGAN is trained on an open-domain dataset with hundreds of classes (VGGSound), while SoundStream is trained on music and speech datasets separately.
Try our model on a custom audio on Google Colab:
Or with an even simpler interface without any exposed code:
Funding for this research project was provided by the Academy of Finland projects 327910 & 324346. We also acknowledge CSC – IT Center for Science, Finland, for computational resources.