Vladimir Iashin
/vla-DEE-meer/ /YA-sheen/

I'm a deep learning engineer at Revolut, where I research and develop foundation models for fintech applications.

Earlier, I was a postdoctoral researcher at the Visual Geometry Group (VGG) at the University of Oxford, where I worked with Andrew Zisserman on multimodal computer vision, with a focus on combining video, audio, and text.

Prior to that, I completed my PhD (with distinction) in EECS at Tampere University, supervised by Esa Rahtu. Even earlier, I did an MSc (with distinction) in Applied Mathematics and Computer Science, along with a BSc in Economics from HSE University.

In my free time, I enjoy hitting the gym and searching for terroir in coffee.

[Google Scholar] • [GitHub] • [X/Twitter] • [LinkedIn]

Selected Publications

PRAGMA: Revolut Foundation Model
Maxim Ostroukhov, Ruslan Mikhailov, Vladimir Iashin, and others
Technical Report, 2026
[Paper]

Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder
Vladimir Iashin, Horace Lee, Dan Schofield, and Andrew Zisserman
Computer Vision for Ecological and Biodiversity Monitoring Workshop, ICIP, 2025
[Project Page] • [Code] • [Paper]

SAGANet: Video Object Segmentation-aware Audio Generation
Ilpo Viertola, Vladimir Iashin, Esa Rahtu
GCPR, 2025 (Oral)
[Project Page] • [Code] • [Paper]

Temporally Aligned Audio for Video with Autoregression
Ilpo Viertola, Vladimir Iashin, Esa Rahtu
ICASSP, 2025 (Oral)
[Project Page] • [Code] • [Paper]

Synchformer: Efficient Synchronization from Sparse Cues
Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman
ICASSP, 2024
[Project Page] • [Code] • [Paper]

Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors
Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman
BMVC, 2022 (Spotlight)
[Project Page] • [Code] • [Paper] • [Presentation]

Taming Visually Guided Sound Generation
Vladimir Iashin and Esa Rahtu
BMVC, 2021 (Oral)
[Project Page] • [Code] • [Paper] • [Presentation]

A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer
Vladimir Iashin and Esa Rahtu
BMVC, 2020
[Project Page] • [Code] • [Paper] • [Presentation]

Multi-modal Dense Video Captioning
Vladimir Iashin and Esa Rahtu
Multimodal Learning Workshop, CVPR, 2020
[Project Page] • [Code] • [Paper] • [Presentation]

Community Service

Reviewer
AAAI 2023, CVPR 2022, ICCV 2021, TPAMI 2020, and conference workshops in 2024 & 2021

Misc

Video Features
Extracts visual features and optical flow frames from raw videos, with multi-GPU acceleration, through a user-friendly and flexible API.

Object Detector
Discover the contents of your uploaded image effortlessly.
The detector is based on YOLOv3 and implemented in PyTorch. The computation originally ran on a cloud server as a Flask application (see the Note); it is now hosted on HuggingFace Spaces with a Gradio UI.
[Code] • [Note]

ITC Wiki
During my PhD studies, I organized a crowd-sourced wiki that answers common internal questions, such as how to set up remote access to office GPU machines.

IDE Customization
A note about VSCode customization.