Seminars and Talks

Supercharging Multimodal Video Representations
by Rohit Girdhar
Date: Friday, Jan. 19
Time: 16:00
Location: Online Call via Zoom

Our guest speaker is Rohit Girdhar from the GenAI Research group, Meta.

You are all cordially invited to the CVG Seminar on January 19th at 4 pm CET

  • via Zoom (passcode is 659431).


The last few years have seen an explosion in the capabilities of representations learned by large models trained on lots of data. From LLMs like GPT4 for natural language processing, to multimodal models like CLIP or Flamingo for visual reasoning, to text-to-image models like DALLE-3 for image generation, these models have revolutionized the way computers understand these different modalities. One modality, however, has been somewhat left behind: video. While GPT4V and DALLE-3 have made huge strides in image understanding and generation, understanding or generating videos is still an open problem. What are the reasons for this, and will video representations ever catch up? I believe that, rather than treating this as a competition between videos and the other modalities, we should view the strong language, image, and generative representations as an asset for bootstrapping strong video representations. In this talk, I will share some of my recent work on building better video representations by leveraging these advanced representations, specifically for the tasks of video understanding, multimodal understanding, and video generation.


Rohit is a Research Scientist in the GenAI Research group at Meta. His current research focuses on understanding and generating multimodal data using minimal human supervision. He obtained an MS and a PhD in Robotics from Carnegie Mellon University, where he worked on learning from and understanding videos. He was previously part of the Facebook AI Research (FAIR) group at Meta, and has spent time at DeepMind, Adobe and Facebook as an intern. His research has won multiple international challenges and has been recognized through a Best Paper (Finalist) Award at CVPR’22, a Best Paper Award at the ICCV’19 HVU Workshop, a Siebel Scholarship at CMU, and a Gold Medal and Research Award for undergraduate research at IIIT Hyderabad.

Segmenting Objects without Manual Supervision
by Laurynas Karazija
Date: Friday, Jan. 12
Time: 14:30
Location: Online Call via Zoom

Our guest speaker is Laurynas Karazija from the Visual Geometry Group, University of Oxford.

You are all cordially invited to the CVG Seminar on January 12th at 2:30 pm CET

  • via Zoom (passcode is 043728).


Detecting, localising and representing objects comprising the visual world is an important and interesting problem with many downstream applications. Today's systems are supervised, relying on extensive and expensive manual annotations. In this talk, I will introduce some recent works that explore learning from appearance, motion and language in an unsupervised or weakly-supervised manner. In particular, I will focus on the drawbacks of appearance-based object-centric models, explain how to teach segmentation networks using optical flow in an end-to-end manner and show how pretrained generative diffusion models can be used to synthesise segmenters directly by sampling and representing objects and their context.


Laurynas Karazija is a PhD student at the Visual Geometry Group at the University of Oxford, UK, working with Prof Andrea Vedaldi, Prof Christian Rupprecht and Dr Iro Laina. He focuses on learning to understand and decompose the visual world into distinct objects with as little supervision as possible.

Three Views on View Synthesis
by Kyle Sargent
Date: Friday, Dec. 15
Time: 16:00
Location: Online Call via Zoom

Our guest speaker is Kyle Sargent from Stanford Vision Lab.

You are all cordially invited to the CVG Seminar on December 15th at 4 pm CET

  • via Zoom (passcode is 520944).


Novel view synthesis from a single image is an important problem in computer vision. Several sources of randomness and ill-posedness make the problem extremely challenging. I will present three papers from the course of my research career, each taking a very different perspective and technical approach to this problem. As the talk progresses, I will explain how I have come to regard 3D generative modeling and 3D novel view synthesis as closely connected, and give supporting evidence. The final paper I will present is ZeroNVS: Zero-shot 360-degree View Synthesis from a Single Real Image, my most recent paper, which is currently in submission.


Kyle Sargent is a second-year PhD student in the Stanford Vision Lab, advised by Jiajun Wu and Fei-Fei Li. He works on 3D generative models and novel view synthesis. He has written several papers for top vision conferences, including two first- or co-first-authored Best Paper Finalists, at CVPR 2022 and ICCV 2023. Prior to joining Stanford, he was an AI Resident at Google Research, and prior to that, he was an undergraduate at Harvard.

Using Deep Generative Models for Representation Learning and Beyond
by Daiqing Li
Date: Thursday, Dec. 7
Time: 16:00
Location: Online Call via Zoom

Our guest speaker is Daiqing Li from Playground.

You are all cordially invited to the CVG Seminar on December 7th at 4 pm CET

  • via Zoom (passcode is 102781).


Diffusion-based deep generative models have demonstrated remarkable performance in text-conditioned synthesis tasks for images, videos, and 3D. In this talk, I will discuss how to use large-scale text-to-image (T2I) models as vision foundation models for representation learning and other downstream tasks, such as synthetic dataset generation and semantic segmentation.


Daiqing Li is currently a research lead at Playground, where his primary focus is advancing pixel foundation models. Previously, he was a senior research scientist at the NVIDIA Toronto AI Lab, where his research spanned computer vision, computer graphics, generative models, and machine learning. He collaborates closely with Sanja Fidler and Antonio Torralba at NVIDIA, and several of his works have been integrated into NVIDIA products, notably Omniverse and Clara. Daiqing graduated from the University of Toronto and has been recognized as the runner-up for the MICCAI Young Scientist Awards. His recent research focuses on using generative models for dataset synthesis, perception tasks, and representation learning. He is the author of SemanticGAN, BigDatasetGAN, and DreamTeacher.

Event-based optical flow and stereo depth estimation using contrast maximization
by Guillermo Gallego
Date: Monday, Apr. 17
Time: 09:00
Location: N10_302, Institute of Computer Science

Our guest speaker is Dr. Guillermo Gallego from TU Berlin.

You are all cordially invited to the CVG Seminar on April 17th at 9 am CET


Event cameras are novel vision sensors that mimic functions of the human retina and offer potential advantages over traditional cameras (low latency, high speed, high dynamic range, etc.). They acquire visual information in the form of pixel-wise brightness changes, called events. This talk presents event processing approaches for motion estimation in computer vision and robotics applications. In particular, we will discuss recent advances by the Robotic Interactive Perception Lab at TU Berlin in extending the contrast maximization framework to optical flow and stereo depth estimation.


Guillermo Gallego is Associate Professor at TU Berlin and the Einstein Center Digital Future, Berlin, Germany. He is also a PI of the Science of Intelligence Excellence Cluster. He received the PhD degree in Electrical and Computer Engineering from the Georgia Institute of Technology, USA, in 2011. From 2011 to 2014 he was a Marie Curie researcher with Universidad Politecnica de Madrid, Spain, and from 2014 to 2019 he was a postdoctoral researcher with the Robotics and Perception Group at the University of Zurich, Switzerland. He serves as Associate Editor for IEEE Trans. on Pattern Analysis and Machine Intelligence, IEEE Robotics and Automation Letters and the International Journal of Robotics Research.