Seminars and Talks

Sparse-view 3D in the Wild
by Jason Y. Zhang
Date: Friday, Apr. 26
Time: 16:00
Location: Online Call via Zoom

Our guest speaker is Jason Y. Zhang from Carnegie Mellon University.

You are all cordially invited to the CVG Seminar on April 26th at 4 pm CEST

  • via Zoom (passcode is 003713).


Reconstructing 3D scenes and objects from images alone has been a long-standing goal in computer vision. However, typical methods require a large number of images with precisely calibrated camera poses, which is cumbersome for end users. We propose a probabilistic framework that can predict distributions over relative camera rotations. These distributions are then composed into coherent camera poses given sparse image sets. To improve precision, we then propose a diffusion-based model that represents camera poses as a distribution over rays instead of camera extrinsics. We demonstrate that our system recovers accurate camera poses from a variety of self-captures and that these poses are sufficient for high-quality 3D reconstruction.


Jason Y. Zhang is a final-year PhD student at Carnegie Mellon University, advised by Deva Ramanan and Shubham Tulsiani. Jason completed his undergraduate degree at UC Berkeley, where he worked with Jitendra Malik and Angjoo Kanazawa. He is interested in scaling single-view and multi-view 3D to unconstrained environments. Jason is supported in part by the NSF GRFP.

Understanding and Harnessing Foundation Models
by Narek Tumanyan
Date: Friday, Mar. 22
Time: 14:30
Location: Online Call via Zoom

Our guest speaker is Narek Tumanyan from the Weizmann Institute of Science.

You are all cordially invited to the CVG Seminar on March 22nd at 2:30 pm CET

  • via Zoom (passcode is 696673).


The field of computer vision has been undergoing a paradigm shift, moving from task-specific models to "foundation models": large-scale networks trained on massive amounts of data that can be adapted to a variety of downstream tasks. However, current state-of-the-art foundation models are largely "black boxes". That is, although these models are successfully leveraged for downstream tasks, the underlying mechanisms responsible for their performance are not well understood. In this talk, we will study the internal representations of two prominent foundation models: DINO-ViT, a self-supervised vision transformer, and StableDiffusion, a text-to-image generative latent diffusion model. This will enable us to

  1. Unveil novel visual descriptors;
  2. Devise efficient frameworks for semantic image manipulation based on the novel visual descriptors.

We demonstrate how gaining an understanding of internal representations enables more creative use of foundation models and expands their capabilities to a broader set of tasks.


I am a PhD student at the Weizmann Institute of Science, Faculty of Mathematics and Computer Science, advised by Tali Dekel. My research focuses on analyzing and understanding the internal representations of large-scale models and leveraging them as priors for downstream tasks in images and videos, such as image manipulation, editing, and point tracking. I completed my Master's degree in Tali Dekel's lab at the Weizmann Institute, where I also started my PhD in March 2023.

Towards Perceptually-Enabled Task Assistants
by Ehsan Elhamifar
Date: Wednesday, Mar. 13
Time: 11:00
Location: N10_302, Institute of Computer Science

Our guest speaker is Prof. Ehsan Elhamifar from the Khoury College of Computer Sciences, Northeastern University.

You are all cordially invited to the CVG Seminar on March 13th at 11:00 am CET


Humans perform a wide range of complex activities, such as cooking hour-long recipes, assembling and repairing devices and performing surgeries. Many of these activities are procedural: they consist of sequences of steps that must be followed to achieve the desired goals. Learning complex procedures from videos of humans performing them allows us to design intelligent task assistants, robots and coaching platforms that perform or guide people through tasks. In this talk, we present new neural architectures as well as learning and inference frameworks to understand complex activity videos, addressing the following challenges:

  1. Procedural videos are long, uncurated and contain many task-irrelevant activities, with different videos showing different ways of performing the same task.
  2. Gathering framewise video annotation is costly and not scalable to many videos and tasks.
  3. At inference time, we must accurately recognize actions as data arrive in real time, even when only a few frames have been observed.


Ehsan Elhamifar is an Associate Professor in the Khoury College of Computer Sciences, the Director of the Mathematical Data Science (MCADS) Lab and the Director of MS in AI at Northeastern University. He has broad research interests in computer vision, machine learning and AI. The overarching goal of his research is to develop AI that learns from and makes inferences about data in a manner analogous to humans. He is a recipient of the DARPA Young Faculty Award. Prior to Northeastern, he was a postdoctoral scholar in the EECS department at UC Berkeley. He obtained his PhD in ECE from Johns Hopkins University (JHU) and received two Master's degrees, one in EE from Sharif University of Technology in Iran and another in Applied Mathematics and Statistics from JHU.

Dense FixMatch: a Simple Semi-supervised Learning Method for Pixel-wise Prediction Tasks
by Atsuto Maki
Date: Friday, Feb. 2
Time: 14:30
Location: N10_302, Institute of Computer Science

Our guest speaker is Prof. Atsuto Maki from the KTH Royal Institute of Technology.

You are all cordially invited to the CVG Seminar on February 2nd at 2:30 p.m. CET


We discuss Dense FixMatch, a simple method for online semi-supervised learning of dense and structured prediction tasks that combines pseudo-labeling and consistency regularization via strong data augmentation. It extends FixMatch beyond image classification by adding a matching operation on the pseudo-labels. This allows us to still use the full strength of data augmentation pipelines, including geometric transformations. We evaluated it on semi-supervised semantic segmentation on Cityscapes and Pascal VOC with different percentages of labeled data, and ablated design choices and hyper-parameters. Dense FixMatch significantly improves results compared to supervised learning using only labeled data, approaching the performance of supervised training on the full labeled set while using only 1/4 of the labeled samples.

[1] Dense FixMatch: a simple semi-supervised learning method for pixel-wise prediction tasks

[2] An analysis of over-sampling labeled data in semi-supervised learning with FixMatch
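
To make the idea in the abstract above more concrete, here is a minimal sketch of a dense pseudo-label loss in plain Python. It is not the speaker's implementation. It assumes that the "matching operation" amounts to applying the same geometric transformation (here a horizontal flip) to the pseudo-label map that the strong augmentation applied to the image, so that labels and predictions line up pixel by pixel. The function names, the confidence threshold of 0.9, and the choice to average over all pixels are illustrative assumptions.

```python
import math


def hflip(grid):
    """Horizontally flip a 2D grid given as a list of rows."""
    return [list(reversed(row)) for row in grid]


def pseudo_labels(probs, threshold):
    """Return per-pixel argmax labels and a confidence mask.

    probs is an H x W x C nested list of class probabilities predicted
    on the weakly augmented image.
    """
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]
    mask = [[max(p) >= threshold for p in row] for row in probs]
    return labels, mask


def dense_pseudo_label_loss(weak_probs, strong_probs, geometric, threshold=0.9):
    """Cross-entropy between strong-view predictions and matched pseudo-labels.

    geometric is the geometric part of the strong augmentation (assumed
    here to be applicable to label maps as well as images).
    """
    labels, mask = pseudo_labels(weak_probs, threshold)
    # Matching step (assumption): move the pseudo-labels and the mask into
    # the coordinate frame of the strongly augmented image.
    labels, mask = geometric(labels), geometric(mask)

    total, num_pixels = 0.0, 0
    for y, row in enumerate(strong_probs):
        for x, p in enumerate(row):
            num_pixels += 1
            if mask[y][x]:  # only confident pixels contribute
                total += -math.log(max(p[labels[y][x]], 1e-12))
    # Average over all pixels, confident or not (an illustrative choice).
    return total / num_pixels if num_pixels else 0.0


# Toy 1 x 2 "image" with two classes.
weak = [[[0.95, 0.05], [0.02, 0.98]]]    # predictions on the weak view
strong = [[[0.20, 0.80], [0.50, 0.50]]]  # predictions on the flipped strong view
print(dense_pseudo_label_loss(weak, strong, hflip))
```

Without the matching step, the pseudo-label at each pixel would refer to a different image location than the strong-view prediction it supervises, which is why plain FixMatch cannot be combined with geometric augmentations for pixel-wise tasks as is.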


Atsuto Maki is a Professor of Computer Science at KTH Royal Institute of Technology, Sweden. He obtained his BEng and MEng degrees in electrical engineering from Kyoto University and the University of Tokyo, respectively, and his PhD degree in computer science from KTH. Previously he was an associate professor at the Graduate School of Informatics, Kyoto University, and then a senior researcher at Toshiba’s Cambridge Research Lab in the UK. His research interests cover a broad range of topics in machine learning, deep learning, and computer vision, including motion and object recognition, clustering, subspace analysis, stereopsis, and representation learning. He has been serving as a program committee member at major computer vision conferences, e.g. as an area chair of ICCV and ECCV.

Supercharging Multimodal Video Representations
by Rohit Girdhar
Date: Friday, Jan. 19
Time: 16:00
Location: Online Call via Zoom

Our guest speaker is Rohit Girdhar from the GenAI Research group, Meta.

You are all cordially invited to the CVG Seminar on January 19th at 4 pm CET

  • via Zoom (passcode is 659431).


The last few years have seen an explosion in the capabilities of representations learned by large models trained on lots of data. From LLMs like GPT-4 for natural language processing, to multimodal models like CLIP or Flamingo for visual reasoning, to text-to-image models like DALL-E 3 for image generation, these models have revolutionized the way computers understand these different modalities. One modality, however, has been somewhat left behind: videos. While GPT-4V and DALL-E 3 have made huge strides in image understanding and generation, understanding or generating videos is still an open problem. What are the reasons for this, and will video representations ever catch up? I believe that instead of thinking of this as a competition between videos and the other modalities, the strong language, image, or generative representations should be viewed as an asset for bootstrapping strong video representations. In this talk, I will share some of my recent work on building better video representations by leveraging these advanced representations, specifically for the tasks of video understanding, multimodal understanding, and video generation.


Rohit is a Research Scientist in the GenAI Research group at Meta. His current research focuses on understanding and generating multimodal data using minimal human supervision. He obtained an MS and a PhD in Robotics from Carnegie Mellon University, where he worked on learning from and understanding videos. He was previously part of the Facebook AI Research (FAIR) group at Meta, and has spent time at DeepMind, Adobe and Facebook as an intern. His research has won multiple international challenges, and has been recognized through a Best Paper (Finalist) Award at CVPR’22, a Best Paper Award at the ICCV’19 HVU Workshop, a Siebel Scholarship at CMU, and a Gold Medal and Research Award for undergraduate research at IIIT Hyderabad.