Theses

A list of completed theses and new thesis topics from the Computer Vision Group.

Are you about to start a BSc or MSc thesis? Please read our instructions for preparing and delivering your work.

Below we list possible thesis topics for Bachelor and Master students in the areas of Computer Vision, Machine Learning, Deep Learning and Pattern Recognition. The project descriptions leave plenty of room for your own ideas. If you would like to discuss a topic in detail, please contact the supervisor listed below and Prof. Paolo Favaro to schedule a meeting. Note that for MSc students in Computer Science it is required that the official advisor is a professor in CS.

AI-Based Image Quality Assessment for MRI and CT Scans in Radiology

Level: Master

June 2025

Context and Motivation:

Medical imaging plays a critical role in diagnostic workflows, but the diagnostic utility of MRI and CT scans can be significantly degraded by various types of quality issues. These include patient or organ motion artifacts (see, for example, the blurry edges caused by the heartbeat in the figure below), poor contrast, scanner miscalibration, and incorrect scan volume positioning or coverage. Currently, image quality checks are often performed manually by radiologists or technicians, which is time-consuming, subjective, and sometimes happens too late. There is a growing need for robust, automated tools that perform image quality assessment (IQA) reliably and in real time, to aid clinical decision-making and reduce the need for repeated scans.

Thesis Objective:

The goal of this thesis is to develop a machine learning-based system that automatically assesses the quality of MRI and/or CT scans. The model should detect and categorize common quality issues such as:

Patient motion artifacts (blurring, ghosting, ringing)
Scan misalignments (head/body not centered in scan volume)
Field-of-view clipping or partial coverage of anatomy
Noise and low signal-to-noise ratio (SNR)
Beam hardening artifacts (in CT)
Metal or implant-induced artifacts
Contrast agent timing issues (e.g., poor vascular contrast)
Slice thickness inconsistencies or spacing artifacts
and others

The model should output a quality score and, where possible, localize regions with detected artifacts.

Data:

The Department of Diagnostic, Interventional and Pediatric Radiology (DIPR) will provide a dataset of anonymized MRI and CT scans, along with partial annotations indicating known quality issues. These may include binary or categorical labels per scan, or image-level artifact annotations. Additional manual annotations may be curated as needed for evaluation and training.

Tasks:

Literature review on existing approaches to medical image quality assessment and artifact classification.
Data preparation: preprocessing and augmentation, designing train/test splits.
Model development: investigate and implement CNNs, transformers, or hybrid architectures for volumetric data; evaluate existing pre-trained models.
Multi-label classification and localization: develop techniques for identifying and attributing multiple artifacts in a single scan.
Evaluation: using standard metrics (AUC, F1, mean Average Precision), possibly stratified by artifact type.
User-facing output: produce interpretable visualizations (e.g., saliency maps) to explain detected quality issues.
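As an illustration of the evaluation task above, the following minimal sketch computes an F1 score per artifact type for multi-label predictions. The label layout, the three artifact columns and the toy values are assumptions made for this example, not part of the DIPR dataset.

```python
import numpy as np

def per_label_f1(y_true, y_pred):
    """F1 score per artifact type for multi-label predictions.

    y_true, y_pred: binary arrays of shape (n_scans, n_artifact_types).
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    denom = 2 * tp + fp + fn
    # Convention chosen here: F1 is 0 when a type has no positives and no predictions.
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

# Hypothetical labels; columns stand for (motion, clipping, noise).
y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 0, 0]]
f1 = per_label_f1(y_true, y_pred)
print(dict(zip(["motion", "clipping", "noise"], f1.round(3))))
```

Reporting one score per artifact type, rather than a single average, makes it visible when a rare artifact is systematically missed.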

Requirements:

Strong background in machine learning and deep learning
Experience with medical imaging data and 3D image processing is a plus
Interest in healthcare applications and interdisciplinary collaboration

Deliverables:

A trained and evaluated AI model for scan quality assessment
An annotated dataset and data curation tools
A comprehensive thesis document describing methodology, experiments, and results
Source code and user documentation

Expected Impact:

This work aims to reduce the burden on radiologists, improve patient throughput, and decrease repeat scan rates by flagging unusable scans early. The project also has potential for real-world deployment in clinical imaging pipelines.

Supervisors

Department of Computer Science (INF)

Prof. Dr. Paolo Favaro, email: paolo.favaro@unibe.ch

Department of Diagnostic, Interventional and Pediatric Radiology (DIPR)

Prof. Dr. med. Hendrik von Tengg-Kobligk, email: Hendrik.vonTengg@insel.ch

Unconventional augmentation of crops on AI-generated backgrounds to improve the performance of Crop Canopy Cover in Canola and Faba Bean Fields

Level: Master

March 2025

Background
Precision agriculture has emerged as a critical approach to enhancing agricultural productivity and sustainability. In this context, semantic segmentation, a computer vision technique that classifies each pixel in an image into a predefined category, holds immense potential for monitoring crop health and development. However, models are usually tied to the specific conditions under which the experimental data were acquired and often generalize poorly. This thesis aims to improve generalization by exploring the design of unconventional augmentation techniques. One proposed technique is to overlay masks of plants and weeds on AI-generated backgrounds from multiple scenarios and to compare the resulting performance with that of conventional techniques.

Aim

Create unconventional augmentation techniques, such as mixing randomly overlapping cropped masks on backgrounds generated with generative models that simulate multiple scenarios (in vitro, field, drone).
Develop semantic segmentation models to automate the detection of faba beans and colza.
Evaluate the performance of the proposed model using a dataset of field images acquired over time and under different augmentation scenarios.
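The compositing idea in the first aim can be sketched as follows. This is a minimal illustration, assuming the crop fits inside the background at the chosen position; the function name `paste_crop`, the array shapes and the toy data are hypothetical, and a real background would come from a generative model rather than an array of zeros.

```python
import numpy as np

def paste_crop(background, bg_label, crop, crop_mask, class_id, top, left):
    """Overlay a masked plant crop on a background image.

    background: (H, W, 3) uint8 image, e.g. a generated soil scene.
    bg_label:   (H, W) integer label map of the background (0 = soil).
    crop:       (h, w, 3) uint8 cut-out of a plant.
    crop_mask:  (h, w) boolean mask, True on plant pixels.
    Returns a new image and label map; the inputs are left unchanged.
    """
    image = background.copy()
    label = bg_label.copy()
    h, w = crop_mask.shape
    region = image[top:top + h, left:left + w]  # view into the copied image
    region[crop_mask] = crop[crop_mask]         # copy plant pixels only
    label[top:top + h, left:left + w][crop_mask] = class_id
    return image, label

rng = np.random.default_rng(0)
background = np.zeros((8, 8, 3), dtype=np.uint8)  # stand-in for a generated background
bg_label = np.zeros((8, 8), dtype=np.int64)
crop = np.full((3, 3, 3), 200, dtype=np.uint8)    # stand-in for a plant cut-out
crop_mask = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=bool)
top = int(rng.integers(0, 8 - 3 + 1))             # random position that keeps the crop inside
left = int(rng.integers(0, 8 - 3 + 1))
image, label = paste_crop(background, bg_label, crop, crop_mask, 1, top, left)
```

Calling the function repeatedly on its own output places several crops on one background; later crops overwrite earlier ones, which produces the random overlaps mentioned above.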

Methodology

Data Collection: we already have a big dataset of images annotated with segmentation masks.
Data Preprocessing: Preprocess the images to ensure consistency in size, format, and color calibration.
Model Development: Employ a deep learning architecture, such as U-Net or DeepLabV3+, for semantic segmentation.
Model Training: Train the model using the preprocessed image dataset, optimizing hyperparameters for improved performance.
Model Evaluation: Evaluate the trained model's performance on a separate validation dataset, assessing its accuracy and robustness.
Analysis in context: the results of the developed models will be used by other researchers and experts in the agricultural field to draw conclusions about their research.

Requirements

Interest/Experience with image processing.
Python programming knowledge (PyTorch is a bonus).

Language
English.

Starting date
As soon as possible.

Duration
6 months minimum (can be extended to 12 months if the student is interested).

Supervisors
Paolo Favaro, Hassan-Roland Nasser.

Institutes
Computer Vision Group, Posieux (home office possible).

Contact
Hassan-Roland Nasser, hassan-roland.nasser@agroscope.admin.ch.

Analyzing 15TB of pig videos to determine the relative positions of the heads and tails of pigs in order to assess aggressive behavior

Level: Master

March 2025

Background
Pig welfare in commercial barns matters from both an economic and an ethical point of view. On the one hand, commercial barns are increasingly attentive to pig welfare and aim to maintain a less aggressive environment for their animals. On the other hand, the higher the welfare in a barn, the higher the economic return (better meat quality, fewer incidents, …). For this reason, commercial barns usually monitor aggressive pigs and exclude them from subsequent breeding programs. Manual assessment is resource-intensive and inconsistent, because it depends on the operator. The aim of this thesis is to exploit part of our 15TB video dataset to find clues about aggressive behavior in pigs.

Aim

Review an object detection model we created and improve it.
Create a projection map from the acquired videos to the barn map.
From the object detection model, obtain a proxy dataset of head and tail positions in the actual barn.
Analyze the relative motions of heads and tails and, in collaboration with an expert animal scientist, explore the relationship between the speed of head-to-tail motion and aggressive behavior.
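The projection and relative-position steps could look roughly like the sketch below. It assumes the barn floor is planar, so that a 3x3 homography maps pixel coordinates to floor coordinates; the matrix `H`, the pixel detections and the function names are hypothetical. In practice the homography would be estimated per camera from at least four known floor points.

```python
import numpy as np

def project_to_barn(H, pixels):
    """Map (N, 2) pixel coordinates to barn-floor coordinates using a 3x3 homography H."""
    pixels = np.asarray(pixels, dtype=float)
    homogeneous = np.hstack([pixels, np.ones((len(pixels), 1))])
    mapped = homogeneous @ H.T
    return mapped[:, :2] / mapped[:, 2:3]  # divide by the projective coordinate

def head_to_tail_distances(heads, tails):
    """Entry [i, j] is the floor distance from the head of pig i to the tail of pig j."""
    diff = heads[:, None, :] - tails[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Hypothetical calibration: a pure scaling of 0.01 m per pixel.
H = np.diag([0.01, 0.01, 1.0])
head_px = [[100, 200], [400, 200]]  # detected head centres, in pixels
tail_px = [[300, 200], [600, 200]]  # detected tail centres, in pixels
heads = project_to_barn(H, head_px)
tails = project_to_barn(H, tail_px)
distances = head_to_tail_distances(heads, tails)
print(distances)
```

Tracking how quickly the off-diagonal entries shrink over consecutive frames gives one possible proxy for a pig approaching another pig's tail.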

Methodology

Data discovery: we already have a big dataset of 15TB of videos over 4 cameras.
Model Development: Employ a deep learning architecture, such as Faster R-CNN or YOLOv8.
Deep Learning Model Training and evaluation: Train the model using the preprocessed image dataset, optimizing hyperparameters for improved performance.
Comparing supervised and unsupervised ML models to assess the predictability of aggressive behavior.
Analysis in context: the results of the developed models will be used by other researchers and experts in the animal science field to draw conclusions about their research.

Requirements

Interest/Experience with image processing.
Python programming knowledge (PyTorch is a bonus).

Language
English.

Starting date
As soon as possible.

Duration
6 months minimum (can be extended to 12 months if the student is interested).

Supervisors
Paolo Favaro, Hassan-Roland Nasser.

Institutes
Computer Vision Group, Posieux (home office possible).

Contact
Hassan-Roland Nasser, hassan-roland.nasser@agroscope.admin.ch.

AI deconvolution of light microscopy images

Level: Master

Oct. 2023

Background
Light microscopy has become an indispensable tool in life sciences research. Deconvolution is an important image processing step for improving the quality of microscopy images: it removes out-of-focus light, increases resolution, and improves the signal-to-noise ratio. Classical deconvolution methods, such as regularisation or blind deconvolution, are implemented in numerous commercial software packages and widely used in research. Recently, AI deconvolution algorithms have been introduced and are being actively developed, as they have shown high application potential.

Aim
Adaptation of available AI algorithms for deconvolution of microscopy images. Validation of these methods against state-of-the-art commercially available deconvolution software.

Material and Methods
The student will implement and further develop available AI deconvolution methods and acquire test microscopy images of different modalities. The performance of the developed AI algorithms will be validated against available commercial deconvolution software.
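For orientation, a classical regularised deconvolution can be written in a few lines. The sketch below is a Tikhonov-regularised inverse filter in the Fourier domain, given only as an example of the kind of baseline an AI method would be compared against; it is not the algorithm of any particular commercial package. The toy image, the point spread function (PSF) and the value of `reg` are assumptions.

```python
import numpy as np

def tikhonov_deconvolve(blurred, psf, reg=1e-3):
    """Tikhonov-regularised inverse filter in the Fourier domain.

    Assumes circular convolution and a PSF array with the same shape as the
    image, with its centre stored at index (0, 0).
    """
    H = np.fft.fft2(psf)
    G = np.fft.fft2(blurred)
    F = np.conj(H) * G / (np.abs(H) ** 2 + reg)  # reg damps frequencies where |H| is small
    return np.real(np.fft.ifft2(F))

# Toy image with two point sources.
image = np.zeros((16, 16))
image[4, 5] = 1.0
image[10, 12] = 1.0

# Toy PSF: a centre weight plus four neighbours, wrapped around index (0, 0).
psf = np.zeros((16, 16))
psf[0, 0] = 0.6
psf[0, 1] = psf[1, 0] = psf[0, -1] = psf[-1, 0] = 0.1

blurred = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(psf)))
restored = tikhonov_deconvolve(blurred, psf, reg=1e-6)
print(np.abs(restored - image).max())
```

With real, noisy data the PSF is only approximately known and `reg` has to be much larger, which is where learned methods are expected to help.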

Nature of thesis

AI algorithm development and implementation: 50%.
Data acquisition: 10%.
Comparison of performance: 40%.

Requirements

Interest in imaging.
Solid knowledge of AI.
Good programming skills.

Supervisors
Paolo Favaro, Guillaume Witz, Yury Belyaev.

Institutes
Computer Vision Group, Digital Science Lab, Microscopy Imaging Center.

Contact
Yury Belyaev, Microscopy Imaging Center, yury.belyaev@unibern.ch, + 41 78 899 0110.

Instance segmentation of cryo-ET images

Level: Bachelor/Master

July 2023

In the 1600s, a pioneering Dutch scientist named Antonie van Leeuwenhoek embarked on a remarkable journey that would forever transform our understanding of the natural world. Armed with a simple yet ingenious invention, the light microscope, he delved into uncharted territory, peering through its lens to reveal the hidden wonders of microscopic structures. Fast forward to today, where cryo-electron tomography (cryo-ET) has emerged as a groundbreaking technique, allowing researchers to study proteins within their natural cellular environments. Proteins, functioning as vital nano-machines, play crucial roles in life and understanding their localization and interactions is key to both basic research and disease comprehension. However, cryo-ET images pose challenges due to inherent noise and a scarcity of annotated data for training deep learning models.

Credit: S. Albert et al./PNAS (CC BY 4.0)

To address these challenges, this project aims to develop a self-supervised pipeline that uses diffusion models for instance segmentation in cryo-ET images. Diffusion models learn to generate images by reversing a gradual noising process, and the pipeline uses this capability to refine and accurately segment cryo-ET images. Self-supervised learning, which relies on unlabeled data, reduces the dependence on extensive manual annotations. Successful implementation of this pipeline could revolutionize the field of structural biology, facilitating the analysis of protein distribution and organization within cellular contexts. Moreover, it has the potential to alleviate the limitations posed by scarce annotated data, enabling more efficient extraction of valuable information from cryo-ET images and advancing biomedical applications by enhancing our understanding of protein behavior.

Methods
The segmentation pipeline for cryo-electron tomography (cryo-ET) images consists of two stages: training a diffusion model for image generation and training an instance segmentation U-Net using synthetic and real segmentation masks.

    1. Diffusion Model Training:
        a. Data Collection: Collect and curate cryo-ET image datasets from the EMPIAR
            database (https://www.ebi.ac.uk/empiar/).
        b. Architecture Design: Select an appropriate architecture for the diffusion model.
        c. Model Evaluation: Cryo-ET experts will help assess image quality and fidelity
            through visual inspection and quantitative measures.
    2. Building the Segmentation dataset:
        a. Synthetic and real mask generation: Use the trained diffusion model to generate
            synthetic cryo-ET images. The diffusion process will be seeded from either a real
            or a synthetic segmentation mask. This will yield pairs of cryo-ET images and
            segmentation masks.
    3. Instance Segmentation U-Net Training:
        a. Architecture Design: Choose an appropriate instance segmentation U-Net
            architecture.
        b. Model Evaluation: Evaluate the trained U-Net using precision, recall, and F1
            score metrics.

By combining the diffusion model for cryo-ET image generation and the instance segmentation U-Net, this pipeline provides an efficient and accurate approach to segment structures in cryo-ET images, facilitating further analysis and interpretation.
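To make the first stage more concrete, the sketch below shows the forward (noising) process of a DDPM-style diffusion model, which produces the noisy inputs and noise targets a denoising network is trained on. The project description does not fix a particular formulation, so the linear noise schedule, the number of steps and the random stand-in image are assumptions.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_image(x0, t, alpha_bar, rng):
    """Sample x_t from x_0: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # a denoising network would be trained to predict eps from (xt, t)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(32, 32))           # stand-in for a normalised tomogram slice
x_early, _ = noise_image(x0, 0, alpha_bar, rng)      # almost identical to x0
x_late, eps_late = noise_image(x0, 999, alpha_bar, rng)  # almost pure noise
```

Conditioning the reverse process on a segmentation mask, as described in step 2, would be an extension of this basic scheme.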

References
    1. Kwon, Diana. "The secret lives of cells-as never seen before." Nature 598.7882 (2021):
        558-560.
    2. Moebel, Emmanuel, et al. "Deep learning improves macromolecule identification in 3D
        cellular cryo-electron tomograms." Nature methods 18.11 (2021): 1386-1394.
    3. Rice, Gavin, et al. "TomoTwin: generalized 3D localization of macromolecules in
        cryo-electron tomograms with structural data mining." Nature Methods (2023): 1-10.

Contacts
Prof. Thomas Lemmin
Institute of Biochemistry and Molecular Medicine
Bühlstrasse 28, 3012 Bern
(thomas.lemmin@unibe.ch)

Prof. Paolo Favaro
Institute of Computer Science
Neubrückstrasse 10, 3012 Bern
(paolo.favaro@unibe.ch)

Adding and removing multiple sclerosis lesions in imaging with diffusion networks

Level: Master

March 2023

Background: Multiple sclerosis lesions are the result of demyelination: they appear as dark spots on T1-weighted MRI imaging and as bright spots on FLAIR MRI imaging. Image analysis for MS patients requires both the accurate detection of new and enhancing lesions, and the assessment of atrophy via local thickness and/or volume changes in the cortex. Detection of new and growing lesions is possible using deep learning, but is made difficult by the relative lack of training data. Meanwhile, cortical morphometry can be affected by the presence of lesions, meaning that removing lesions prior to morphometry may make it more robust. Existing ‘lesion filling’ methods are rather crude, yielding unrealistic-appearing brains where the borders of the removed lesions are clearly visible.

Aim: Denoising diffusion networks are the current gold standard in MRI image generation [1]: we aim to leverage this technology to remove and add lesions to existing MRI images. This will allow us to create realistic synthetic MRI images for training and validating MS lesion segmentation algorithms, and for investigating the sensitivity of morphometry software to the presence of MS lesions at a variety of lesion load levels.

Materials and Methods: A large, annotated, heterogeneous dataset of MRI data from MS patients, as well as images of healthy controls without white matter lesions, will be available for developing the method. The student will work in a research group with a long track record in applying deep learning methods to neuroimaging data, as well as experience training denoising diffusion networks.

Nature of the Thesis:

Literature review: 10%

Replication of Blob Loss paper: 10%

Implementation of the sliding window metrics: 10%

Training on MS lesion segmentation task: 30%

Extension to other datasets: 20%

Results analysis: 20%

Fig. Results of an existing lesion filling algorithm, showing inadequate performance

Requirements:

Interest/Experience with image processing

Python programming knowledge (PyTorch bonus)

Interest in neuroimaging

Supervisor(s):

PD. Dr. Richard McKinley

Institutes: Diagnostic and Interventional Neuroradiology

Center for Artificial Intelligence in Medicine (CAIM), University of Bern

References: [1] Brain Imaging Generation with Latent Diffusion Models, Pinaya et al, Accepted in the Deep Generative Models workshop @ MICCAI 2022, https://arxiv.org/abs/2209.07162

Contact: PD Dr Richard McKinley, Support Centre for Advanced Neuroimaging (richard.mckinley@insel.ch)

Improving metrics and loss functions for targets with imbalanced size: sliding window Dice coefficient and loss.

Level: Master

March 2023

Background: The Dice coefficient is the most commonly used metric for segmentation quality in medical imaging, and a differentiable version of the coefficient is often used as a loss function, in particular for small target classes such as multiple sclerosis lesions. The Dice coefficient has the benefit that it is applicable in cases where the target class is in the minority (for example, when segmenting small lesions). However, if lesion sizes are mixed, the loss and the metric are biased towards performance on large lesions, causing smaller lesions to be missed and harming overall lesion detection. A recently proposed loss function (blob loss [1]) aims to combat this by treating each connected component of a lesion mask separately, and claims improvements over Dice loss on lesion detection scores in a variety of tasks.

Aim: The aim of this thesis is twofold. First, to benchmark blob loss against a simple, potentially superior loss for instance detection: sliding window Dice loss, in which the Dice loss is calculated over a sliding window across the area/volume of the medical image. Second, we will investigate whether a sliding window Dice coefficient is better correlated with lesion-wise detection metrics than the Dice coefficient and may serve as an alternative metric capturing both global and instance-wise detection.
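One possible reading of the sliding window Dice coefficient is sketched below in 2D. The window size, the stride, the choice to average window scores and the choice to skip windows with no foreground are all assumptions made for illustration; settling these design questions is part of the thesis.

```python
import numpy as np

def dice(pred, target, eps=1e-6):
    """Standard Dice coefficient between two binary (or soft) masks."""
    inter = np.sum(pred * target)
    return (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def sliding_window_dice(pred, target, window=8, stride=4, eps=1e-6):
    """Mean Dice over all windows that contain foreground in pred or target."""
    scores = []
    height, width = target.shape
    for i in range(0, height - window + 1, stride):
        for j in range(0, width - window + 1, stride):
            p = pred[i:i + window, j:j + window]
            t = target[i:i + window, j:j + window]
            if p.sum() + t.sum() == 0:
                continue  # assumption: empty windows are ignored
            scores.append(dice(p, t, eps))
    return float(np.mean(scores)) if scores else 1.0

target = np.zeros((32, 32))
target[0:16, 0:16] = 1.0    # one large lesion
target[26:28, 26:28] = 1.0  # one small lesion
pred = np.zeros((32, 32))
pred[0:16, 0:16] = 1.0      # the prediction misses the small lesion

global_score = dice(pred, target)
window_score = sliding_window_dice(pred, target, window=8, stride=8)
print(round(global_score, 3), round(window_score, 3))  # about 0.992 versus 0.8
```

In this toy case the global Dice barely registers the missed small lesion, while the windowed score drops noticeably, which is the behaviour the thesis sets out to test on real data.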

Materials and Methods: A large, annotated, heterogeneous dataset of MRI data from MS patients will be available for benchmarking the method, as well as our existing codebases for MS lesion segmentation. Extension of the method to other diseases and datasets (such as covered in the blob loss paper) will make the method more plausible for publication. The student will work alongside clinicians and engineers carrying out research in multiple sclerosis lesion segmentation, in particular in the context of our running project supported by the CAIM grant.

Nature of the Thesis:

Literature review: 10%

Replication of Blob Loss paper: 10%

Implementation of the sliding window metrics: 10%

Training on MS lesion segmentation task: 30%

Extension to other datasets: 20%

Results analysis: 20%

Fig. An annotated MS lesion case, showing the variety of lesion sizes

Requirements:

Interest/Experience with image processing

Python programming knowledge (PyTorch bonus)

Interest in neuroimaging

Supervisor(s):

PD. Dr. Richard McKinley

Institutes: Diagnostic and Interventional Neuroradiology

Center for Artificial Intelligence in Medicine (CAIM), University of Bern

References: [1] blob loss: instance imbalance aware loss functions for semantic segmentation, Kofler et al, https://arxiv.org/abs/2205.08209

Contact: PD Dr Richard McKinley, Support Centre for Advanced Neuroimaging (richard.mckinley@insel.ch)

Idempotent and partial skull-stripping in multispectral MRI imaging

Level: Master

March 2023

Background: Skull stripping (or brain extraction) refers to the masking of non-brain tissue from structural MRI imaging. Since 3D MRI sequences allow reconstruction of facial features, many data providers supply data only after skull-stripping, making this a vital tool in data sharing. Furthermore, skull-stripping is an important pre-processing step in many neuroimaging pipelines, even in the deep-learning era: while many methods could now operate on data with skull present, they have been trained only on skull-stripped data and therefore produce spurious results on data with the skull present.

High-quality skull-stripping algorithms based on deep learning are now widely available: the most prominent example is HD-BET [1]. A major downside of HD-BET is its behaviour on datasets to which skull-stripping has already been applied: in this case the algorithm falsely identifies brain tissue as skull and masks it. A skull-stripping algorithm F not exhibiting this behaviour would be idempotent: F(F(x)) = F(x) for any image x. Furthermore, legacy datasets from before the availability of high-quality skull-stripping algorithms may still contain images which have been inadequately skull-stripped: currently the only solution to improve the skull-stripping on this data is to go back to the original datasource or to manually correct the skull-stripping, which is time-consuming and prone to error.
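The idempotence property F(F(x)) = F(x) can be checked numerically on any candidate algorithm. The sketch below uses two toy 2D operators as stand-ins: `threshold_strip` is idempotent, while `peel_strip` removes one more layer of tissue on every call, mimicking the failure mode described above. Both operators and the toy image are hypothetical and unrelated to how HD-BET works internally.

```python
import numpy as np

def is_idempotent(strip, image, atol=1e-6):
    """Check F(F(x)) == F(x) for one image, up to a tolerance."""
    once = strip(image)
    twice = strip(once)
    return bool(np.allclose(once, twice, atol=atol))

def threshold_strip(image):
    """Toy idempotent operator: keep only pixels brighter than 0.5."""
    return np.where(image > 0.5, image, 0.0)

def peel_strip(image):
    """Toy non-idempotent operator: remove the outermost ring of nonzero pixels."""
    mask = image > 0
    interior = mask.copy()
    for axis in (0, 1):
        for shift in (1, -1):
            interior &= np.roll(mask, shift, axis=axis)
    return np.where(interior, image, 0.0)

image = np.zeros((8, 8))
image[1:7, 1:7] = 0.3  # "skull" ring
image[2:6, 2:6] = 0.9  # "brain"

print(is_idempotent(threshold_strip, image))  # True
print(is_idempotent(peel_strip, image))       # False
```

A check of this kind could serve as a unit test or as an additional consistency term when training a skull-stripping network.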

Aim: In this project, the student will develop an idempotent skull-stripping network which can also handle partially skull-stripped inputs. In the best case, the network will operate well on a large subset of the data we work with (e.g. structural MRI, diffusion-weighted MRI, perfusion-weighted MRI, susceptibility-weighted MRI, at a variety of field strengths) to maximize the future applicability of the network across the teams in our group.

Materials and Methods: Multiple datasets, both publicly available and internal (encompassing thousands of 3D volumes) will be available. Silver standard reference data for standard sequences at 1.5T and 3T can be generated using existing tools such as HD-BET: for other sequences and field strengths semi-supervised learning or methods improving robustness to domain shift may be employed. Robustness to partial skull-stripping may be induced by a combination of learning theory and model-based approaches.

Nature of the Thesis:

Literature review: 10%

Dataset curation: 10%

Idempotent skull-stripping model building: 30%

Modelling of partial skull-stripping: 10%

Extension of model to handle partial skull: 30%

Results analysis: 10%

Fig. An example of failed skull-stripping requiring manual correction

Requirements:

Interest/Experience with image processing

Python programming knowledge (PyTorch bonus)

Interest in neuroimaging

Supervisor(s):

PD. Dr. Richard McKinley

Institutes: Diagnostic and Interventional Neuroradiology

Center for Artificial Intelligence in Medicine (CAIM), University of Bern

References: [1] Isensee, F, Schell, M, Pflueger, I, et al. Automated brain extraction of multisequence MRI using artificial neural networks. Hum Brain Mapp. 2019; 40: 4952– 4964. https://doi.org/10.1002/hbm.24750

Contact: PD Dr Richard McKinley, Support Centre for Advanced Neuroimaging (richard.mckinley@insel.ch)

Automated leaf detection and leaf area estimation (for Arabidopsis thaliana)

Level: Master

Sept. 2022

Correlating plant phenotypes such as leaf area or number of leaves to the genotype (i.e. changes in DNA) is a common goal for plant breeders and molecular biologists. Such data can not only help to understand fundamental processes in nature, but also can help to improve ecotypes, e.g., to perform better under climate change, or reduce fertiliser input. However, collecting data for many plants is very time consuming and automated data acquisition is necessary.

The project aims at building a machine learning model to automatically detect plants in top-view images (see examples below), segment their leaves (see Fig C) and to estimate the leaf area. This information will then be used to determine the leaf area of different Arabidopsis ecotypes. The project will be carried out in collaboration with researchers of the Institute of Plant Sciences at the University of Bern. It will also involve the design and creation of a dataset of plant top-views with the corresponding annotation (provided by experts at the Institute of Plant Sciences).
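Once leaves are segmented, the area estimate itself is simple. The sketch below assumes an instance mask in which 0 is background and each positive integer identifies one leaf, and a known image scale in millimetres per pixel; both the mask layout and the scale value are hypothetical. It measures projected area in the top view, so curled or overlapping leaves would be underestimated.

```python
import numpy as np

def leaf_areas_mm2(instance_mask, mm_per_pixel):
    """Projected area per leaf; instance_mask uses 0 for background, k > 0 for leaf k."""
    ids, counts = np.unique(instance_mask[instance_mask > 0], return_counts=True)
    pixel_area = mm_per_pixel ** 2
    return {int(i): float(c * pixel_area) for i, c in zip(ids, counts)}

mask = np.zeros((6, 6), dtype=np.int64)
mask[0:2, 0:3] = 1  # leaf 1 covers 6 pixels
mask[4:6, 4:6] = 2  # leaf 2 covers 4 pixels

areas = leaf_areas_mm2(mask, mm_per_pixel=0.5)
total_area = sum(areas.values())
print(areas, total_area)
```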

Contact: Prof. Dr. Paolo Favaro (paolo.favaro@unibe.ch)

Master Projects at the ARTORG Center

Level: Master

Nov. 2021

The Gerontechnology and Rehabilitation group at the ARTORG Center for Biomedical Engineering is offering multiple MSc thesis projects to students who are interested in working with real patient data, artificial intelligence and machine learning algorithms. The goal of these projects is to transfer the findings to the clinic in order to solve today’s healthcare problems and thus to improve the quality of life of patients.

Assessment of Digital Biomarkers at Home by Radar. [PDF]

Comparison of Radar, Seismograph and Ballistocardiography to Monitor Sleep at Home. [PDF]

Sentimental Analysis in Speech. [PDF]

Contact: Dr. Stephan Gerber (stephan.gerber@artorg.unibe.ch)

Internship in Computational Imaging at Prophesee

Level: Master

Nov. 2021

A 6-month internship at Prophesee, Grenoble is offered to a talented Master student.

The topic of the internship is working on burst imaging following the work of Sam Hasinoff, and exploring ways to improve it using event-based vision.

Compensation to cover the expenses of living in Grenoble is offered. Only students who have the legal right to work in France can apply.

Anyone interested can send an email with the CV to Daniele Perrone (dperrone@prophesee.ai).

Using machine learning applied to wearables to predict mental health

Level: Master

June 2021

This Master’s project lies at the intersection of psychiatry and computer science and aims to use machine learning techniques to improve health. Using sensors to detect sleep and waking behavior has as-yet unexplored potential to reveal insights into health. In this study, we make use of a watch-like device, called an actigraph, which tracks motion to quantify sleep behavior and waking activity. Participants in the study are healthy and depressed adolescents who wear actigraphs for a year, during which time we query their mental health status monthly using online questionnaires. For this Master’s thesis we aim to make use of machine learning methods to predict mental health based on the data from the actigraph. The ability to predict mental health crises based on sleep and wake behavior would provide an opportunity for intervention, significantly impacting the lives of patients and their families. This Master’s thesis is a collaboration between Professor Paolo Favaro at the Institute of Computer Science (paolo.favaro@inf.unibe.ch) and Dr Leila Tarokh at the Universitäre Psychiatrische Dienste (UPD) (leila.tarokh@upd.unibe.ch). We are looking for a highly motivated individual interested in bridging disciplines.

Bachelor or Master Projects at the ARTORG Center

Level: Bachelor/Master

Feb. 2021

The Gerontechnology and Rehabilitation group at the ARTORG Center for Biomedical Engineering is offering multiple BSc and MSc thesis projects to students who are interested in working with real patient data, artificial intelligence and machine learning algorithms. The goal of these projects is to transfer the findings to the clinic in order to solve today’s healthcare problems and thus to improve the quality of life of patients.

Machine Learning Based Gait-Parameter Extraction by Using Simple Rangefinder Technology. [PDF]

Detection of Motion in Video Recordings [PDF]

Home-Monitoring of Elderly by Radar [PDF]

Gait feature detection in Parkinson's Disease [PDF]

Development of an arthroscopic training device using virtual reality [PDF]

Contact: Dr. Stephan Gerber (stephan.gerber@artorg.unibe.ch), Michael Single (michael.single@artorg.unibe.ch)

Dynamic Transformer

Level: Bachelor

Jan. 2021

Visual Transformers have obtained state-of-the-art classification accuracies [ViT, DeiT, T2TViT, BoTNet]. A mixture of experts can increase the capacity of a neural network by learning instance-dependent execution pathways through the network [MoE]. In this research project we aim to push transformers to their limit by combining their dynamic attention with MoEs. Compared to the Switch Transformer [Switch], we will use a much more efficient formulation of mixing [CondConv, DynamicConv], and we will apply this idea in the attention part of the transformer rather than in the fully connected layer.

Goals:

Input dependent attention kernel generation for better transformer layers.
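One way to picture input-dependent attention kernels is sketched below: the query, key and value projection matrices of a single attention head are formed as a gated mixture of a few expert matrices, with the gate computed from the mean-pooled input tokens. The softmax gate, the single head, the shapes and all names are assumptions made for this illustration and are not a specification of the method to be developed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_attention(x, experts_q, experts_k, experts_v, router):
    """Single-head self-attention with input-dependent projection matrices.

    x: (n_tokens, d); experts_*: (n_experts, d, d); router: (d, n_experts).
    """
    gate = softmax(x.mean(axis=0) @ router)      # one mixing weight per expert
    w_q = np.tensordot(gate, experts_q, axes=1)  # weights are mixed, so only one
    w_k = np.tensordot(gate, experts_k, axes=1)  # projection per input is computed
    w_v = np.tensordot(gate, experts_v, axes=1)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(x.shape[1]))
    return attn @ v, gate

rng = np.random.default_rng(0)
n_tokens, d, n_experts = 5, 4, 3
x = rng.standard_normal((n_tokens, d))
experts_q, experts_k, experts_v = (rng.standard_normal((n_experts, d, d)) for _ in range(3))
router = rng.standard_normal((d, n_experts))

out, gate = dynamic_attention(x, experts_q, experts_k, experts_v, router)
_, gate_shifted = dynamic_attention(x + 1.0, experts_q, experts_k, experts_v, router)
```

Because the expert weights are combined before the projection is applied, the cost per token stays close to that of ordinary attention regardless of the number of experts.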

Publication Opportunity: Dynamic Neural Networks Meets Computer Vision (a CVPR 2021 Workshop)

Extensions:

The same idea could be extended to other ViT/Transformer based models [DETR, SETR, LSTR, TrackFormer, BERT]

Related Papers:

Visual Transformers: Token-based Image Representation and Processing for Computer Vision [ViT]
DeiT: Data-efficient Image Transformers [DeiT]
Bottleneck Transformers for Visual Recognition [BoTNet]
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [T2TViT]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer [MoE]
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [Switch]
CondConv: Conditionally Parameterized Convolutions for Efficient Inference [CondConv]
Dynamic Convolution: Attention over Convolution Kernels [DynamicConv]
End-to-End Object Detection with Transformers [DETR]
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [SETR]
End-to-end Lane Shape Prediction with Transformers [LSTR]
TrackFormer: Multi-Object Tracking with Transformers [TrackFormer]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [BERT]

Contact: Sepehr Sameni

3D ViT

Level: Master

Jan. 2021

Visual Transformers have obtained state-of-the-art classification accuracies for 2D images [ViT, DeiT, T2TViT, BoTNet]. In this project, we aim to extend the same ideas to 3D data (videos), which requires a more efficient attention mechanism [Performer, Axial, Linformer]. To accelerate the training process, we could use the multigrid technique [Multigrid].
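A quick count shows why full attention becomes expensive for video. The sketch below compares the number of query-key pairs for full self-attention with the number for axial attention, where each token attends only along the time, height and width axes. The grid of 8 frames with 14 x 14 patches each is a hypothetical example.

```python
def attention_pairs(frames, height, width):
    """Query-key pairs for full versus axial self-attention on a T x H x W token grid."""
    n = frames * height * width
    full = n * n                            # every token attends to every token
    axial = n * (frames + height + width)   # every token attends along each axis
    return full, axial

full, axial = attention_pairs(8, 14, 14)
print(full, axial)  # 2458624 56448
```

For this grid, axial attention evaluates roughly 44 times fewer pairs, and the gap widens as the clip gets longer or the resolution increases.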

Goals:

Better video understanding by attention blocks.

Publication Opportunity: LOVEU (a CVPR workshop), Holistic Video Understanding (a CVPR workshop), ActivityNet (a CVPR workshop)

Related Papers:

Visual Transformers: Token-based Image Representation and Processing for Computer Vision [ViT]
DeiT: Data-efficient Image Transformers [DeiT]
Bottleneck Transformers for Visual Recognition [BoTNet]
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [T2TViT]
Rethinking Attention with Performers [Performer]
Axial Attention in Multidimensional Transformers [Axial]
Linformer: Self-Attention with Linear Complexity [Linformer]
A Multigrid Method for Efficiently Training Video Models [Multigrid]

Contact: Sepehr Sameni

iGIRAFFE

Level: Master

Jan. 2021

GIRAFFE is a newly introduced GAN that can generate scenes via composition with minimal supervision [GIRAFFE]. Generative methods can implicitly learn interpretable representations, as can be seen in GAN image interpretations [GANSpace, GanLatentDiscovery]. Decoding GIRAFFE could give us per-object interpretable representations that could be used for scene manipulation, data augmentation, scene understanding, semantic segmentation, pose estimation [iNeRF], and more.

In order to invert a GIRAFFE model, we will first train the generative model on the CLEVR and CompCars datasets, then add a decoder to the pipeline and train the resulting autoencoder. We can make the task easier by assuming that the number of objects in the scene and/or their positions are known.

Goals:

Scene Manipulation and Decomposition by Inverting the GIRAFFE

Publication Opportunity: DynaVis 2021 (a CVPR workshop on Dynamic Scene Reconstruction)

Related Papers:

GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields [GIRAFFE]
Neural Scene Graphs for Dynamic Scenes
pixelNeRF: Neural Radiance Fields from One or Few Images [pixelNeRF]
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis [NeRF]
Neural Volume Rendering: NeRF And Beyond
GANSpace: Discovering Interpretable GAN Controls [GANSpace]
Unsupervised Discovery of Interpretable Directions in the GAN Latent Space [GanLatentDiscovery]
Inverting Neural Radiance Fields for Pose Estimation [iNeRF]

Contact: Sepehr Sameni

Quantized ViT

Level: Master

Jan. 2021

Visual Transformers have obtained state-of-the-art classification accuracies [ViT, CLIP, DeiT], but the best ViT models are extremely compute-heavy, and running them is expensive even for inference alone (without backpropagation). Running transformers cheaply through quantization is not a new problem; it has already been tackled in NLP for BERT [BERT] [Q-BERT, Q8BERT, TernaryBERT, BinaryBERT]. In this project we will try to quantize pretrained ViT models.
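As a point of reference for what quantizing a pretrained model involves, here is a minimal sketch of symmetric per-tensor 8-bit quantization of a weight list. It is a generic baseline under stated assumptions (one scale per tensor, round-to-nearest, hypothetical function names), not the scheme used in any of the cited BERT papers.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Assumption: one scale for the whole tensor; an all-zero tensor gets scale 1.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical weight values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, max_err)
```

With round-to-nearest, the reconstruction error of each weight is bounded by half the scale, which is why tensors with a few large outliers tend to quantize poorly under a single per-tensor scale.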

Goals:

Quantizing ViT models for faster inference and smaller models without losing accuracy

Publication Opportunity: Binary Networks for Computer Vision 2021 (a CVPR workshop)

Extensions:

A fast pipeline for image inference with ViT will allow us to dig deep into the attention of ViT and analyze it. We might be able to prune some attention heads or replace them with static patterns (such as local convolutions or dilated patterns). We might even be able to replace the transformer with a Performer and increase the throughput further [Performer].
The same idea could be extended to other ViT-based models [DETR, SETR, LSTR, TrackFormer, CPTR, BoTNet, T2TViT].

Related Papers:

Learning Transferable Visual Models From Natural Language Supervision [CLIP]
Visual Transformers: Token-based Image Representation and Processing for Computer Vision [ViT]
DeiT: Data-efficient Image Transformers [DeiT]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [BERT]
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT [Q-BERT]
Q8BERT: Quantized 8Bit BERT [Q8BERT]
TernaryBERT: Distillation-aware Ultra-low Bit BERT [TernaryBERT]
BinaryBERT: Pushing the Limit of BERT Quantization [BinaryBERT]
Rethinking Attention with Performers [Performer]
End-to-End Object Detection with Transformers [DETR]
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [SETR]
End-to-end Lane Shape Prediction with Transformers [LSTR]
TrackFormer: Multi-Object Tracking with Transformers [TrackFormer]
CPTR: Full Transformer Network for Image Captioning [CPTR]
Bottleneck Transformers for Visual Recognition [BoTNet]
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [T2TViT]

Contact: Sepehr Sameni

Multimodal Contrastive Learning

Level: Bachelor

Jan. 2021

Recently, contrastive learning has gained a lot of attention for self-supervised image representation learning [SimCLR, MoCo]. Contrastive learning can be extended to multimodal data, such as videos (images and audio) [CMC, CoCLR]. Most contrastive methods require large batch sizes (or large memory pools), which makes them expensive to train. In this project we will use contrastive methods that do not depend on the batch size [SwAV, BYOL, SimSiam] to train multimodal representation extractors.

Our main goal is to compare the proposed method with the CMC baseline, so we will be working with STL10, ImageNet, UCF101, HMDB51, and NYU Depth-V2 datasets.

Inspired by recent work on smaller datasets [ConVIRT, CPD], we could speed up training by starting from two pretrained single-modality models and fine-tuning them with the proposed method.
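The batch-size-independent methods cited above compare two views of the same sample without using negatives. The sketch below computes a SimSiam-style symmetric negative cosine loss, treating a video embedding and an audio embedding of the same clip as the two views. The function names and the pairing of modalities are assumptions for illustration; the project description does not fix a specific loss.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def symmetric_neg_cosine_loss(
    p_video: Sequence[float],
    z_audio: Sequence[float],
    p_audio: Sequence[float],
    z_video: Sequence[float],
) -> float:
    """Loss value only: p_* are predictor outputs, z_* are encoder outputs.

    In SimSiam the z_* terms are treated as constants (stop-gradient) during
    backpropagation; plain Python has no autograd, so that is only noted here.
    """
    return -0.5 * (cosine(p_video, z_audio) + cosine(p_audio, z_video))


# Hypothetical 3-dimensional embeddings of one clip.
loss = symmetric_neg_cosine_loss([1.0, 0.0, 0.2], [0.9, 0.1, 0.1],
                                 [0.8, 0.0, 0.3], [1.0, 0.1, 0.2])
print(loss)
```

The loss depends only on the pair being compared, so its value does not change with the number of other samples in the batch; the minimum of -1 is reached when both prediction-target pairs are perfectly aligned.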

Goals:

Extending SwAV to multimodal datasets
Gaining a better understanding of BYOL

Publication Opportunity: MULA 2021 (a CVPR workshop on Multimodal Learning and Applications)

Extensions:

Most knowledge distillation methods for contrastive learners also use large batch sizes (or memory pools) [CRD, SEED]; the proposed method could be extended to knowledge distillation.
This idea could easily be extended to multiview learning. For example, two different networks could process the same input and be trained with contrastive learning, which may lead to better models [DeiT] by communicating inductive biases across models.

Related Papers:

Self-supervised Co-training for Video Representation Learning [CoCLR]
Learning Spatiotemporal Features via Video and Text Pair Discrimination [CPD]
Audio-Visual Instance Discrimination with Cross-Modal Agreement [AVID-CMA]
Self-Supervised Learning by Cross-Modal Audio-Video Clustering [XDC]
Contrastive Multiview Coding [CMC]
Learning Transferable Visual Models From Natural Language Supervision [CLIP]
Contrastive Learning of Medical Visual Representations from Paired Images and Text [ConVIRT]
A Simple Framework for Contrastive Learning of Visual Representations [SimCLR]
Momentum Contrast for Unsupervised Visual Representation Learning [MoCo]
Bootstrap your own latent: A new approach to self-supervised Learning [BYOL]
Exploring Simple Siamese Representation Learning [SimSiam]
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments [SwAV]
DeiT: Data-efficient Image Transformers [DeiT]
Contrastive Representation Distillation [CRD]
SEED: Self-supervised Distillation For Visual Representation [SEED]

Contact: Sepehr Sameni

Robustness of Neural Networks

Level: Bachelor/Master

Jan. 2021

Neural Networks have been found to achieve surprising performance in several tasks such as classification, detection and segmentation. However, they are also very sensitive to small (controlled) changes to the input. It has been shown that some changes to an image that are not visible to the naked eye may lead the network to output an incorrect label. This thesis will focus on studying recent progress in this area and aim to build a procedure for a trained network to self-assess its reliability in classification or one of the popular computer vision tasks.
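The sensitivity described above can be reproduced on a very small model. The following sketch applies a fast-gradient-sign style perturbation to a logistic-regression classifier: each input value moves by at most eps, yet the predicted label flips. The model, weights, and input are hypothetical values chosen for illustration, and a linear model stands in for a deep network.

```python
import math
from typing import List


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(w: List[float], b: float, x: List[float], y: int, eps: float) -> List[float]:
    """Move each input coordinate by eps in the direction that increases the loss."""
    p = predict(w, b, x)
    # For cross-entropy on a logistic model, d(loss)/d(x_i) = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


# Hypothetical model and input with true label 1.
w, b = [2.0, -3.0, 1.0], 0.0
x = [0.1, -0.05, 0.1]
x_adv = fgsm(w, b, x, y=1, eps=0.1)
print(predict(w, b, x), predict(w, b, x_adv))
```

A self-assessment procedure of the kind the thesis aims for could, for example, probe how small eps must be before the prediction changes, although the topic description leaves the choice of method open.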

Contact: Paolo Favaro

Masters projects at sitem center

Level: Master

Feb. 2020

The Personalised Medicine Research Group at the sitem Center for Translational Medicine and Biomedical Entrepreneurship is offering multiple MSc thesis projects to biomedical engineering MSc students that may also be of interest to computer science students.

Automated quantification of cartilage quality for hip treatment decision support. PDF

Automated quantification of massive rotator cuff tears from MRI. PDF

Deep learning-based segmentation and fat fraction analysis of the shoulder muscles using quantitative MRI. PDF

Unsupervised Domain Adaption for Cross-Modality Hip Joint Segmentation. PDF

Contact: Dr. Kate Gerber

Internships/Master thesis @ Chronocam

Level: Master

Oct. 2019

3-6 month internships on event-based computer vision. Chronocam is a rapidly growing startup developing event-based technology, with more than 15 PhDs working on problems like tracking, detection, classification, SLAM, etc. Event-based computer vision has the potential to solve many long-standing problems in traditional computer vision, and this is a super exciting time as this potential is becoming more and more tangible in many real-world applications. For next year we are looking for motivated Master and PhD students with good software engineering skills (C++ and/or Python), and preferably a good background in computer vision and deep learning. PhD internships will be more research focused and may lead to a publication.
For each intern we offer a compensation to cover the expenses of living in Paris.
List of some of the topics we want to explore:

Photo-realistic image synthesis and super-resolution from event-based data (PhD)
Self-supervised representation learning (PhD)
End-to-end Feature Learning for Event-based Data
Bio-inspired Filtering using Spiking Networks
On-the-fly Compression of Event-based Streams for Low-Power IoT Cameras
Tracking of Multiple Objects with a Dual-Frequency Tracker
Event-based Autofocus
Stabilizing an Event-based Stream using an IMU
Crowd Monitoring for Low-power IoT Cameras
Road Extraction from an Event-based Camera Mounted in a Car for Autonomous Driving
Sign detection from an Event-based Camera Mounted in a Car for Autonomous Driving
High-frequency Eye Tracking

Email with attached CV to Daniele Perrone at dperrone@chronocam.com.

Contact: Daniele Perrone

Object Detection in 3D Point Clouds

Martí Farré Farrús · June 2024

This thesis presents Quasi-OmniFastTrack, an improved version of the OmniMotion algorithm for long-term pixel tracking in videos. The key contribution is reducing the computational expense and training time of OmniMotion while maintaining comparable tracking performance. The main bottleneck in OmniMotion was identified to be the NeRF network used for 3D scene representation. Quasi-OmniFastTrack replaces this with a pre-trained depth estimation model, significantly reducing training time, based on the work introduced in OmniFastTrack, hence the name. The invertible neural network for mapping between local and canonical coordinates is retained, but optimized depths are used to lift 2D pixels to 3D. Experiments show that Quasi-OmniFastTrack reduces training time by over 50% compared to OmniMotion while achieving similar qualitative tracking results on sequences with occlusions. Performance degrades somewhat on fast-moving scenes. The ablation studies demonstrate the importance of optimizing the initial depth estimates during training. While not matching OmniMotion's robustness in all scenarios, Quasi-OmniFastTrack offers a compelling speed-accuracy tradeoff, enabling long-term tracking on more videos in practical timeframes. Future work on incorporating other modifications introduced in OmniFastTrack, like long-term semantic features, could further improve tracking consistency.

Exploration of Position Oriented Pre-training for Bidirectional Encoder Representations from Transformers

PDF

Merlin Streilein · Sept. 2023

This paper explores the importance of positional information learned during the pre-training of a large language model like BERT. We introduce a simple pre-training task, predicting the absolute position of each word in a given bag of words, and evaluate its impact on the performance of two downstream tasks: Named Entity Recognition and Sentiment Analysis. The challenge of the pre-training task is that all positional encoding has been removed, so the model is required to learn the relationship between the position of a word and its syntactic and semantic meaning in a sentence. In order to achieve more robust and more natural results, a simpler version of MLM as well as the Sinkhorn-Knopp algorithm are explored. The positional information learned during pre-training leads to a competitive understanding of language, which in turn leads to good results when transferred to downstream tasks. We further highlight some observations related to the setting of the pre-training task and evaluate the changes brought by the Sinkhorn-Knopp algorithm and simple Masked Language Modeling.
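For readers unfamiliar with the Sinkhorn-Knopp algorithm, the sketch below shows its core loop: alternately normalizing the rows and columns of a positive matrix until it is approximately doubly stochastic. Interpreting the matrix as word-to-position scores is an assumption made here for illustration; the abstract does not state how the thesis applies the algorithm.

```python
import math
from typing import List


def sinkhorn(scores: List[List[float]], n_iters: int = 50) -> List[List[float]]:
    """Turn a square score matrix into an approximately doubly stochastic one."""
    # Exponentiate so that every entry is strictly positive.
    m = [[math.exp(s) for s in row] for row in scores]
    size = len(m)
    for _ in range(n_iters):
        # Row step: make each row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: make each column sum to 1.
        col_sums = [sum(m[i][j] for i in range(size)) for j in range(size)]
        m = [[m[i][j] / col_sums[j] for j in range(size)] for i in range(size)]
    return m


# Hypothetical scores: row i = word i, column j = candidate position j.
scores = [[2.0, 0.5, 0.1],
          [0.3, 1.5, 0.2],
          [0.1, 0.4, 1.8]]
for row in sinkhorn(scores):
    print([round(v, 3) for v in row])
```

Compared with an independent softmax per word, the column normalization discourages several words from claiming the same position, which is one plausible reason to use it for a position prediction task.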

New Variables of Brain Morphometry: the Potential and Limitations of CNN Regression

Timo Blattner · Sept. 2022

The calculation of variables of brain morphology is computationally very expensive and time-consuming. A previous work showed the feasibility of extracting the variables directly from T1-weighted brain MRI images using a convolutional neural network. We used significantly more data and extended their model to a new set of neuromorphological variables, which could become interesting biomarkers in the future for the diagnosis of brain diseases. The model shows for nearly all subjects a less than 5% mean relative absolute error. This high relative accuracy can be attributed to the low morphological variance between subjects and the ability of the model to predict the cortical atrophy age trend. The model however fails to capture all the variance in the data and shows large regional differences. We attribute these limitations in part to the moderate to poor reliability of the ground truth generated by FreeSurfer. We further investigated the effects of training data size and model complexity on this regression task and found that the size of the dataset had a significant impact on performance, while deeper models did not perform better. Lack of interpretability and dependence on a silver ground truth are the main drawbacks of this direct regression approach.

Home Monitoring by Radar

PDF

Lars Ziegler · Sept. 2022

Alvaro Juan Lahiguera · Jan. 2019

Coma Outcome Prediction with Convolutional Neural Networks

Stefan Jonas · Oct. 2018

Automatic Correction of Self-Introduced Errors in Source Code

Sven Kellenberger · Aug. 2018

Neural Face Transfer: Training a Deep Neural Network to Face-Swap

PDF

Till Nikolaus Schnabel · July 2018

This thesis explores the field of artificial neural networks with realistic looking visual outputs. It aims at morphing face pictures of a specific identity to look like another individual by only modifying key features, such as eye color, while leaving identity-independent features unchanged. Prior works have covered the topic of symmetric translation between two specific domains but failed to optimize it on faces where only parts of the image may be changed. This work applies a face masking operation to the output at training time, which forces the image generator to preserve colors while altering the face, fitting it naturally inside the unmorphed surroundings. Various experiments are conducted including an ablation study on the final setting, decreasing the baseline identity switching performance from 81.7% to 75.8% whilst improving the average χ² color distance from 0.551 to 0.434. The provided code-based software gives users easy access to apply this neural face swap to images and videos of arbitrary crop and brings Computer Vision one step closer to replacing Computer Graphics in this specific area.

A Study of the Importance of Parts in the Deformable Parts Model

Sammer Puran · June 2017

Self-Similarity as a Meta Feature

Lucas Husi · April 2017

A Study of 3D Deformable Parts Models for Detection and Pose-Estimation

Simon Jenni · March 2015

Reconstructing Highly Folded Cortices: A Few-Shot Learning Approach to Investigate Universal Brain Folding

PDF

Timo Blattner · June 2025

Recently, it has been shown that all mammal brains fold in a similar fashion, following the same mechanical model of folding. However, cetaceans remain outliers, having a systematically more folded brain than expected. A current hypothesis suggests that this is due to the increase in ambient pressure on the brain when these species dive, but this remains to be shown. Reconstructing these cortical surfaces is extremely difficult due to their high degree of folding and has never been done accurately before. We present a novel cortical surface reconstruction method, based on few-shot learning from 2D expert manual tracings in each scan, to segment the full 3D image. From the segmentation, we reconstruct the white matter surface and displace it to the pial surface using a diffeomorphism. We successfully reconstruct 3 non-cetacean and 4 cetacean brains. We investigate the number of labeled slices needed to train a model that accurately reconstructs the cortical surface, and benchmark our method in humans. We show that these models can be used to label unseen scans of anatomically similar species, eliminating the need for manual labor. Our measurements support the validity of the pressure hypothesis. https://github.com/TimoBl/Few-Shot-Cortex

Accelerated Federated Learning on Client Silos with Label Noise: RHO Selection in Classification and Segmentation

Representation Learning using Semantic Distances

Markus Roth · May 2019

Zero-Shot Learning using Generative Adversarial Networks

Hamed Hemati · Dec. 2018

Dimensionality Reduction via CNNs - Learning the Distance between Images

Ioannis Glampedakis · Sept. 2018

Learning to Play Othello using Deep Reinforcement Learning and Self Play

Thomas Simon Steinmann · Sept. 2018

ABA-J Interactive Multi-Modality Tissue Section-to-Volume Alignment: A Brain Atlasing Toolkit for ImageJ

Felix Meyenhofer · March 2018

Learning Visual Odometry with Recurrent Neural Networks

PDF

Adrian Wälchli · Feb. 2018

In computer vision, Visual Odometry is the problem of recovering the camera motion from a video. It is related to Structure from Motion, the problem of reconstructing the 3D geometry from a collection of images. Decades of research in these areas have brought successful algorithms that are used in applications like autonomous navigation, motion capture, augmented reality and others. Despite the success of these prior works in real-world environments, their robustness is highly dependent on manual calibration and the magnitude of noise present in the images in the form of, e.g., non-Lambertian surfaces, dynamic motion and other forms of ambiguity. This thesis explores an alternative approach to the Visual Odometry problem via Deep Learning, that is, a specific form of machine learning with artificial neural networks. It describes and focuses on the implementation of a recent work that proposes the use of Recurrent Neural Networks to learn dependencies over time due to the sequential nature of the input. Together with a convolutional neural network that extracts motion features from the input stream, the recurrent part accumulates knowledge from the past to make camera pose estimations at each point in time. An analysis of the performance of this system is carried out on real and synthetic data. The evaluation covers several ways of training the network as well as the impact and limitations of the recurrent connection for Visual Odometry.
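To make the odometry setting concrete, the sketch below chains per-frame relative motions into a camera trajectory. It is a planar simplification (x, y, heading) with hypothetical names, and it assumes that each estimate is a motion expressed in the current camera frame; the abstract does not specify the pose parameterization used in the thesis.

```python
import math
from typing import List, Tuple

Pose = Tuple[float, float, float]  # x, y, heading in radians


def compose(pose: Pose, delta: Pose) -> Pose:
    """Apply a relative motion, given in the current camera frame, to a pose."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    return (
        x + math.cos(theta) * dx - math.sin(theta) * dy,
        y + math.sin(theta) * dx + math.cos(theta) * dy,
        theta + dtheta,
    )


def integrate(deltas: List[Pose]) -> List[Pose]:
    """Accumulate relative motions into a trajectory starting at the origin."""
    poses: List[Pose] = [(0.0, 0.0, 0.0)]
    for delta in deltas:
        poses.append(compose(poses[-1], delta))
    return poses


# Hypothetical estimates: move one unit forward, then turn left by 90 degrees.
trajectory = integrate([(1.0, 0.0, math.pi / 2)] * 4)
for pose in trajectory:
    print(tuple(round(v, 3) for v in pose))
```

Because every pose is built from all previous estimates, small per-frame errors accumulate as drift, which is one reason why modeling temporal dependencies is attractive for this problem.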

Crime location and timing prediction

Bernard Swart · Jan. 2018

From Cartoons to Real Images: An Approach to Unsupervised Visual Representation Learning

Simon Jenni · Feb. 2017

Automatic and Large-Scale Assessment of Fluid in Retinal OCT Volume

Nina Mujkanovic · Dec. 2016

Segmentation in 3D using eye-tracking technology

Michele Wyss · July 2016

Accurate Scale Thresholding via Logarithmic Total Variation Prior

PDF

Thoma Papadhimitri · June 2014

This thesis investigates the problem of 3D reconstruction of a scene from 2D images. In particular, we focus on photometric stereo, which is a technique that computes the 3D geometry from at least three images taken from the same viewpoint and under different illumination conditions. When the illumination is unknown (uncalibrated photometric stereo) the problem is ambiguous: different combinations of geometry and illumination can generate the same images. First, we solve the ambiguity by exploiting the Lambertian reflectance maxima. These are points defined on curved surfaces where the normals are parallel to the light direction. Then, we propose a solution that can be computed in closed form and thus very efficiently. Our algorithm is also very robust and always yields the same estimate regardless of the initial ambiguity. We validate our method on real-world experiments and achieve state-of-the-art results. In this thesis we also solve for the first time the uncalibrated photometric stereo problem under the perspective projection model. We show that, unlike in the orthographic case, one can uniquely reconstruct the normals of the object and the lights given only the input images and the camera calibration (focal length and image center). We also propose a very efficient algorithm, which we validate on synthetic and real-world experiments, and show that the proposed technique is a generalization of the orthographic case. Finally, we investigate the uncalibrated photometric stereo problem in the case where the lights are distributed near the scene. In this case we propose an alternating minimization technique which converges quickly and overcomes the limitations of prior work that assumes distant illumination. We show experimentally that adopting a near-light model for real-world scenes yields very accurate reconstructions.
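As background, the sketch below solves the simpler calibrated case for a single pixel: with three known distant light directions and a Lambertian surface, the intensities satisfy I = L g with g = albedo * normal, so g follows from a 3 x 3 linear solve. This is not the thesis's uncalibrated method; the function names and the light and intensity values are assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def det3(m: Matrix) -> float:
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a: Matrix, b: List[float]) -> List[float]:
    """Solve a 3 x 3 linear system with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("light directions must be linearly independent")
    solution = []
    for j in range(3):
        # Replace column j of a with the right-hand side b.
        a_j = [[b[i] if k == j else a[i][k] for k in range(3)] for i in range(3)]
        solution.append(det3(a_j) / d)
    return solution


def normal_and_albedo(lights: Matrix, intensities: List[float]) -> Tuple[List[float], float]:
    """Recover the unit normal and albedo of one pixel from three measurements."""
    g = solve3(lights, intensities)
    albedo = math.sqrt(sum(v * v for v in g))
    return [v / albedo for v in g], albedo


# Hypothetical unit light directions (one per row) and observed intensities.
lights = [[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.6, 0.8]]
normal, albedo = normal_and_albedo(lights, [0.4, 0.5, 0.32])
print(normal, albedo)
```

The ambiguity of the uncalibrated case is visible in the same equation: for any invertible 3 x 3 matrix A, the pair (L A⁻¹, A g) produces exactly the same intensities as (L, g), so additional constraints are needed to pin down the lights and normals.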