A list of completed theses and new thesis topics from the Computer Vision Group.
Are you about to start a BSc or MSc thesis? Please read our instructions for preparing and delivering your work.
Below we list possible thesis topics for Bachelor and Master students in the areas of Computer Vision, Machine Learning, Deep Learning and Pattern Recognition. The project descriptions leave plenty of room for your own ideas. If you would like to discuss a topic in detail, please contact the supervisor listed below and Prof. Paolo Favaro to schedule a meeting. Note that for MSc students in Computer Science it is required that the official advisor is a professor in CS.
This Master’s project lies at the intersection of psychiatry and computer science and aims to use machine learning techniques to improve health. Using sensors to detect sleep and waking behavior has as yet unexplored potential to reveal insights into health. In this study, we make use of a watch-like device, called an actigraph, which tracks motion to quantify sleep behavior and waking activity. The participants are healthy and depressed adolescents who wear actigraphs for a year, during which time we query their mental health status monthly using online questionnaires. For this Master’s thesis we aim to use machine learning methods to predict mental health based on the data from the actigraph. The ability to predict mental health crises based on sleep and wake behavior would provide an opportunity for intervention, significantly impacting the lives of patients and their families. This Master’s thesis is a collaboration between Professor Paolo Favaro at the Institute of Computer Science (firstname.lastname@example.org) and Dr Leila Tarokh at the Universitäre Psychiatrische Dienste (UPD) (email@example.com). We are looking for a highly motivated individual interested in bridging disciplines.
The Gerontechnology and Rehabilitation group at the ARTORG Center for Biomedical Engineering is offering multiple BSc and MSc thesis projects to students who are interested in working with real patient data, artificial intelligence and machine learning algorithms. The goal of these projects is to transfer the findings to the clinic in order to solve today’s healthcare problems and thus improve the quality of life of patients.
- Machine Learning Based Gait-Parameter Extraction by Using Simple Rangefinder Technology. [PDF]
- Detection of Motion in Video Recordings [PDF]
- Home-Monitoring of Elderly by Radar [PDF]
- Gait feature detection in Parkinson's Disease [PDF]
- Development of an arthroscopic training device using virtual reality [PDF]
Visual Transformers have obtained state-of-the-art classification accuracies [ViT, DeiT, T2T, BoTNet]. Mixture of experts could be used to increase the capacity of a neural network by learning instance-dependent execution pathways in a network [MoE]. In this research project we aim to push transformers to their limit and combine their dynamic attention with MoEs. Compared to Switch Transformer [Switch], we will use a much more efficient formulation of mixing [CondConv, DynamicConv], and we will apply this idea in the attention part of the transformer rather than in the fully connected layer.
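As a rough illustration of the mixing idea, the sketch below follows a CondConv-style formulation: routing weights computed from the input blend several expert weight matrices into one per-example matrix, which is then applied with a single matrix product. Applying this to an attention projection, the mean-pooled routing input, and all names (`mixed_projection`, `experts`, `routing`) are assumptions made for illustration, not the project's actual design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mixed_projection(x, experts, routing):
    """Project tokens with a per-example mixture of expert matrices.

    x:       (batch, tokens, dim) input tokens
    experts: (n_experts, dim, dim_out) expert projection matrices
    routing: (dim, n_experts) routing matrix (hypothetical parameter)
    """
    pooled = x.mean(axis=1)                        # (batch, dim)
    alpha = sigmoid(pooled @ routing)              # (batch, n_experts)
    # Blend the expert weights first, then apply one projection per example.
    w = np.einsum("be,eio->bio", alpha, experts)   # (batch, dim, dim_out)
    return np.einsum("bti,bio->bto", x, w), alpha

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))
experts = rng.normal(size=(4, 8, 8))
routing = rng.normal(size=(8, 4))
q, alpha = mixed_projection(x, experts, routing)
print(q.shape, alpha.shape)
```

Because the experts are blended before the projection, the cost per example is close to that of a single projection, whereas running every expert separately and mixing the outputs would cost roughly `n_experts` times as much.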
Publication Opportunity: Dynamic Neural Networks Meets Computer Vision (a CVPR 2021 Workshop)
Contact: Sepehr Sameni
Visual Transformers have obtained state-of-the-art classification accuracies for 2D images [ViT, DeiT, T2T, BoTNet]. In this project, we aim to extend the same ideas to 3D data (videos), which requires a more efficient attention mechanism [Performer, Axial, Linformer]. To accelerate the training process, we could use the [Multigrid] technique.
Contact: Sepehr Sameni
GIRAFFE is a newly introduced GAN that can generate scenes via composition with minimal supervision [GIRAFFE]. Generative methods can implicitly learn interpretable representations, as can be seen in GAN image interpretations [GANSpace, GanLatentDiscovery]. Decoding GIRAFFE could give us per-object interpretable representations that could be used for scene manipulation, data augmentation, scene understanding, semantic segmentation, pose estimation [iNeRF], and more.
In order to invert a GIRAFFE model, we will first train the generative model on the Clevr and CompCars datasets, then add a decoder to the pipeline and train the resulting autoencoder. We can make the task easier by knowing the number of objects in the scene and/or knowing their positions.
Scene Manipulation and Decomposition by Inverting the GIRAFFE
Publication Opportunity: DynaVis 2021 (a CVPR workshop on Dynamic Scene Reconstruction)
Contact: Sepehr Sameni
Visual Transformers have obtained state of the art classification accuracies [ViT, CLIP, DeiT], but the best ViT models are extremely compute heavy and running them even only for inference (not doing backpropagation) is expensive. Running transformers cheaply by quantization is not a new problem and it has been tackled before for BERT [BERT] in NLP [Q-BERT, Q8BERT, TernaryBERT, BinaryBERT]. In this project we will be trying to quantize pretrained ViT models.
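To make the idea of quantizing pretrained weights concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization applied to a random weight matrix. This is only one simple scheme, chosen for illustration; the function names and the random "weights" are assumptions, and the methods cited above use more elaborate strategies.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of a float array to int8."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# Stand-in for one pretrained weight matrix (hypothetical values).
w = rng.normal(scale=0.02, size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.max(np.abs(w - w_hat)))
print(q.nbytes, w.nbytes, max_err)
```

The int8 tensor takes a quarter of the memory of the float32 original, and the rounding error per weight is bounded by about half the scale.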
Quantizing ViT models for faster inference and smaller models without losing accuracy
Publication Opportunity: Binary Networks for Computer Vision 2021 (a CVPR workshop)
Contact: Sepehr Sameni
Recently, contrastive learning has gained a lot of attention for self-supervised image representation learning [SimCLR, MoCo]. Contrastive learning could be extended to multimodal data, like videos (images and audio) [CMC, CoCLR]. Most contrastive methods require large batch sizes (or large memory pools), which makes them expensive to train. In this project we are going to use contrastive methods that do not depend on the batch size [SwAV, BYOL, SimSiam] to train multimodal representation extractors.
Our main goal is to compare the proposed method with the CMC baseline, so we will be working with STL10, ImageNet, UCF101, HMDB51, and NYU Depth-V2 datasets.
Inspired by the recent works on smaller datasets [ConVIRT, CPD], to accelerate the training speed, we could start with two pretrained single-modal models and finetune them with the proposed method.
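One reason a method such as SimSiam does not need large batches is that its loss uses only positive pairs, so no negatives have to be drawn from the batch. The sketch below shows a symmetrized negative cosine similarity of this kind. Pairing an image branch with an audio branch, and the names `p_img`, `z_img`, `p_aud`, `z_aud` (predictor outputs and projections), are assumptions for illustration rather than the proposed method.

```python
import numpy as np

def neg_cosine(p, z):
    """Mean negative cosine similarity between rows of p and z."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -float(np.mean(np.sum(p * z, axis=1)))

def symmetric_loss(p_img, z_img, p_aud, z_aud):
    """Symmetrized positive-pair loss across two modalities.

    In a training framework, z_img and z_aud would be treated as constants
    (stop-gradient); NumPy has no gradients, so this only computes the value.
    """
    return 0.5 * neg_cosine(p_img, z_aud) + 0.5 * neg_cosine(p_aud, z_img)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
aligned = symmetric_loss(z, z, z, z)        # perfectly aligned embeddings
opposed = symmetric_loss(z, z, -z, -z)      # perfectly opposed embeddings
print(aligned, opposed)
```

The loss is an average over samples, so its value does not depend on how many other samples share the batch.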
Publication Opportunity: MULA 2021 (a CVPR workshop on Multimodal Learning and Applications)
Contact: Sepehr Sameni
Neural Networks have been found to achieve surprising performance in several tasks such as classification, detection and segmentation. However, they are also very sensitive to small (controlled) changes to the input. It has been shown that some changes to an image that are not visible to the naked eye may lead the network to output an incorrect label. This thesis will focus on studying recent progress in this area and aim to build a procedure for a trained network to self-assess its reliability in classification or one of the popular computer vision tasks.
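The sensitivity described above can be illustrated on a toy linear classifier with a single step of the fast gradient sign method (FGSM). The classifier weights, the input, and the step size `eps` are made-up values, and a real study would target a trained deep network instead.

```python
import numpy as np

def fgsm_linear(x, y, w, b, eps):
    """One FGSM step against a logistic-regression classifier.

    y is +1 or -1 and the loss is log(1 + exp(-y * (w @ x + b))).
    """
    margin = y * (w @ x + b)
    grad_x = -y * w / (1.0 + np.exp(margin))   # gradient of the loss w.r.t. x
    return x + eps * np.sign(grad_x)

w = np.array([1.0, -2.0, 0.5])   # hypothetical classifier weights
b = 0.0
x = np.array([0.3, -0.1, 0.2])   # classified as +1 since w @ x + b > 0
x_adv = fgsm_linear(x, y=1, w=w, b=b, eps=0.2)
print(w @ x + b, w @ x_adv + b)
```

Each input coordinate moves by at most `eps`, yet the sign of the classifier score changes, which is the behaviour a self-assessment procedure would need to detect.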
Contact: Paolo Favaro
The Personalised Medicine Research Group at the sitem Center for Translational Medicine and Biomedical Entrepreneurship is offering multiple MSc thesis projects to biomedical engineering MSc students that may also be of interest to computer science students.
- Automated quantification of cartilage quality for hip treatment decision support. PDF
- Automated quantification of massive rotator cuff tears from MRI. PDF
- Deep learning-based segmentation and fat fraction analysis of the shoulder muscles using quantitative MRI. PDF
- Unsupervised Domain Adaption for Cross-Modality Hip Joint Segmentation. PDF
Contact: Dr. Kate Gerber
3-6 month internships on event-based computer vision. Chronocam is a rapidly growing startup developing event-based technology, with more than 15 PhDs working on problems like tracking, detection, classification, SLAM, etc. Event-based computer vision has the potential to solve many long-standing problems in traditional computer vision, and this is a super exciting time as this potential is becoming more and more tangible in many real-world applications. For next year we are looking for motivated Master and PhD students with good software engineering skills (C++ and/or Python), and preferably a good computer vision and deep learning background. PhD internships will be more research focused and may lead to a publication.
For each intern we offer a compensation to cover the expenses of living in Paris.
List of some of the topics we want to explore:
Email with attached CV to Daniele Perrone at firstname.lastname@example.org.
Contact: Daniele Perrone
Today we have many 3D scanning techniques that allow us to capture the shape and appearance of objects. It is easier than ever to scan real 3D objects and transform them into a digital model for further processing, such as modeling, rendering or animation. However, the output of a 3D scanner is often a raw point cloud with little to no annotations. The unstructured nature of the point cloud representation makes it difficult to process, e.g. for surface reconstruction. One application is the detection and segmentation of an object of interest.
In this project, the student is challenged to design a system that takes a point cloud (a 3D scan) as input and outputs the names of objects contained in the scan. This output can then be used to eliminate outliers or points that belong to the background. The approach involves collecting a large dataset of 3D scans and training a neural network on it.
Contact: Adrian Wälchli
A photograph accurately captures the world in a moment of time and from a specific perspective. Since it is a projection of the 3D space to a 2D image plane, the depth information is lost. Is it possible to restore it, given only a single photograph? In general, the answer is no. This problem is ill-posed, meaning that many different plausible depth maps exist, and there is no way of telling which one is the correct one.
However, if we cover one of our eyes, we are still able to recognize objects and estimate how far away they are. This motivates the exploration of an approach where prior knowledge can be leveraged to reduce the ill-posedness of the problem. Such a prior could be learned by a deep neural network, trained with many images and depth maps.
Deblurring finds many applications in our everyday life. It is particularly useful when taking pictures on handheld devices (e.g. smartphones) where camera shake can degrade important details. Therefore, it is desired to have a good deblurring algorithm implemented directly in the device.
In this project, the student will implement and optimize a state-of-the-art deblurring method based on a deep neural network for deployment on mobile phones (Android).
The goal is to reduce the number of network weights in order to reduce the memory footprint while preserving the quality of the deblurred images. The result will be a camera app that automatically deblurs the pictures, giving the user a choice of keeping the original or the deblurred image.
If an object in front of the camera or the camera itself moves while the aperture is open, the region of motion becomes blurred because the incoming light is accumulated in different positions across the sensor. If there is camera motion, there is also parallax. Thus, a motion blurred image contains depth information.
In this project, the student will tackle the problem of recovering a depth map from a motion-blurred image. This includes the collection of a large dataset of blurred and sharp images or videos using a pair or triplet of GoPro action cameras. Two cameras will be used in stereo to estimate the depth map, and the third captures the blurred frames. This data is then used to train a convolutional neural network that will predict the depth map from the blurry image.
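For the stereo part of such a setup, depth follows from disparity through the standard relation depth = focal length × baseline / disparity, which holds for a rectified camera pair. The sketch below applies it to a small disparity map; the focal length, baseline and disparity values are made-up numbers, not properties of the cameras mentioned above.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert a disparity map (pixels) to depth (metres) for a rectified pair.

    Pixels with non-positive disparity are assigned infinite depth.
    """
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity_px.shape, np.inf)
    valid = disparity_px > 0
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

# Hypothetical values: 1000 px focal length, 10 cm baseline.
depth = disparity_to_depth([[10.0, 20.0], [0.0, 50.0]], focal_px=1000.0, baseline_m=0.1)
print(depth)
```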
The idea of this project is that we have two types of neural networks that work together: There is one network A that assigns images to k clusters and k (simple) networks of type B perform a self-supervised task on those clusters. The goal of all the networks is to make the k networks of type B perform well on the task. The assumption is that clustering in semantically similar groups will help the networks of type B to perform well. This could be done on the MNIST dataset with B being linear classifiers and the task being rotation prediction.
The student designs a data augmentation network that transforms training images in such a way that image realism is preserved (e.g. with a constrained spatial transformer network) and the transformed images are more difficult to classify (trained via adversarial loss against an image classifier). The model will be evaluated for different data settings (especially in the low data regime), for example on the MNIST and CIFAR datasets.
People with sensory impairment (hearing, speech, vision) depend heavily on assistive technologies to communicate and navigate in everyday life. The mass production of media content today makes it impossible to manually translate everything into a common language for assistive technologies, e.g. captions or sign language.
In this project, the student employs a neural network to learn a representation for lip-movement in videos in an unsupervised fashion, possibly with an encoder-decoder structure where the decoder reconstructs the audio signal. This requires collecting a large dataset of videos (e.g. from YouTube) of speakers or conversations where lip movement is visible. The outcome will be a neural network that learns an audio-visual representation of lip movement in videos, which can then be leveraged to generate captions for hearing impaired persons.
Satellite images have many applications, e.g. in meteorology, geography, education, cartography and warfare. They are an accurate and detailed depiction of the surface of the earth from above. Although it is relatively simple to collect many satellite images in an automated way, challenges arise when processing them for use in navigation and cartography.
The idea of this project is to automatically convert an arbitrary satellite image, e.g. of a city, to a map of simple 2D shapes (streets, houses, forests) and label them with colors (semantic segmentation). The student will collect a dataset of satellite images and topological maps and train a deep neural network that learns to map from one domain to the other. The data could be obtained from a Google Maps database or similar.
Arthroscopy consists of challenging tasks and requires skills that, even today, young surgeons still train directly during surgery. Existing simulators are expensive and rarely available. With the growing potential of (head-mounted) virtual reality (VR) devices for simulation and their applicability in the medical context, these devices have become a promising alternative that would be orders of magnitude cheaper and could be made widely available. Building a VR-based training device for arthroscopy is the overall aim of our project, as this would be of great benefit and might even be applicable to other minimally invasive surgery (MIS). This thesis marks a first step of the project, focusing on exploring and comparing well-known algorithms in a multi-view stereo (MVS) based 3D reconstruction with respect to imagery acquired by an arthroscopic camera. Alongside this reconstruction, we aim to obtain essential measures to compare the VR environment to the real world, as validation of the realism of future VR tasks. We evaluate 3 different feature extraction algorithms with 3 different matching techniques and 2 different algorithms for the estimation of the fundamental (F) matrix. These 18 setups are evaluated with a reconstruction pipeline embedded in a Jupyter notebook, implemented in Python and based on common computer vision libraries, and compared with imagery generated with a mobile phone as well as with the reconstruction results of the state-of-the-art (SOTA) structure-from-motion (SfM) software packages COLMAP and Multi-View Environment (MVE). Our comparative analysis reveals the challenges posed by the heavy distortion, fish-eye shape and weak image quality of arthroscopic imagery, as all results are substantially worse using this data. However, there are huge differences between the setups.
Scale Invariant Feature Transform (SIFT) and Oriented FAST Rotated BRIEF (ORB) in combination with k-Nearest Neighbour (kNN) matching and Least Median of Squares (LMedS) present the most promising results. Overall, the 3D reconstruction pipeline is a useful tool to foster the process of gaining measurements from the arthroscopic exploration device and to complement the comparative research in this context.
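The kNN matching step can be sketched in plain NumPy as follows: for each descriptor in one image, find its two nearest neighbours in the other image and keep the match only if the best one is clearly closer than the second best. The ratio test, the threshold of 0.75, the Euclidean distance (suited to float descriptors such as SIFT; binary descriptors such as ORB are normally compared with the Hamming distance) and the random descriptors are all assumptions for illustration, not details taken from the thesis.

```python
import numpy as np

def knn_ratio_match(desc_a, desc_b, ratio=0.75):
    """Match descriptors with 2-nearest-neighbour search and a ratio test.

    desc_a: (n_a, d) descriptors from image A
    desc_b: (n_b, d) descriptors from image B, with n_b >= 2
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    # Pairwise Euclidean distances, shape (n_a, n_b).
    dist = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)[:, :2]
    rows = np.arange(len(desc_a))
    best = dist[rows, order[:, 0]]
    second = dist[rows, order[:, 1]]
    keep = best < ratio * second
    return [(int(i), int(order[i, 0])) for i in np.flatnonzero(keep)]

rng = np.random.default_rng(0)
desc_b = rng.normal(size=(20, 32))
# Descriptors in A are noisy copies of rows 3, 7 and 11 of B.
desc_a = desc_b[[3, 7, 11]] + 0.01 * rng.normal(size=(3, 32))
matches = knn_ratio_match(desc_a, desc_b)
print(matches)
```

The surviving matches would then be passed to a robust estimator such as LMedS to fit the fundamental matrix.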
In recent years deep convolutional neural networks have made a lot of progress. Training such a network requires a lot of data, and supervised learning algorithms additionally require the data to be labeled. Labeling data takes a lot of human work, which costs time and money. To avoid these inconveniences we would like to find systems that do not need labeled data, i.e. unsupervised learning algorithms. This is why unsupervised algorithms are important, even though their results are not yet on the same qualitative level as those of supervised algorithms. In this thesis we discuss one such system and compare the results to other papers. A deep convolutional neural network is trained to recognize the rotations that have been applied to a picture. We take a large number of images and apply some simple rotations, and the task of the network is to discover in which direction each image has been rotated. The data does not need to be labeled with a category or anything else. As long as all the pictures are upside down we hope to find some high dimensional patterns for the network to learn.
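The data preparation for such a rotation task can be sketched in a few lines: every image is rotated by a known angle and the angle index serves as a free label. Using exactly the four multiples of 90 degrees is an assumption here (the abstract only says "simple rotations"), and the random array stands in for real images.

```python
import numpy as np

def make_rotation_batch(images):
    """Create rotated copies of square images together with rotation labels.

    images: (n, h, w, c) array with h == w
    Returns (4n, h, w, c) rotated images and (4n,) labels, where label k
    means the image was rotated by k * 90 degrees.
    """
    rotated, labels = [], []
    for k in range(4):
        rotated.append(np.rot90(images, k=k, axes=(1, 2)))
        labels.append(np.full(len(images), k))
    return np.concatenate(rotated), np.concatenate(labels)

rng = np.random.default_rng(0)
images = rng.random(size=(3, 8, 8, 3))   # placeholder for unlabeled images
x, y = make_rotation_batch(images)
print(x.shape, y)
```

A classifier trained to predict `y` from `x` never sees a human-provided label.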
This thesis explores the prospect of artificial neural networks for image processing tasks. More specifically, it aims to achieve the goal of stitching multiple overlapping images to form a bigger, panoramic picture. Until now, this task has been approached solely with “classical”, hardcoded algorithms, while deep learning has at most been used for specific subtasks. This thesis introduces a novel end-to-end neural network approach to image stitching called StitchNet, which uses a pre-trained autoencoder and deep convolutional networks. In addition to presenting several new datasets for the task of supervised image stitching, each with 120’000 training and 5’000 validation samples, this thesis also conducts various experiments with different kinds of existing networks designed for image superresolution and image segmentation and adapted to the task of image stitching. StitchNet outperforms most of the adapted networks in both quantitative and qualitative results.
The idea of inferring the emotional state of a subject by looking at their face is nothing new. Neither is the idea of automating this process using computers. Researchers used to computationally extract handcrafted features from face images that had proven themselves to be effective and then used machine learning techniques to classify the facial expressions using these features. Recently, there has been a trend towards using deep learning and especially Convolutional Neural Networks (CNNs) for the classification of these facial expressions. Researchers were able to achieve good results on images that were taken in laboratories under the same or at least similar conditions. However, these models do not perform very well on more arbitrary face images with different head poses and illumination. This thesis aims to show the challenges of Facial Expression Recognition (FER) in this wild setting. It presents the currently used datasets and the present state-of-the-art results on one of the biggest facial expression datasets currently available. The contributions of this thesis are twofold. Firstly, I analyze three famous neural network architectures and their effectiveness on the classification of facial expressions. Secondly, I present two modifications of one of these networks that lead to the proposed STN-COV model. While this model does not outperform all of the current state-of-the-art models, it does beat several of them.
This work covers a new approach to 3D reconstruction. In traditional 3D reconstruction one uses multiple images of the same object to calculate a 3D model by taking information gained from the differences between the images, like camera position, illumination of the images, rotation of the object and so on, to compute a point cloud representing the object. The characteristic trait shared by all these approaches is that one can change almost everything about the image, but it is not possible to change the object itself, because one needs to find correspondences between the images. To be able to use different instances of the same object, we used a 3D DPM model that can find different parts of an object in an image, thereby detecting the correspondences between the different pictures, which we can then use to calculate the 3D model. To put this theory into practice, we gave a 3D DPM model, which was trained to detect cars, pictures of different car brands, where no pair of images showed the same vehicle, and used the detected correspondences and the Factorization Method to compute the 3D point cloud. This technique leads to a completely new approach in 3D reconstruction, because changing the object itself was never done before.
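The core of a factorization approach can be sketched as a rank-3 decomposition of the centered measurement matrix, in the style of the Tomasi-Kanade method under an affine camera model. Whether the thesis used exactly this variant is an assumption; the sketch also stops before the metric upgrade, so motion and shape are recovered only up to an affine ambiguity. The synthetic data and the name `factorize` are illustrative.

```python
import numpy as np

def factorize(W):
    """Rank-3 factorization of a measurement matrix.

    W: (2F, P) image coordinates of P corresponding points in F views.
    Returns motion M (2F, 3) and shape S (3, P) with M @ S approximating
    the row-centered W.
    """
    W_centered = W - W.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(W_centered, full_matrices=False)
    root = np.sqrt(s[:3])
    M = U[:, :3] * root          # scale the first three left singular vectors
    S = root[:, None] * Vt[:3]   # scale the first three right singular vectors
    return M, S

# Synthetic noise-free data: 4 affine views of 20 points.
rng = np.random.default_rng(0)
S_true = rng.normal(size=(3, 20))
M_true = rng.normal(size=(8, 3))
t_true = rng.normal(size=(8, 1))
W = M_true @ S_true + t_true
M, S = factorize(W)
residual = float(np.max(np.abs(M @ S - (W - W.mean(axis=1, keepdims=True)))))
print(M.shape, S.shape, residual)
```

With detected part correspondences in place of tracked points, the columns of `W` would come from different object instances rather than the same physical object.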
This thesis explores the field of artificial neural networks with realistic looking visual outputs. It aims at morphing face pictures of a specific identity to look like another individual by only modifying key features, such as eye color, while leaving identity-independent features unchanged. Prior works have covered the topic of symmetric translation between two specific domains but failed to optimize it on faces where only parts of the image may be changed. This work applies a face masking operation to the output at training time, which forces the image generator to preserve colors while altering the face, fitting it naturally inside the unmorphed surroundings. Various experiments are conducted including an ablation study on the final setting, decreasing the baseline identity switching performance from 81.7% to 75.8% whilst improving the average χ² color distance from 0.551 to 0.434. The provided code-based software gives users easy access to apply this neural face swap to images and videos of arbitrary crop and brings Computer Vision one step closer to replacing Computer Graphics in this specific area.
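A χ² distance between two color histograms is commonly computed as half the sum of squared bin differences divided by the bin sums. The sketch below uses that common definition on normalized histograms; whether the thesis uses exactly this normalization and binning is an assumption.

```python
import numpy as np

def chi2_distance(hist_a, hist_b, eps=1e-12):
    """Chi-squared distance between two histograms after normalization."""
    a = np.asarray(hist_a, dtype=np.float64)
    b = np.asarray(hist_b, dtype=np.float64)
    a = a / a.sum()
    b = b / b.sum()
    # eps avoids division by zero for bins that are empty in both histograms.
    return 0.5 * float(np.sum((a - b) ** 2 / (a + b + eps)))

same = chi2_distance([1, 2, 3], [2, 4, 6])     # same shape, different scale
disjoint = chi2_distance([1, 0], [0, 1])       # no overlapping mass
print(same, disjoint)
```

With this definition the distance ranges from 0 for identical distributions to 1 for distributions with no overlap, so lower values mean better color preservation.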
The detection of interictal epileptiform discharges in the visual analysis of electroencephalography (EEG) is an important but very difficult, tedious, and time-consuming task. There have been decades of research on computer-assisted detection algorithms, most recently focused on using Convolutional Neural Networks (CNNs). In this thesis, we present the CNN Spike Detector, a convolutional neural network to detect spikes in intracranial EEG. Our dataset of 70 intracranial EEG recordings from 26 subjects with epilepsy introduces new challenges in this research field. We report cross-validation results with a mean AUC of 0.926 (± 0.04), an area under the precision-recall curve (AUPRC) of 0.652 (± 0.10) and 12.3 (± 7.47) false positive epochs per minute for a sensitivity of 80%. A visual examination of false positive segments is performed to understand the model behavior leading to a relatively high false detection rate. We notice issues with the evaluation measures and highlight a major limitation of the common approach of detecting spikes using short segments, namely that the network cannot take into account the wider context from which a segment originates. For this reason, we present the Context Model, an extension in which the CNN Spike Detector is supplied with additional information about the channel. Results show promising but limited performance improvements. This thesis provides important findings about the spike detection task for intracranial EEG and lays out promising future research directions to develop a network capable of assisting experts in real-world clinical applications.
This thesis explores the application of modern Natural Language Processing techniques to the detection of artificially generated videos of popular American politicians. Instead of focusing on detecting anomalies and artifacts in images and sounds, this thesis focuses on detecting irregularities and inconsistencies in the words themselves, opening up a new possibility to detect fake content. A novel, domain-adapted, pre-trained version of the language model BERT combined with several mechanisms to overcome severe dataset imbalances yielded the best quantitative as well as qualitative results. In addition to creating the biggest publicly available dataset of English-speaking politicians, consisting of 1.5 M sentences from over 1000 persons, this thesis conducts various experiments with different kinds of text classification and sequence processing algorithms applied to the political domain. Furthermore, multiple ablations to manage severe data imbalance are presented and evaluated.
The desire to use generative adversarial networks (GANs) for real-world tasks such as object segmentation or image manipulation is increasing as synthesis quality improves, which has given rise to an emerging research area called GAN inversion that focuses on exploring methods for embedding real images into the latent space of a GAN. In this work, we investigate different GAN inversion approaches using an existing generative model architecture that takes a completely unsupervised approach to object segmentation and is based on StyleGAN2. In particular, we propose and analyze algorithms for embedding real images into the different latent spaces Z, W, and W+ of StyleGAN following an optimization-based inversion approach, while also investigating a novel approach that allows fine-tuning of the generator during the inversion process. Furthermore, we investigate a hybrid and a learning-based inversion approach, where in the former we train an encoder with embeddings optimized by our best optimization-based inversion approach, and in the latter we define an autoencoder, consisting of an encoder and the generator of our generative model as a decoder, and train it to map an image into the latent space. We demonstrate the effectiveness of our methods as well as their limitations through a quantitative comparison with existing inversion methods and by conducting extensive qualitative and quantitative experiments with synthetic data as well as real images from a complex image dataset. We show that we achieve qualitatively satisfying embeddings in the W and W+ spaces with our optimization-based algorithms, that fine-tuning the generator during the inversion process leads to qualitatively better embeddings in all latent spaces studied, and that the learning-based approach also benefits from a variable generator as well as a pre-training with our hybrid approach. 
Furthermore, we evaluate our approaches on the object segmentation task and show that both our optimization-based and our hybrid and learning-based methods are able to generate meaningful embeddings that achieve reasonable object segmentations. Overall, our proposed methods illustrate the potential that lies in the GAN inversion and its application to real-world tasks, especially in the relaxed version of the GAN inversion where the weights of the generator are allowed to vary.
With the maturity of supervised learning technology, research focus has gradually shifted to the field of self-supervised learning. “Momentum Contrast” (MoCo) proposes a new self-supervised learning method and raises the accuracy of self-supervised learning to a new level. Inspired by another article, “Representation Learning by Learning to Count”, the idea is that dividing a picture into four parts and passing them through a neural network can further improve the accuracy of MoCo. Unlike the original MoCo, this variant (Multi-scale MoCo) does not pass the augmented image directly through the encoder. Instead, Multi-scale MoCo crops and resizes the augmented images, and the four resulting parts are each passed through the encoder and then summed (the upsampled version does not resize the input but resizes the contrastive samples). This cropping is applied not only to queue q but also to the comparison queue k, since otherwise the weights of queue k might be damaged during the momentum update. This is discussed further in the experiments chapter, which compares the downsampled Multi-scale version with the version that downsamples both. Humans recognize objects following a similar principle: when people see something they are familiar with, they can still guess the object with high probability even if it is not fully displayed. Multi-scale MoCo applies this concept to the pretext part of MoCo, hoping to obtain better feature extraction. This thesis considers three versions of Multi-scale MoCo: a version with downsampled input samples, a version with downsampled input samples and contrast samples, and a version with upsampled input samples. The differences between these versions will be described in more detail later. The neural network architecture comparison includes ResNet50, and the tested dataset is STL-10.
The weights obtained in the pretext task are transferred to self-supervised learning, and during this process all layers except the final linear layer are kept frozen (these weights come from the pretext task).
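The crop-encode-sum step described above can be sketched as follows. The split into four equal quadrants, the omission of the resizing step, and the toy encoder (a per-channel mean standing in for a ResNet50) are all simplifying assumptions made for illustration.

```python
import numpy as np

def split_quadrants(img):
    """Split an (h, w, c) image with even h and w into four quadrants."""
    h, w = img.shape[:2]
    hh, hw = h // 2, w // 2
    return [img[:hh, :hw], img[:hh, hw:], img[hh:, :hw], img[hh:, hw:]]

def multiscale_embed(img, encoder):
    """Encode each quadrant separately and sum the resulting features."""
    return sum(encoder(part) for part in split_quadrants(img))

def toy_encoder(part):
    """Hypothetical stand-in for a real encoder: per-channel mean."""
    return part.mean(axis=(0, 1))

img = np.arange(8 * 8 * 3, dtype=np.float64).reshape(8, 8, 3)
emb = multiscale_embed(img, toy_encoder)
print(emb)
```

In the setting described above, the same procedure would be applied on both the q and the k side so that both encoders see inputs of the same kind.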
In this thesis, we present several approaches for training a convolutional neural network using only unlabeled data. Our autonomously supervised learning algorithms are based on connections between an image patch, i.e. a zoomed image, and its original. Using a siamese neural network architecture, we aim to recognize whether the image patch, which is input to the first part of the network, comes from the same image presented to the second part. By applying transformations to both images, and different zoom sizes at different positions, we force the network to extract high-level features using its convolutional layers. At the top of our siamese architecture, we have a simple binary classifier that measures the difference between the feature maps that we extract and makes a decision. Thus, the only way the classifier can solve the task correctly is if our convolutional layers extract useful representations. We can then use those representations to solve many different tasks that are related to the data used for unsupervised training. As the main benchmark for all of our models, we used the STL10 dataset, where we train a linear classifier on top of our convolutional layers with a small amount of manually labeled images, which is a widely used benchmark for unsupervised learning tasks. We also combine our idea with recent work on the same topic, the network called RotNet, which makes use of image rotations and therefore forces the network to learn rotation-dependent features from the dataset. As a result of this combination we create a new procedure that outperforms the original RotNet.
In the digital age of ever increasing data amassment and accessibility, the demand for scalable machine learning models effective at refining the new oil is unprecedented. Unsupervised representation learning methods present a promising approach to exploit this invaluable yet unlabeled digital resource at scale. However, a majority of these approaches focuses on synthetic or simplified datasets of images. What if a method could learn directly from natural Internet-scale image data? In this thesis, we propose a novel approach for unsupervised learning of object representations by mixing natural image scenes. Without any human help, our method mixes visually similar images to synthesize new realistic scenes using adversarial training. In this process the model learns to represent and understand the objects prevalent in natural image data and makes them available for downstream applications. For example, it enables the transfer of objects from one scene to another. Through qualitative experiments on complex image data we show the effectiveness of our method along with its limitations. Moreover, we benchmark our approach quantitatively against state-of-the-art works on the STL-10 dataset. Our proposed method demonstrates the potential that lies in learning representations directly from natural image data and reinforces it as a promising avenue for future research.
In computer vision, Visual Odometry is the problem of recovering the camera motion from a video. It is related to Structure from Motion, the problem of reconstructing the 3D geometry from a collection of images. Decades of research in these areas have brought successful algorithms that are used in applications like autonomous navigation, motion capture, augmented reality and others. Despite the success of these prior works in real-world environments, their robustness is highly dependent on manual calibration and on the magnitude of noise present in the images in the form of, e.g., non-Lambertian surfaces, dynamic motion and other forms of ambiguity. This thesis explores an alternative approach to the Visual Odometry problem via Deep Learning, that is, a specific form of machine learning with artificial neural networks. It describes and focuses on the implementation of a recent work that proposes the use of Recurrent Neural Networks to learn dependencies over time, exploiting the sequential nature of the input. Together with a convolutional neural network that extracts motion features from the input stream, the recurrent part accumulates knowledge from the past to estimate the camera pose at each point in time. An analysis of the performance of this system is carried out on real and synthetic data. The evaluation covers several ways of training the network as well as the impact and limitations of the recurrent connection for Visual Odometry.
Owing to the existence of large labeled datasets, Deep Convolutional Neural Networks have ushered in a renaissance in computer vision. However, almost all of the visual data we generate daily - several human lives' worth of it - remains unlabeled and thus out of reach of today’s dominant supervised learning paradigm. This thesis focuses on techniques that steer deep models towards learning generalizable visual patterns without human supervision. Our primary tool in this endeavor is the design of Self-Supervised Learning tasks, i.e., pretext-tasks for which labels do not involve human labor. Besides enabling the learning from large amounts of unlabeled data, we demonstrate how self-supervision can capture relevant patterns that supervised learning largely misses. For example, we design learning tasks that learn deep representations capturing shape from images, motion from video, and 3D pose features from multi-view data. Notably, these tasks’ design follows a common principle: the recognition of data transformations. The strong performance of the learned representations on downstream vision tasks such as classification, segmentation, action recognition, or pose estimation validates this pretext-task design. This thesis also explores the use of Generative Adversarial Networks (GANs) for unsupervised representation learning. Besides leveraging generative adversarial learning to define image transformations for self-supervised learning tasks, we also address training instabilities of GANs through the use of noise. While unsupervised techniques can significantly reduce the burden of supervision, in the end, we still rely on some annotated examples to fine-tune learned representations towards a target task. To improve the learning from scarce or noisy labels, we describe a supervised learning algorithm with improved generalization in these challenging settings.
Computers represent images with pixels and each pixel contains three numbers for red, green and blue colour values. These numbers are meaningless for humans and they are mostly useless when used directly with classical machine learning techniques like linear classifiers. Interpretable representations are the attributes that humans understand: the colour of the hair, the viewpoint of a car or the 3D shape of the object in the scene. Many computer vision tasks can be viewed as learning interpretable representations; for example, a supervised classification algorithm directly learns to represent images with their class labels. In this work we aim to learn interpretable representations (or features) indirectly with lower levels of supervision. This approach has the advantage of cost savings on dataset annotations and the flexibility of using the features for multiple follow-up tasks. We made contributions in three main areas: weakly supervised learning, unsupervised learning and 3D reconstruction. In the weakly supervised case we use image pairs as supervision. Each pair shares a common attribute and differs in a varying attribute. We propose a training method that learns to separate the attributes into separate feature vectors. These features are then used for attribute transfer and classification. We also show theoretical results on the ambiguities of the learning task and the ways to avoid degenerate solutions. We show a method for unsupervised representation learning that separates semantically meaningful concepts. We explain, and show through ablation studies, how the components of our proposed method work: a mixing autoencoder, a generative adversarial net and a classifier. We propose a method for learning single image 3D reconstruction. It uses only images; no human annotation, stereo, synthetic renderings or ground truth depth maps are needed. We train a generative model that learns the 3D shape distribution and an encoder to reconstruct the 3D shape. 
For that we exploit the notion of image realism. It means that the 3D reconstruction of the object has to look realistic when it is rendered from different random angles. We prove the efficacy of our method from first principles.
In this thesis, our focus is learning a controllable representation and applying the learned controllable feature representation to image synthesis, video generation, and even 3D reconstruction. We propose different methods to disentangle the feature representation in neural networks and analyze the challenges in disentanglement, such as the reference ambiguity and the shortcut problem that arise when using weak labels. We use the disentangled feature representation to transfer attributes between images, such as exchanging the hairstyle between two face images. Furthermore, we study how another type of feature, the sketch, works in a neural network. A sketch can provide the shape and contour of an object, such as the silhouette of a side-view face. We leverage the silhouette constraint to improve 3D face reconstruction from 2D images. A sketch can also provide the moving direction of an object, thus we investigate how one can manipulate the object to follow the trajectory provided by a user sketch. We propose a method to automatically generate video clips from a single image input, using the sketch as motion and trajectory guidance to animate the object in that image. We demonstrate the efficiency of our approaches on several synthetic and real datasets.
The complexity of any information processing task is highly dependent on the space where data is represented. Unfortunately, pixel space is not appropriate for computer vision tasks such as object classification. Traditional computer vision approaches involve a multi-stage pipeline where images are first transformed to a feature space through a handcrafted function and the task is then solved in that feature space. The challenge with this approach is the complexity of designing handcrafted functions that extract robust features. Deep learning based approaches address this issue by training a neural network end-to-end on some task, which lets the network discover an appropriate representation for the training task automatically. It turns out that image classification on large-scale annotated datasets yields a representation transferable to other computer vision tasks. However, supervised representation learning is limited by the availability of annotations. In this thesis we study self-supervised representation learning, where the goal is to alleviate this limitation by substituting the classification task with pseudo tasks whose labels come for free. We discuss self-supervised learning by solving jigsaw puzzles, which uses context as a supervisory signal. The rationale behind this task is that the network needs to extract features about object parts and their spatial configurations to solve the jigsaw puzzles. We also discuss a method for representation learning that uses an artificial supervisory signal based on counting visual primitives. This supervisory signal is obtained from an equivariance relation. We use two image transformations in the context of counting: scaling and tiling. The first transformation exploits the fact that the number of visual primitives should be invariant to scale. The second transformation allows us to equate the total number of visual primitives in each tile to that in the whole image. 
The most effective transfer strategy is fine-tuning, which restricts one to use the same model or parts thereof for both pretext and target tasks. We discuss a novel framework for self-supervised learning that overcomes limitations in designing and comparing different tasks, models, and data domains. In particular, our framework decouples the structure of the self-supervised model from the final task-specific fine-tuned model. Finally, we study the problem of multi-task representation learning. A naive approach to enhance the representation learned by a task is to train the task jointly with other tasks that capture orthogonal attributes. Having a diverse set of auxiliary tasks imposes challenges on multi-task training from scratch. We propose a framework that allows us to combine arbitrarily different feature spaces into a single deep neural network. We reduce the auxiliary tasks to classification tasks and, consequently, the multi-task learning problem to a multi-label classification task. Nevertheless, combining multiple representation spaces without being aware of the target task might be suboptimal. As our second contribution, we show empirically that this is indeed the case and propose to combine multiple tasks after fine-tuning on the target task.
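The counting relation mentioned in this abstract (the count in a downscaled image should equal the sum of the counts in its tiles) can be checked on a toy example. The sketch below is an illustration under stated assumptions, not the thesis implementation: `downsample2`, `tiles2x2`, `counting_residual` and the hand-written "feature" `count_dots` are hypothetical names, the downscaling is a 2x2 max-pool, and the image is a synthetic grid of isolated dots. In the actual method the counting feature would be a neural network trained to make such a residual small.

```python
import numpy as np

def downsample2(image):
    """Halve the resolution by max-pooling non-overlapping 2x2 blocks
    (assumes even height and width)."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def tiles2x2(image):
    """Split the image into four equally sized tiles."""
    h, w = image.shape
    return [image[r:r + h // 2, c:c + w // 2]
            for r in (0, h // 2) for c in (0, w // 2)]

def counting_residual(phi, image):
    """Squared difference between the count in the downscaled image and the
    summed counts of the tiles; zero when the relation holds exactly."""
    whole = np.asarray(phi(downsample2(image)), dtype=float)
    parts = sum(np.asarray(phi(t), dtype=float) for t in tiles2x2(image))
    return float(np.sum((whole - parts) ** 2))

def count_dots(image):
    """Toy counting feature: number of bright pixels."""
    return np.array([(image > 0.5).sum()])

# Synthetic image with three isolated dots, each in a different 2x2 block.
image = np.zeros((8, 8))
for r, c in [(1, 1), (2, 5), (6, 3)]:
    image[r, c] = 1.0

residual = counting_residual(count_dots, image)
```

Here the downscaled image and each tile both have size 4x4, so the same feature function can be applied to all of them, and `residual` is zero because the three dots are counted once in the downscaled image and once across the tiles.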
With the information explosion, a tremendous amount of photos is captured and shared via social media every day. Technically, a photo requires a finite exposure to accumulate light from the scene. Thus, objects moving during the exposure generate motion blur in a photo. Motion blur is an image degradation that makes visual content less interpretable and is therefore often seen as a nuisance. Although motion blur can be reduced by setting a short exposure time, the resulting lack of light has to be compensated by increasing the sensor’s sensitivity, which inevitably introduces a large amount of sensor noise. This motivates the need to remove motion blur computationally. Motion deblurring is an important problem in computer vision and it is challenging due to its ill-posed nature, which means the solution is not well defined. Mathematically, a blurry image caused by uniform motion is formed by the convolution of a blur kernel with a latent sharp image. There are potentially infinitely many pairs of blur kernel and latent sharp image that result in the same blurry image. Hence, some prior knowledge or regularization is required to address this problem. Even if the blur kernel is known, restoring the latent sharp image is still difficult because the high frequency information has been removed. Although we can model the uniform motion deblurring problem mathematically, this model only covers in-plane translational motion of the camera. In practice, motion is more complicated and can be non-uniform. Non-uniform motion blur can come from many sources, such as camera out-of-plane rotation, scene depth changes and object motion. Thus, it is more challenging to remove non-uniform motion blur. In this thesis, our focus is motion blur removal. We aim to address four challenging motion deblurring problems. We start from the noise-blind image deblurring scenario, where the blur kernel is known but the noise level is unknown. 
We introduce an efficient and robust solution based on a Bayesian framework using a smooth generalization of the 0−1 loss to address this problem. Then we study the blind uniform motion deblurring scenario where both the blur kernel and the latent sharp image are unknown. We explore the relative scale ambiguity between the latent sharp image and blur kernel to address this issue. Moreover, we study the face deblurring problem and introduce a novel deep learning network architecture to solve it. We also address the general motion deblurring problem and particularly we aim at recovering a sequence of 7 frames each depicting some instantaneous motion of the objects in the scene.
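The uniform blur model and its ambiguity, as described in this abstract, can be made concrete with a short NumPy sketch. It is only an illustration: the helper `conv2d_full`, the random "sharp" image and the 1x5 horizontal kernel are assumptions, and noise is ignored. The sketch forms a blurry image by convolution and then shows two other image/kernel pairs that produce exactly the same observation.

```python
import numpy as np

def conv2d_full(image, kernel):
    """Full 2D convolution written with explicit shifts (NumPy only)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih + kh - 1, iw + kw - 1))
    for r in range(kh):
        for c in range(kw):
            # Each kernel tap adds a shifted, weighted copy of the image.
            out[r:r + ih, c:c + iw] += kernel[r, c] * image
    return out

rng = np.random.default_rng(0)
sharp = rng.random((16, 16))            # stand-in for the latent sharp image
kernel = np.full((1, 5), 1.0 / 5.0)     # horizontal motion blur of 5 pixels
blurry = conv2d_full(sharp, kernel)

# Scale ambiguity: scaling the image up and the kernel down by the same
# factor leaves the blurry image unchanged, for any non-zero factor.
same_by_scale = conv2d_full(2.0 * sharp, kernel / 2.0)

# Trivial pair: the blurry image itself combined with an identity kernel.
identity = np.ones((1, 1))
same_by_identity = conv2d_full(blurry, identity)
```

Both `same_by_scale` and `same_by_identity` equal `blurry`, which is why a prior or regularizer is needed to select the intended sharp image and kernel.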
In this thesis we study the blind deconvolution problem. Blind deconvolution consists in the estimation of a sharp image and a blur kernel from an observed blurry image. Because the blur model admits several solutions it is necessary to devise an image prior that favors the true blur kernel and sharp image. Recently it has been shown that a class of blind deconvolution formulations and image priors has the no-blur solution as global minimum. Despite this shortcoming, algorithms based on these formulations and priors can successfully solve blind deconvolution. In this thesis we show that a suitable initialization can exploit the non-convexity of the problem and yield the desired solution. Based on these conclusions, we propose a novel “vanilla” algorithm stripped of any enhancement typically used in the literature. Our algorithm, despite its simplicity, is able to compete with the top performers on several datasets. We have also investigated a remarkable behavior of a 1998 algorithm, whose formulation has the no-blur solution as global minimum: even when initialized at the no-blur solution, it converges to the correct solution. We show that this behavior is caused by an apparently insignificant implementation strategy that makes the algorithm no longer minimize the original cost functional. We also demonstrate that this strategy improves the results of our “vanilla” algorithm. Finally, we present a study of image priors for blind deconvolution. We provide experimental evidence supporting the recent belief that a good image prior is one that leads to a good blur estimate rather than being a good natural image statistical model. By focusing the attention on the blur estimation alone, we show that good blur estimates can be obtained even when using images quite different from the true sharp image. This allows using image priors, such as those leading to “cartooned” images, that avoid the no-blur solution. 
By using an image prior that produces “cartooned” images we achieve state-of-the-art results on different publicly available datasets. We therefore suggest a paradigm shift in blind deconvolution: from modeling natural image statistics to modeling cartooned image statistics.
This thesis investigates the problem of 3D reconstruction of a scene from 2D images. In particular, we focus on photometric stereo, which is a technique that computes the 3D geometry from at least three images taken from the same viewpoint and under different illumination conditions. When the illumination is unknown (uncalibrated photometric stereo) the problem is ambiguous: different combinations of geometry and illumination can generate the same images. First, we solve the ambiguity by exploiting the Lambertian reflectance maxima. These are points defined on curved surfaces where the normals are parallel to the light direction. Then, we propose a solution that can be computed in closed form and thus very efficiently. Our algorithm is also very robust and always yields the same estimate regardless of the initial ambiguity. We validate our method on real-world experiments and achieve state-of-the-art results. In this thesis we also solve for the first time the uncalibrated photometric stereo problem under the perspective projection model. We show that, unlike in the orthographic case, one can uniquely reconstruct the normals of the object and the lights given only the input images and the camera calibration (focal length and image center). We also propose a very efficient algorithm, which we validate on synthetic and real-world experiments, and show that the proposed technique is a generalization of the orthographic case. Finally, we investigate the uncalibrated photometric stereo problem in the case where the lights are distributed near the scene. In this case we propose an alternating minimization technique which converges quickly and overcomes the limitations of prior work that assumes distant illumination. We show experimentally that adopting a near-light model for real-world scenes yields very accurate reconstructions.
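To illustrate why at least three images are needed and why unknown illumination makes the problem ambiguous, the following NumPy sketch simulates the standard Lambertian model under distant lights. It is a hedged toy example and not the method of the thesis: the function `solve_calibrated`, the synthetic normals, albedos and light directions, and the matrix `A` are all assumptions, and shadows are ignored.

```python
import numpy as np

def solve_calibrated(lights, images):
    """Calibrated Lambertian photometric stereo by least squares.
    lights: (K, 3) known light directions, images: (K, P) intensities.
    Returns unit normals (P, 3) and albedos (P,)."""
    scaled, *_ = np.linalg.lstsq(lights, images, rcond=None)  # (3, P)
    scaled = scaled.T
    albedo = np.linalg.norm(scaled, axis=1)
    return scaled / albedo[:, None], albedo

rng = np.random.default_rng(0)
P = 50
# Synthetic surface: normals that mostly face the camera, random albedo.
normals = rng.normal(size=(P, 3)) * [0.3, 0.3, 0.0] + [0.0, 0.0, 1.0]
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
albedo = rng.uniform(0.2, 1.0, size=P)

lights = np.array([[0.0, 0.0, 1.0], [0.3, 0.0, 1.0],
                   [0.0, 0.3, 1.0], [-0.3, -0.3, 1.0]])
lights /= np.linalg.norm(lights, axis=1, keepdims=True)

scaled_normals = albedo[:, None] * normals      # (P, 3)
images = lights @ scaled_normals.T              # (K, P), no shadow clamping

normals_hat, albedo_hat = solve_calibrated(lights, images)

# Uncalibrated case: any invertible 3x3 matrix A gives a different
# geometry/illumination pair that renders exactly the same images.
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, -0.3, 2.0]])
other_lights = lights @ np.linalg.inv(A)
other_scaled_normals = scaled_normals @ A.T
other_images = other_lights @ other_scaled_normals.T
```

With known lights, `normals_hat` and `albedo_hat` match the simulated surface. With unknown lights, `other_images` is identical to `images` even though both the lights and the scaled normals differ, which is the kind of ambiguity that extra constraints have to resolve.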