Boosting Self-Supervised Learning via Knowledge Transfer

Mehdi Noroozi¹, Ananth Vinjimoor², Paolo Favaro¹, Hamed Pirsiavash²
¹University of Bern
²University of Maryland, Baltimore County

Abstract

In self-supervised learning, one trains a model to solve a so-called pretext task on a dataset without the need for human annotation. The main objective, however, is to transfer this model to a target domain and task. Currently, the most effective transfer strategy is fine-tuning, which restricts one to use the same model or parts thereof for both pretext and target tasks. In this paper, we present a novel framework for self-supervised learning that overcomes limitations in designing and comparing different tasks, models, and data domains. In particular, our framework decouples the structure of the self-supervised model from the final task-specific fine-tuned model. This allows us to: 1) quantitatively assess previously incompatible models including handcrafted features; 2) show that deeper neural network models can learn better representations from the same pretext task; 3) transfer knowledge learned with a deep model to a shallower one and thus boost its learning. We use this framework to design a novel self-supervised task, which achieves state-of-the-art performance on the common benchmarks in PASCAL VOC 2007, ILSVRC12 and Places by a significant margin. Our learned features shrink the mAP gap between models trained via self-supervised learning and supervised learning from 5.9% to 2.6% in object detection on PASCAL VOC 2007.

Code

The Jigsaw++ code is available on GitHub.
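For orientation, here is a minimal sketch (not the official code) of the cluster-classification idea behind the knowledge transfer described in the abstract: features extracted with a deep pretext model are clustered with k-means, and the cluster assignments serve as pseudo-labels for training a smaller target network. The feature dimensions and cluster count below are illustrative placeholders, not the values used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_pseudo_labels(features, n_clusters, seed=0):
    """Cluster pretext-task features; the cluster ids become pseudo-labels."""
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(features)
    return km.labels_

# Placeholder for features extracted with a deep model (e.g. VGG) trained
# on the Jigsaw++ pretext task; in the paper these come from ImageNet images.
deep_features = np.random.randn(10000, 512).astype(np.float32)

# The pseudo-labels are then used as classification targets when training
# a smaller network (e.g. AlexNet) from scratch.
pseudo_labels = make_pseudo_labels(deep_features, n_clusters=100)
```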

Paper

CVPR 2018, hosted on arXiv.

Pre-trained Models

The .zip file contains I) the pre-trained Caffe model with its deploy file and II) the ImageNet pseudo-labels for both the train and validation sets. The detection performance is obtained by multiscale train/test fine-tuning with the Fast R-CNN framework for 150K iterations, with an initial learning rate of 0.001 that is dropped by a factor of 0.1 every 50K iterations. The maximum image size for both training and testing is set to 2000 pixels. A model pre-trained on ImageNet labels achieves 59.8% mAP with the same settings.
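This schedule translates into standard Caffe solver settings. The sketch below writes such a solver: the learning-rate hyperparameters come from the text above, while the network path, momentum, and weight decay are assumptions (common Fast R-CNN defaults), not values taken from the release.

```python
# Hypothetical solver.prototxt matching the fine-tuning schedule above.
solver_text = """\
train_net: "models/jigsawpp_frcnn/train.prototxt"  # hypothetical path
base_lr: 0.001        # initial learning rate (from the text)
lr_policy: "step"     # drop the learning rate in fixed steps
gamma: 0.1            # multiply the lr by 0.1 at each step (from the text)
stepsize: 50000       # every 50K iterations (from the text)
max_iter: 150000      # 150K iterations in total (from the text)
momentum: 0.9         # assumption: standard Fast R-CNN default
weight_decay: 0.0005  # assumption: standard Fast R-CNN default
"""
with open("solver.prototxt", "w") as f:
    f.write(solver_text)
```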
The Jigsaw++ task without cluster classification (the jigsawpp row in the table below) is trained with a stride of 2 in conv1. However, all transfer-learning experiments use only the convolutional layers, with a stride of 4 in conv1. Please use the matching deploy file if you want to reproduce the transfer-learning results.
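As a usage example, the snippet below loads the released weights with a stride-4 deploy file via pycaffe. The file names are assumptions based on the description of the .zip contents above; adjust them to the actual names in the archive.

```python
import caffe

caffe.set_mode_cpu()
# Use the deploy file with stride 4 in conv1 for transfer learning,
# even though the pretext model was trained with stride 2.
net = caffe.Net("deploy_stride4.prototxt",  # hypothetical file name
                "jigsawpp.caffemodel",      # hypothetical file name
                caffe.TEST)
print(net.blobs["data"].data.shape)  # sanity-check the expected input shape
```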

Method    | Pretext architecture | Cluster classification | Detection performance (%mAP) | Files
--------- | -------------------- | ---------------------- | ---------------------------- | ------------
Context   | AlexNet              | yes                    | 53.4                         | context_acc
HOG       | -                    | yes                    | 53.5                         | hog_cc
Jigsaw++  | AlexNet              | no                     | 55.7                         | jigsawpp
Jigsaw++  | AlexNet              | yes                    | 55.8                         | jigsawpp_acc
Jigsaw++  | VGG                  | yes                    | 57.2                         | jigsawpp_vcc