A modern look at peer-reviewing

Without a doubt this is a great time to be doing research in Computer Vision and Machine Learning. Research is progressing at an unprecedented pace thanks to the sharing of deep learning libraries and technical reports on arXiv, the availability of affordable GPU computing, and the development of practical large-scale machine learning methods as well as massive datasets to train models on. As a consequence, many more researchers are now working in deep learning than ever before. This, of course, is great news for the advancement of research in these fields, as having more scientists working on the same problems increases the chances of solving them sooner. In fact, it is no longer uncommon to find multiple concurrent works with extremely similar ideas and solutions.

This growth has also increased the number of papers submitted to conferences and, with it, the demand for more reviewers to keep the workload per reviewer manageable. However, the pressure to increase the number of reviewers may lead to recruiting researchers who have not yet had enough training and experience in the scientific method or in critically analyzing papers, or enough time to learn about prior work. At the same time, checking for originality is now more difficult because of the large number of concurrent works to be aware of and the limited time left by a more competitive scientific arena. The dangers of this scenario are that valid work gets rejected and weak work gets promoted, and that which authors receive credit for concurrent work may come down to chance.

These problems, however, may not sound very new. Many seasoned researchers may label them as mere reviewing noise. It is also clear that such reviewing noise, as well as reviewer bias, is probably unavoidable. Nevertheless, it is also reasonable to think that the current reviewing process is not yet the best possible one. By examining the whole pipeline as a scientific problem, one might find ways to improve it, and this might benefit the researchers in the field as well as the progress of science in general. If you want to skip directly to a proposed solution, go to the section "Assessing reviews: Bias or noise?".

The path to publication

The assessment of which papers are ready for publication is carried out by reviewers and meta-reviewers (also called area chairs): the former read the submissions in detail and provide a technical review; the latter oversee the quality of the reviews and use them to decide on acceptance or rejection.

In machine learning terms, the reviewers act as (paper) feature generators and the meta-reviewers act as (paper) classifiers given the features from a few reviewers. Features are, for example, the reviewer's (self-assessed) confidence, the paper summary, and how well each review statement is supported by evidence (e.g., this work has been done before, see Author X et al. [Y]). Given enough feature samples and enough conferences to train on, one could reasonably expect reviewers and meta-reviewers to eventually achieve high performance, which would translate to a good paper selection process.

However, the few features that one can extract from a review may be biased and, as argued before, may suffer from high levels of noise. Thus, they may not be good enough to classify a paper correctly.

Towards better reviewing

Most conferences in computer vision and machine learning try to train their reviewers by providing thorough instructions and examples. Also, the review forms are designed so that reviewers demonstrate an understanding of the paper (e.g., provide a summary of the paper), give a balanced judgment of the contributions (e.g., describe both positive and negative aspects of the paper), and provide a self-assessment (e.g., confidence). Moreover, authors can rebut the reviews. In some cases, reviewers can also discuss in a group under the supervision of a meta-reviewer. In theory, rebuttals and discussions should be the ultimate solution to remove noise and to get only high-quality reviews. The reality is quite different. Rebuttals are effective only when reviewers are engaged, have the time and willingness to check and discuss, and are open-minded and experienced. In practice, reviewers very often have a strong bias towards their initial position.

Training is further boosted by sharing best practices in reviewing. Conferences such as NIPS make (anonymized) reviews of accepted papers public on their website. ICLR has introduced an open reviewing process: both the paper submission and its reviews are posted on a website after being anonymized. Others can post comments and even ask to become reviewers. The authors can respond to comments, post messages and update their paper. This is a very interesting reviewing process, but it might favor papers working on popular topics. These papers may get many high-quality reviews, while papers on less popular topics may get low-quality reviews. This might be a valid strategy to allocate efforts, but it does not necessarily remove reviewing noise and it might push research towards only a few topics.

Finally, several conferences give reviewer awards to further incentivize researchers to perform a thorough job. All these innovations aim to improve the quality of reviews: Reviewers learn to be more objective and to provide constructive feedback. However, these measures do not give tools to meta-reviewers to detect and remove reviewing noise.

Assessing reviews: Bias or noise?

A simple way to handle reviewing noise and bias is to assess the reviews. For example, suppose that reviews about the same paper had their relative rating as an additional feature. If that were available, then meta-reviewers could better weigh the different opinions and perhaps distinguish a poor understanding of the paper from a valid criticism. Who can provide such relative ratings? The authors know everything about their paper, and can immediately see if the reviewers understood their work well, but they are also extremely biased towards favorable reviews. The meta-reviewer would need to read the paper to know with certainty the quality of the review, but this would defeat the whole point of appointing reviewers in the first place. Are we then stuck with reviewing noise?

Let us go back to the authors. They know their paper best, so it would be great to have a way to extract their knowledge without bias. One way to do so is to ask them to build a test set for the reviews in advance. For example, authors could provide a set of questions and answers about their paper along with their submission. The questions would be added to the review form and the correct answers would be visible only to the meta-reviewer. To avoid any additional workload, each question could be multiple choice, so that the answers can be processed automatically by the submission website.

The meta-reviewer would then obtain reviews with an associated relative rating, which would be unbiased and pertinent to the paper. The responsibility for designing the questions and answers would lie with the authors, who care the most about their submission.
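As a rough illustration of how a submission website might process the answers automatically, here is a minimal Python sketch. It assumes that the relative rating is simply the fraction of the authors' questions that a reviewer answers correctly; the text above does not fix a formula, so this is only one possible choice. The function name, the reviewer names and the answer key are all hypothetical.

```python
from fractions import Fraction


def relative_rating(answer_key, reviewer_answers):
    """Fraction of the authors' questions that a reviewer answered correctly.

    Assumption: each question counts equally; the text does not prescribe this.
    """
    if not answer_key:
        raise ValueError("the answer key is empty")
    if len(answer_key) != len(reviewer_answers):
        raise ValueError("one answer per question is required")
    correct = sum(1 for key, given in zip(answer_key, reviewer_answers) if key == given)
    return Fraction(correct, len(answer_key))


# Hypothetical answer key (correct option per question), visible only to the meta-reviewer.
answer_key = [2, 2, 1]

# Hypothetical options picked by three reviewers of the same paper.
reviews = {
    "reviewer_1": [2, 2, 1],
    "reviewer_2": [2, 3, 3],
    "reviewer_3": [4, 1, 2],
}

ratings = {name: relative_rating(answer_key, answers) for name, answers in reviews.items()}

# List the reviews from the highest to the lowest relative rating.
for name, rating in sorted(ratings.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rating}")
```

The meta-reviewer would see only the resulting ratings next to each review, not the individual answers, and could use them to weigh conflicting opinions.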

Here is an example of a question and its multiple-choice answers:

Question #1
  • What is the main contribution of the paper?
  1. A computationally efficient solution to problem X.
  2. A novel image prior for problem X.
  3. A new dataset to solve problem X.
  4. A novel provably convergent solution to solve problem X.


Authors could develop strategies in writing questions and answers. For instance, they could use them to guide the reviewers towards a better understanding of their paper:

Question #2
  • What is the main difference between this work and reference [15]?
  1. This work proposes only a different loss function.
  2. This work proposes a different loss function and a different network architecture.
  3. This work proposes a different loss function and a different stochastic gradient descent.


Another strategy would be to ask specific questions that expose whether a reviewer has read the whole paper in detail. For example,

Question #3
  • How do the authors train their subnetwork A?
  1. They perform data augmentation on ImageNet with affine transformations and use loss X.
  2. They perform data augmentation on ImageNet with random translations and use loss Y.
  3. They use the same training procedure as with subnetwork B.


Finally, there could also be a side benefit for the authors: Writing these questions and answers might force them to think about how clearly their paper conveys the desired content and perhaps encourage them to write it better.

Another interesting benefit is that conference organizers could collect statistics on the reviewers across all their papers and have a more objective and standardized measure of their performance.
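The kind of statistic that organizers could collect might look like the following Python sketch. It assumes that each review already carries a relative rating between 0 and 1 and that a plain average per reviewer is a reasonable summary; both are illustrative assumptions, and the record layout, reviewer names and paper identifiers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def reviewer_accuracy(records):
    """Average relative rating per reviewer over all the papers they reviewed."""
    per_reviewer = defaultdict(list)
    for reviewer, _paper, rating in records:
        per_reviewer[reviewer].append(rating)
    return {reviewer: mean(values) for reviewer, values in per_reviewer.items()}


# Hypothetical records: (reviewer, paper, relative rating on that paper).
records = [
    ("reviewer_1", "paper_17", 1.0),
    ("reviewer_1", "paper_42", 0.5),
    ("reviewer_2", "paper_17", 0.0),
    ("reviewer_2", "paper_99", 0.5),
]

stats = reviewer_accuracy(records)
for reviewer, average in sorted(stats.items()):
    print(f"{reviewer}: {average:.2f}")
```

Such an average should be read with care: a low value may reflect poorly designed questions or an unclear paper rather than a careless reviewer.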

Would it work?

Let us consider a few scenarios to see if this has a shot:

  1. The authors provide a poor set of questions and answers. In this case the answers of all reviewers are equivalent to a random guess, and the expected relative rating is about ⅓ (the chance level when each question has three options).
  2. The authors provide a good set of questions and answers, but the paper is poorly written and no reviewer understands it. The answers of all reviewers are again equivalent to a random guess, and the expected relative rating is about ⅓ as above.
  3. The authors provide a good set of questions and answers, and the paper is clearly written. Reviewers who understand the paper well will consistently hit all the right answers, which is unlikely to happen by chance. Some reviews may therefore obtain a higher relative rating than others, allowing the meta-reviewer to differentiate between reviewers who fully understand the paper and those who do not.
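To make "unlikely to happen by chance" concrete, the following Python sketch computes how a reviewer who guesses uniformly at random would score. It assumes that guesses are independent across questions and that the relative rating is the fraction of correct answers; the option counts mirror the three example questions above (four, three and three options), and the function name is hypothetical.

```python
from fractions import Fraction


def guess_distribution(options_per_question):
    """dist[k] = probability of exactly k correct answers under uniform random guessing."""
    dist = [Fraction(1)]
    for n_options in options_per_question:
        p = Fraction(1, n_options)  # chance of guessing this question correctly
        updated = [Fraction(0)] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            updated[k] += prob * (1 - p)   # this question guessed wrong
            updated[k + 1] += prob * p     # this question guessed right
        dist = updated
    return dist


# Number of options in Questions #1, #2 and #3 above.
options = [4, 3, 3]
dist = guess_distribution(options)

# Expected relative rating of a pure guesser, and the chance of a perfect score.
expected_rating = sum(k * p for k, p in enumerate(dist)) / len(options)
all_correct = dist[-1]

print(f"expected relative rating when guessing: {expected_rating}")
print(f"probability of answering everything correctly by chance: {all_correct}")
```

With three questions of three options each, the same computation gives an expected rating of exactly ⅓ and a 1/27 chance of a perfect score. Adding questions makes a perfect score by guessing rapidly less likely.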


We see that in the worst cases the proposed process becomes identical to the current one: when the relative ratings are at chance level for every reviewer, the meta-reviewer cannot use them to differentiate reviews. In the best case, however, the meta-reviewer has a new tool. Perhaps there are pitfalls that are not yet apparent. Nonetheless, I believe that we should be able to improve the reviewing process if we tackle it as a scientific problem.