Computer vision is a field with a very wide variety of research topics, each involving visual data in one form or another. What should computer vision ultimately solve? I bet most scientists in this field have very different ideas about what the fundamental problems are. However, most might agree that computer vision should find out how we can build machines that "see". This can be interpreted in different ways, so in this post I'd like to clarify my own take on what it means.
In my view, the ultimate quest of computer vision is:
Is there a learning rule that would allow a machine to exploit vision (and other senses) to develop useful models of the environment?
Also, would this rule be so general that it can be observed as the basic mechanism used by animals to build useful models of the environment? If this were true then we could take inspiration from how animals learn. Of course, we should also expect different learning outcomes due to the difference between the resources available in a machine and those in an animal.
I make the assumption that evolution would make such a general learning rule emerge in animals. Evolution would select, among many, systems that are robust to environmental changes, so that, no matter where they are deployed, they can quickly adapt and build nontrivial knowledge. Although animals born with prior knowledge exist (e.g., chickens), there are also animals, such as crows and humans, that are born with very few immediate skills. An interesting pattern is that animals with no initial skills seem to develop many more skills, and more advanced ones, than those with skills at the outset. These skills seem to be learned over time through the continuous observation of the environment. In particular, animals seem to learn about their environment mostly without external supervision. Therefore, I would qualify the quest as that of finding an unsupervised learning rule.
Finally, our belief in the correctness of an unsupervised learning rule would increase with the number of capabilities observed in animals that the rule induces as a side effect. For example, suppose that the learning rule in a human is the prediction of future signals (e.g., visual and acoustic stimuli). We observe that humans develop abstract thinking. We can regard abstract thinking as a side effect of this learning rule, because abstraction leads to better predictions. This reinforces our belief in such a learning rule.
If we consider an animal without initial skills, learning is a necessity more than a desirable ability. Therefore, one could postulate that an internal mechanism to encourage learning might be based more on the absence of a negative reward than on the presence of a positive one. I conjecture that this negative reward is given by surprise and, therefore, that animals do all they can to avoid being surprised. Surprise is often associated with a poor understanding of the environment, and could result in a fatal outcome in the presence of a predator. So evolution might have selected animals that are built to respond negatively to surprise.
I formalize surprise as the inability to predict future observations through the senses. Thus, to reduce surprise an animal would have to somehow build an ability to predict what will happen next. This mechanism requires an internal generative model of the environment, which, when queried, can produce plausible outcomes. Therefore, a possible unsupervised learning rule could be that of constantly improving some internal generative models of the environment by minimizing the amount of surprise, i.e., by minimizing the error in predicting future visual, auditory and other sensory data.
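To make this rule concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a claim about how animals learn: the "environment" is a sine wave, the internal model is a two-weight linear predictor, surprise is measured as the squared prediction error, and the update is a plain least-mean-squares step. The names run_predictor, lr and omega are hypothetical.

```python
import math


def run_predictor(steps=1000, lr=0.1, omega=1.0):
    """Train a tiny linear predictor online by reducing its own surprise.

    Assumptions for illustration only: the observation at time t is
    sin(omega * t), and the model predicts the next observation from
    the two most recent ones.
    """
    weights = [0.0, 0.0]
    surprises = []
    for t in range(1, steps + 1):
        # The two most recent observations available to the model.
        past = [math.sin(omega * t), math.sin(omega * (t - 1))]
        # The observation that arrives next.
        target = math.sin(omega * (t + 1))

        prediction = weights[0] * past[0] + weights[1] * past[1]
        error = target - prediction
        surprises.append(error ** 2)  # surprise = squared prediction error

        # Least-mean-squares step: nudge the weights to shrink the error.
        weights = [w + lr * error * x for w, x in zip(weights, past)]
    return weights, surprises


weights, surprises = run_predictor()
early = sum(surprises[:100]) / 100
late = sum(surprises[-100:]) / 100
print("mean surprise, first 100 steps:", early)
print("mean surprise, last 100 steps:", late)
```

In this toy setting the mean surprise over the last 100 steps should be far smaller than over the first 100, because a sinusoid can be predicted exactly from its two previous samples. Real sensory data is of course not this forgiving, which is the point of the next question.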
What kind of models would be suitable to make accurate predictions? Before we answer this question we need to ask ourselves a more basic one: To what degree is it possible to predict future sensory observations? There are observations that are stochastic in nature or that depend on unobserved or unobservable factors. For example, the exact position of a tree shaken by the wind is not predictable. However, some properties of its coarse oscillating motion are predictable. This suggests that animals might also have to learn at what level of accuracy predictions are possible. A possible way to do so is to make multiple predictions, each at a different level of coarseness or, better, abstraction. This could be achieved by using a hierarchical representation of the observations (such as in neural networks). Thus, a suitable mechanism could be, for example, to ask the generative model to predict not only the future visual data, but also all the future intermediate representations in the hierarchical structure.
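The tree example can be turned into a small numerical sketch. The setup is entirely my own assumption for illustration: the "tree" is a slow sine oscillation plus Gaussian noise, the hierarchy is replaced by simple average pooling, and the predictor at each level is the naive rule "the next value equals the current one". The names pool and relative_persistence_error are hypothetical.

```python
import math
import random


def pool(signal, window):
    """Coarsen a signal by averaging non-overlapping windows."""
    return [
        sum(signal[i:i + window]) / window
        for i in range(0, len(signal) - window + 1, window)
    ]


def relative_persistence_error(signal):
    """Error of predicting 'next = current', divided by the signal variance."""
    mean = sum(signal) / len(signal)
    variance = sum((x - mean) ** 2 for x in signal) / len(signal)
    mse = sum((b - a) ** 2 for a, b in zip(signal, signal[1:])) / (len(signal) - 1)
    return mse / variance


rng = random.Random(0)  # fixed seed so that the run is repeatable
# Slow, predictable oscillation plus unpredictable fine-scale jitter.
fine = [math.sin(2 * math.pi * t / 200) + rng.gauss(0.0, 0.5) for t in range(4000)]

# Window 1 is the raw signal; window 10 is one level coarser.
errors = {window: relative_persistence_error(pool(fine, window)) for window in (1, 10)}
for window, err in errors.items():
    print("pooling window", window, "relative prediction error", err)
```

With these settings the coarse level should be noticeably easier to predict, relative to its own variability, than the raw samples. A learner that tracks its prediction error at every level could use this kind of signal to discover where prediction is actually feasible.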
Another important aspect in defining generative models inspired by animals is that resources (energy, memory, space) are limited. Thus, one might expect learning to favor models that predict well with less computational effort. This means that simple and efficient models would be prioritized over more complex ones. It also means that models that better capture the structure of the data, and thus generalize better, will be preferred. As an example, it might be more efficient to represent a scene with multiple objects by factorizing the representation of each object separately rather than by having a single representation that captures all object configurations (whose complexity grows exponentially with the number of objects). Therefore, one might expect that representations of the environment will tend to be factorized. For example, objects may be automatically identified through a series of macro attributes such as category/identity, size, shape, location, pose, behavior and so on.
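A quick back-of-the-envelope count shows how large the gap is. As a simplifying assumption of mine, suppose that each object can be in one of a fixed number of states and that objects are described independently of one another. The function names below are hypothetical.

```python
def joint_table_size(num_objects, states_per_object):
    """Entries needed to list every configuration of the whole scene."""
    return states_per_object ** num_objects


def factorized_table_size(num_objects, states_per_object):
    """Entries needed when each object is represented separately."""
    return num_objects * states_per_object


# Assumed for illustration: every object has 10 possible states.
for n in (1, 2, 4, 8):
    print(n, "objects:", joint_table_size(n, 10), "joint vs",
          factorized_table_size(n, 10), "factorized")
```

The joint count is multiplied by the number of states with every added object, while the factorized count only grows by that number. Real objects interact, so a fully independent factorization is an idealization, but the count illustrates why a resource-limited learner would be pushed in that direction.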
Another aspect is that our learning rule might lead to developing more and more abstract models. Abstraction is a way both to do more with less energy, memory and space, and to predict better and further into the future. In particular, this could lead to meta learning, where a model learns to predict and simulate other models. In practice, we might expect a predator to learn to predict the actions of its prey by simulating the prey's thought process, so that it can catch it.
Finally, I believe that an important aspect of learning a generative model is that causality should be naturally part of it. Causality gives us the ability to predict at a very high level of abstraction (e.g., what would happen if ...?). The question is then: Do animals learn causality only through physical interaction/intervention? Are there non-restrictive assumptions that allow one to instead learn about causality only through passive observation? Are there learning mechanisms to capture causality without intervention that can be transferred to a machine?