Propagation of Interpretations Based on Graded Resolution Input

Roger A. Browse and Marion G. Rodrigues


Abstract

The graded resolution and selectional aspects of human vision have only rarely been considered in the development of computational vision systems. Such considerations may yield a great advantage for machine vision because they provide a means of avoiding fine, detailed processing of entire images. This paper presents a process called propagation which permits the integration of image understanding based on distinct levels of resolution, such as might be available in a single fixation of a human eye. Two examples of this propagation process are given: one which deals with image content at the level of the primal sketch, and the other which deals with interpretations based on object models. The results obtained with peripheral information may be examined to predict locations which, if fixated, have the greatest probability of allowing propagation. These locations are the candidates for subsequent eye fixation.

Introduction

Computational vision research pursues the long-range goal of developing a generalized vision system capable of performance comparable to that of biological systems. In light of this goal, it is not surprising that through the brief history of computer vision there has been increasing consideration of the operation of biological systems in the design of machine perception (e.g. Marr). Despite this trend, two widespread assumptions about machine vision remain in sharp contrast to the realities of human perception.

The Uniform Retina Assumption

Due no doubt to the nature of available digitizing equipment, computer vision research has almost unanimously assumed that a common resolution level is available at all points of the field of view. Such uniform images may be subjected to uniform processing, an immensely important simplification over the case of human vision, in which acuity falls off sharply even as close as 2 degrees from the foveal center. While this simplification appears important for commercial applications of machine vision research, it perpetuates the misconception that all locations of an image must be processed with equal scrutiny in order to permit effective vision. Edge detection schemes employing center-surround mechanisms often cite the work of Wilson and Bergen as a statement of compatibility with human vision for the convolution of images with operators of different scale, without ever noting that Wilson and Bergen's results included a gradation in scale of about 40% within 4 degrees of the foveal center.
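The scale gradation reported by Wilson and Bergen can be illustrated with a minimal sketch: a difference-of-Gaussians (center-surround) operator whose scale grows with eccentricity. The linear growth model and all constants below are illustrative assumptions, chosen only so that the operator scale increases by about 40% over 4 degrees; they are not taken from Wilson and Bergen's fits.

```python
import numpy as np

def dog_sigma(ecc_deg, sigma_fovea=1.0, growth=0.40 / 4.0):
    """Center-Gaussian scale as a function of eccentricity (degrees).

    Illustrative linear model: scale grows ~40% over 4 degrees,
    mirroring the gradation discussed in the text.
    """
    return sigma_fovea * (1.0 + growth * ecc_deg)

def dog_kernel(ecc_deg, size=15, surround_ratio=1.6):
    """Difference-of-Gaussians (center-surround) kernel at a given
    eccentricity; the surround scale is a fixed multiple of the
    center scale (the 1.6 ratio is a common illustrative choice)."""
    s_c = dog_sigma(ecc_deg)
    s_s = surround_ratio * s_c
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    center = np.exp(-r2 / (2 * s_c ** 2)) / (2 * np.pi * s_c ** 2)
    surround = np.exp(-r2 / (2 * s_s ** 2)) / (2 * np.pi * s_s ** 2)
    return center - surround

# Under this model, the operator applied 4 degrees from the foveal
# center is about 40% coarser than the one applied at the fovea.
print(dog_sigma(0.0), dog_sigma(4.0))  # ~1.0 at the fovea, ~1.4 at 4 degrees
```

Convolving an image with kernels whose scale varies in this way, rather than with a single uniform operator, is one concrete sense in which processing need not treat all image locations with equal scrutiny.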

The human visual system has probably evolved its graded resolution structure in order to permit both high resolution and a broad field of view within limited processing resources. It seems reasonable to expect that the processes of vision, and the knowledge involved in vision, are highly tuned to the selective use of the fovea, and to the integration of visual input obtained at many different scales into a coherent percept.

The research described in this paper is based on the belief that the time has come to stop treating the graded structure of the retina as an unfortunate detail of the biological implementation, and to move towards developing the computational underpinnings of realistic biological vision.

The Appropriate Image Assumption

Computational vision systems usually assume that the image under analysis happens to contain an object or scene of the type the system expects, usually nicely centered. Natural vision, of course, requires aligning the sensors with the aspects of the scene that are of interest. In particular, any vision system which uses graded resolution must necessarily face the problem of selecting locations at which to apply central vision. During picture viewing, humans reposition their eyes about 3 times per second, selecting important information for detailed analysis and incorporation into a percept which seems independent of these movements.

During the early period of computer vision research the assumption of appropriate images was certainly justified in order to facilitate the investigation of the interpretation process. More recently, however, robotics has emerged as a major application for machine vision, and mobile robots will certainly require the ability to select areas to image from within their environment.

In this paper we present a process called propagation which allows the fine-level understanding available at the fovea to be used to draw inferences about the nature of information available peripherally, at lower resolution. Two versions of this process are described: one operates domain-independently, and the other is domain-dependent. In the final section we outline how the results of the propagation processes may be used to select locations for the application of foveal vision.
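The fixation-selection idea outlined above can be sketched in a few lines. This is a hypothetical illustration, not the paper's mechanism: it assumes only that peripheral analysis has already produced candidate locations, each scored with an estimated probability that fixating there would allow propagation; the function name and data layout are invented for this sketch.

```python
def next_fixation(candidates):
    """Pick the next eye position from scored peripheral candidates.

    candidates: list of ((x, y), propagation_probability) pairs,
    as might be produced by peripheral interpretation (illustrative
    format). Returns the location with the greatest estimated
    probability of allowing propagation, or None if there are no
    candidates.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]

# Three hypothetical peripheral locations with propagation estimates:
print(next_fixation([((10, 4), 0.2), ((3, 7), 0.9), ((8, 1), 0.5)]))
# → (3, 7): the location most likely to permit propagation
```

In a complete system this selection would be interleaved with interpretation: each new fixation refines the percept, which in turn rescores the remaining peripheral candidates.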