
Neural Network Architecture


  
Figure 5: Neural network uses a convolutional architecture to integrate the different sources of information and determine the maximally salient object.

Our tracking algorithm uses the convolutional neural network architecture shown in Figure 5 to locate the salient objects in its visual and auditory fields. The YUVD input images are filtered with separate $16 \times 16$ kernels, denoted $W_Y$, $W_U$, $W_V$, and $W_D$ respectively. This results in the filtered images $\bar{Y}^s$, $\bar{U}^s$, $\bar{V}^s$, and $\bar{D}^s$:


\begin{displaymath}
\bar{Z}^s(i,j) = W_Z \circ Z^s = \sum_{i',j'} W_Z(i',j')\, Z^s(i+i',j+j') \qquad (2)
\end{displaymath}

where $s$ denotes the scale resolution of the inputs, and $Z$ represents any one of the Y, U, V, or D channels. These filtered images correspond to a single layer of hidden units in the neural network.
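As a concrete illustration, here is a minimal NumPy sketch of the per-channel filtering in Eq. (2). The function name filter_channel and the "valid" treatment of image borders are assumptions made for illustration; the paper does not specify how borders are handled.

import numpy as np

def filter_channel(Z, W):
    # Cross-correlate one input channel Z^s with its 16x16 kernel W_Z,
    # per Eq. (2): Zbar(i,j) = sum_{i',j'} W(i',j') Z(i+i', j+j').
    # "Valid" border handling (the output shrinks by 15 rows and 15
    # columns) is an assumption; the paper leaves borders unspecified.
    kh, kw = W.shape
    rows, cols = Z.shape
    out = np.zeros((rows - kh + 1, cols - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * Z[i:i+kh, j:j+kw])
    return out

# Example: filter a synthetic 64x64 luminance channel.
rng = np.random.default_rng(0)
Y = rng.standard_normal((64, 64))
W_Y = rng.standard_normal((16, 16))
Y_bar = filter_channel(Y, W_Y)   # shape (49, 49)

The hidden units are then combined with the one-dimensional auditory correlation function $A(j)$ to form the saliency map $X^s$ in the following manner: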


\begin{displaymath}
X^s(i,j) = c_Y\,g[\bar{Y}^s(i,j)] + c_U\,g[\bar{U}^s(i,j)] + c_V\,g[\bar{V}^s(i,j)] + c_D\,g[\bar{D}^s(i,j)] + c_A\,g[A(j)] + c_0 \qquad (3)
\end{displaymath}

where the sigmoidal nonlinearity is given by $g(x) = \tanh(x)$. Thus, the saliency $X^s$ is computed on a pixel-by-pixel basis using a nonlinear combination of hidden units. The relative importance of the different luminance, chromatic, motion, and auditory channels in the overall saliency of an object is given by the scalar variables $c_Y$, $c_U$, $c_V$, $c_D$, and $c_A$.
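As a sketch of this per-pixel combination, the following assumes the filtered channels from the example above and a hypothetical dictionary c holding the scalar weights; both names are illustrative, not from the paper.

import numpy as np

def saliency_map(Y_bar, U_bar, V_bar, D_bar, A, c, c0):
    # Implements Eq. (3): a weighted sum of tanh-squashed hidden units
    # plus the bias c0. The auditory correlation A(j) depends only on
    # the column index j, so it is broadcast across all rows.
    g = np.tanh
    return (c['Y'] * g(Y_bar) + c['U'] * g(U_bar) +
            c['V'] * g(V_bar) + c['D'] * g(D_bar) +
            c['A'] * g(A)[np.newaxis, :] + c0)

# Example: 49x49 filtered channels and a length-49 auditory correlation.
rng = np.random.default_rng(1)
Y_bar, U_bar, V_bar, D_bar = (rng.standard_normal((49, 49)) for _ in range(4))
A = rng.standard_normal(49)
c = {'Y': 0.5, 'U': 0.2, 'V': 0.2, 'D': 0.8, 'A': 0.6}
X = saliency_map(Y_bar, U_bar, V_bar, D_bar, A, c, c0=-0.1)  # shape (49, 49)

The weight values here are arbitrary placeholders; in the actual system they are learned parameters.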

With the bias term $c_0$, the function $g[X^s(i,j)]$ may be interpreted as the relative probability that the tracked object appears at location $(i,j)$ at input resolution $s$. The final output of the neural network is then determined in a competitive manner by finding the location $(i_m,j_m)$ and scale $s_m$ of the best possible match:


\begin{displaymath}
g[X_m] = g[X^{s_m}(i_m,j_m)] = \max_{i,j,s} g[X^s(i,j)] \qquad (4)
\end{displaymath}
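The competitive selection in Eq. (4) is simply an argmax over all positions and scales. Since $\tanh$ is monotonically increasing, maximizing $g[X^s(i,j)]$ is equivalent to maximizing $X^s(i,j)$ itself. A minimal sketch follows, assuming the per-scale saliency maps are held in a Python list indexed by scale, a representation not specified in the paper.

import numpy as np

def best_match(saliency_maps):
    # saliency_maps: list of 2-D arrays X^s, one per scale s. Maps at
    # different scales may differ in shape, so each is scanned in turn.
    # tanh is monotonic, so the argmax of X^s can be taken directly.
    best_val, best_loc = -np.inf, None
    for s, X in enumerate(saliency_maps):
        i, j = np.unravel_index(np.argmax(X), X.shape)
        if X[i, j] > best_val:
            best_val, best_loc = X[i, j], (i, j, s)
    i_m, j_m, s_m = best_loc
    return i_m, j_m, s_m, np.tanh(best_val)   # location, scale, g[X_m]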

After processing the visual and auditory inputs in this manner, head movements are generated in order to keep the maximally salient object located near the center of the field of view.

