
Neural Network Architecture


  
Figure 5: Neural network uses a convolutional architecture to integrate the different sources of information and determine the maximally salient object.

Our tracking algorithm uses the convolutional neural network architecture shown in Figure 5 to locate the salient objects in its visual and auditory fields. The YUVD input images are filtered with separate $16 \times 16$ kernels, denoted $W_Y$, $W_U$, $W_V$, and $W_D$ respectively. This results in the filtered images $\bar{Y}^s$, $\bar{U}^s$, $\bar{V}^s$, and $\bar{D}^s$:


\begin{displaymath}
\bar{Z}^s(i,j) = W_Z \circ Z^s = \sum_{i',j'} W_Z(i',j')\, Z^s(i+i',j+j') \qquad (2)
\end{displaymath}

where $s$ denotes the scale resolution of the inputs, and $Z$ represents any one of the Y, U, V, or D channels. These filtered images correspond to a single layer of hidden units in the neural network.
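As a concrete illustration, here is a minimal NumPy sketch of the per-channel filtering in Eq. (2). The function name filter_channel and the "valid" treatment of image borders are assumptions made for illustration; the paper does not specify how borders are handled.

import numpy as np

def filter_channel(Z, W):
    # Cross-correlate one input channel Z^s with its 16x16 kernel W_Z,
    # per Eq. (2): Zbar(i,j) = sum_{i',j'} W(i',j') Z(i+i', j+j').
    # "Valid" border handling (the output shrinks by 15 rows and 15
    # columns) is an assumption; the paper leaves borders unspecified.
    kh, kw = W.shape
    rows, cols = Z.shape
    out = np.zeros((rows - kh + 1, cols - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * Z[i:i+kh, j:j+kw])
    return out

# Example: filter a synthetic 64x64 luminance channel.
rng = np.random.default_rng(0)
Y = rng.standard_normal((64, 64))
W_Y = rng.standard_normal((16, 16))
Y_bar = filter_channel(Y, W_Y)   # shape (49, 49)

The hidden units are then combined with the one-dimensional auditory correlation function $A(j)$ to form the saliency map $X^s$ in the following manner: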


\begin{displaymath}
X^s(i,j) = c_Y\,g[\bar{Y}^s(i,j)] + c_U\,g[\bar{U}^s(i,j)] + c_V\,g[\bar{V}^s(i,j)] + c_D\,g[\bar{D}^s(i,j)] + c_A\,g[A(j)] + c_0 \qquad (3)
\end{displaymath}

where the sigmoidal nonlinearity is given by $g(x) = \tanh(x)$. Thus, the saliency $X^s$ is computed on a pixel-by-pixel basis using a nonlinear combination of hidden units. The relative importance of the different luminance, chromatic, motion, and auditory channels in the overall saliency of an object is given by the scalar variables $c_Y$, $c_U$, $c_V$, $c_D$, and $c_A$.
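As a sketch of this per-pixel combination, the following assumes the filtered channels from the example above and a hypothetical dictionary c holding the scalar weights; both names are illustrative, not from the paper.

import numpy as np

def saliency_map(Y_bar, U_bar, V_bar, D_bar, A, c, c0):
    # Implements Eq. (3): a weighted sum of tanh-squashed hidden units
    # plus the bias c0. The auditory correlation A(j) depends only on
    # the column index j, so it is broadcast across all rows.
    g = np.tanh
    return (c['Y'] * g(Y_bar) + c['U'] * g(U_bar) +
            c['V'] * g(V_bar) + c['D'] * g(D_bar) +
            c['A'] * g(A)[np.newaxis, :] + c0)

# Example: 49x49 filtered channels and a length-49 auditory correlation.
rng = np.random.default_rng(1)
Y_bar, U_bar, V_bar, D_bar = (rng.standard_normal((49, 49)) for _ in range(4))
A = rng.standard_normal(49)
c = {'Y': 0.5, 'U': 0.2, 'V': 0.2, 'D': 0.8, 'A': 0.6}
X = saliency_map(Y_bar, U_bar, V_bar, D_bar, A, c, c0=-0.1)  # shape (49, 49)

The weight values here are arbitrary placeholders; in the actual system they are learned parameters.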

With the bias term $c_0$, the function $g[X^s(i,j)]$ may be interpreted as the relative probability that the tracked object appears at location $(i,j)$ at input resolution $s$. The final output of the neural network is then determined in a competitive manner by finding the location $(i_m,j_m)$ and scale $s_m$ of the best possible match:


\begin{displaymath}
g[X_m] = g[X^{s_m}(i_m,j_m)] = \max_{i,j,s} g[X^s(i,j)] \qquad (4)
\end{displaymath}
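The competitive selection in Eq. (4) is simply an argmax over all positions and scales. Since $\tanh$ is monotonically increasing, maximizing $g[X^s(i,j)]$ is equivalent to maximizing $X^s(i,j)$ itself. A minimal sketch follows, assuming the per-scale saliency maps are held in a Python list indexed by scale, a representation not specified in the paper.

import numpy as np

def best_match(saliency_maps):
    # saliency_maps: list of 2-D arrays X^s, one per scale s. Maps at
    # different scales may differ in shape, so each is scanned in turn.
    # tanh is monotonic, so the argmax of X^s can be taken directly.
    best_val, best_loc = -np.inf, None
    for s, X in enumerate(saliency_maps):
        i, j = np.unravel_index(np.argmax(X), X.shape)
        if X[i, j] > best_val:
            best_val, best_loc = X[i, j], (i, j, s)
    i_m, j_m, s_m = best_loc
    return i_m, j_m, s_m, np.tanh(best_val)   # location, scale, g[X_m]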

After processing the visual and auditory inputs in this manner, head movements are generated in order to keep the maximally salient object located near the center of the field of view.

