

Evaluation of IPAQ$^{\rm TM}$ recordings

Our first data set was collected on a device much like those we envision for use with our scheme, namely a Compaq IPAQ$^{\rm TM}$ H3600 series. The IPAQ is a personal digital assistant with a 206 MHz StrongARM CPU, a touch-screen LCD, 16 MB flash ROM, 16 MB SDRAM, a Philips UDA1341T audio codec (i.e., an A/D with data compression), built-in microphones, a compact flash type-II expansion slot, and a speaker output jack. The audio codec has an integrated analog front-end including automatic gain control, which adjusts the signal strength based on the level of the spoken utterance. The sound driver is full duplex and Open Sound System (OSS) compliant [29], though we encountered some signal problems when recording samples at $8$ kHz. For this reason we recorded our samples at $32$ kHz and then downsampled to $8$ kHz offline; see Section 3.
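To make the offline rate conversion concrete, the following is a minimal sketch (ours, not the authors' tooling; the filenames and the choice of SciPy's polyphase resampler are assumptions): an anti-aliasing filter followed by decimation by a factor of 4 takes a $32$ kHz recording down to $8$ kHz.

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

# Read the 32 kHz recording (the filename is a placeholder).
rate, samples = wavfile.read("utterance_32khz.wav")
assert rate == 32000

# resample_poly applies an anti-aliasing low-pass filter and then
# decimates by 4, yielding the 8 kHz rate the scheme operates on.
downsampled = resample_poly(samples.astype(np.float64), up=1, down=4)
wavfile.write("utterance_8khz.wav", 8000, downsampled.astype(np.int16))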

As illustrated in (3) of Section 3, the storage requirements for merely saving the utterance of a passphrase can be significant. To overcome the storage limitations of this particular IPAQ in light of this requirement--and in particular, to permit saving multiple utterances in our user testing--we used a 1 GB IBM Microdrive$^{\rm TM}$ (in the compact flash expansion slot) as a stable store. However, to avoid recording noise from the Microdrive's disk seeks, recordings are first written to a primary partition in volatile memory. When the memory capacity is reached, the Microdrive is mounted, the data are flushed to an ext2 file system on the drive, and the drive is then unmounted. If a wireless connection can be established, the Microdrive can instead be replaced with a wireless network card and the data written to a remote mount point.
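A sketch of this staging-and-flush step, reconstructed under assumptions (the paths, device node, and use of Python are ours, not the authors' code):

import os, shutil, subprocess

RAM_DIR = "/mnt/ramfs/recordings"    # assumed volatile staging partition
DRIVE_DEV = "/dev/hda1"              # assumed Microdrive device node
DRIVE_MNT = "/mnt/microdrive"        # assumed mount point

def flush_to_microdrive():
    # Mount only after recording has stopped, so disk seeks
    # are never picked up by the microphone.
    subprocess.run(["mount", "-t", "ext2", DRIVE_DEV, DRIVE_MNT], check=True)
    try:
        for name in os.listdir(RAM_DIR):
            shutil.move(os.path.join(RAM_DIR, name), DRIVE_MNT)
        os.sync()                    # force data to the drive before unmounting
    finally:
        subprocess.run(["umount", DRIVE_MNT], check=True)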

The IPAQ was used to record utterances from ten users. All ten users were recorded saying the same passphrase multiple times, in this case the address of Carnegie Mellon University: ``Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, Pennsylvania 15213''. Each user was approximately one foot from the IPAQ's microphone, and was required to wait at least one second between pressing the ``record'' button of our recording application and speaking, so as not to interleave the voice signal with the device's attempts to perform automatic gain control. (Since the automatic gain control converges within $0.5$ seconds on the IPAQ, we later discarded the first half second of recorded speech.) Each utterance was separated from the next by approximately one minute. The utterances were recorded in a standard office environment, and as such, background noise was significant. Six recordings of each user--the ``training'' utterances--were used to determine the distinguishing features for that user. The remaining recordings for each user--the ``testing'' utterances, of which there were six per user on average--were each used to generate a feature descriptor. Comparing a feature descriptor to the same user's distinguishing features, to determine the number of distinguishing features with which the descriptor was consistent, counted as a ``true speaker'' trial; comparing it to another user's distinguishing features counted as an ``imposter'' trial.
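The trial bookkeeping can be summarized with a small sketch (ours; it assumes, per the scheme, that a feature descriptor is a vector of bits and that a user's distinguishing features pair a feature index with that user's characteristic bit value):

from typing import Dict, List

def count_matched_features(distinguishing: Dict[int, int],
                           descriptor: List[int]) -> int:
    # distinguishing maps a feature index to the bit value characteristic
    # of the user; descriptor is the binary feature descriptor derived
    # from one test utterance.
    return sum(1 for i, bit in distinguishing.items() if descriptor[i] == bit)

# A "true speaker" trial scores a test descriptor against the same user's
# distinguishing features; an "imposter" trial scores it against another
# user's distinguishing features.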

The results of this analysis are shown in the left side of Figure 4. This graph plots the average number of distinguishing features per user as a function of $k$, and the average number of these that the feature descriptor of a true speaker or an imposter matched.$^{10}$ The error bars on the true speaker and imposter curves show one standard deviation above and below the average.
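The plotted quantities are simple aggregates of the per-trial match counts; a sketch of the aggregation (ours, not the authors' plotting code):

import numpy as np

def curve_point(match_counts):
    # Average match count across all trials at one setting of k,
    # with one standard deviation for the error bars.
    counts = np.asarray(match_counts, dtype=float)
    return counts.mean(), counts.std()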

Figure 4: Evaluation of two data sets; see Sections 5.1 and 5.2. [Two side-by-side plots; the left panel shows the IPAQ data set of Section 5.1, and the right panel (Jacobs.eps) the data set of Section 5.2.]

There are several points worth noting about these results. First, the gap between the ``distinguishing features'' and ``true speaker'' points indicates the number $c_{\rm max}$ of error corrections that would need to be performed during the key regeneration process to achieve a reasonably low false reject rate. For example, if $k = 0.6$, then $c_{\rm max} \approx 5$ should achieve a reasonable false reject rate, and correcting $5$ errors is feasible on today's devices (see Section 4.3). Unfortunately, these data suggest that choosing $k = 0.6$ yields fewer distinguishing features than we would like for security (only $d/m \approx 0.5$). A second point worth noting is that the human imposters, even when saying the same passphrase as the true user, did not match significantly more of the true user's distinguishing features than if they had simply guessed a random feature descriptor (shown by the ``random guessing'' line in Figure 4, which is simply half the ``distinguishing features'' line).
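To see why the ``random guessing'' line is exactly half the ``distinguishing features'' line: each feature in a descriptor takes one of two values, so a randomly guessed descriptor agrees with any fixed distinguishing feature with probability $1/2$. With $d$ distinguishing features, the expected number matched by a random guess is

\[ \mathrm{E}[\mbox{matches}] = \sum_{i=1}^{d} \Pr[\mbox{feature $i$ matched}] = d \cdot \frac{1}{2} = \frac{d}{2}. \]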

