

Front-end Signal Processing

As described in Section 2, the front-end signal processing performed by the device is critical for efficiently generating a key from the user's utterance of her passphrase. Intuitively, the goal of this signal processing is to translate the sound uttered by the user--which is received in the device as a series of amplitude samples from its microphone and analog-to-digital (A/D) converter--into a representation suited for the generation of a feature descriptor (using the algorithms of [16]). Ideally, this signal processing should yield a representation that is as ``clean'' as possible, in that inter-word silence and background noise affect this representation as little as possible. The less silence and background noise in the representation after signal processing, the more consistent the user's utterances will be (thereby increasing $d$ and the security of the scheme) and/or the less error correction will be necessary in the later stages of key generation (i.e., the smaller $c_{\rm max}$ and expression (2) can be).

Of course, the benefits of signal processing in terms of producing a clean representation of the user's utterance must be weighed against the computational cost of the signal processing itself. The challenge is to find the right balance between eliminating environmental effects early via signal processing and relying on the error correction in the key generation step to compensate for the effects of noise and silence in the user's utterance. In this section we describe the series of signal processing steps that we believe best achieves this balance. These steps are pictured in Figure 1 and described below.

Figure 1: Outline of the front-end modules used for capturing the speech and processing the signal to generate a sequence of frames comprising the voice-only portions in the utterance.

As the speaker utters her passphrase, the signal is sampled at a predefined sampling rate, i.e., the number of times the amplitude of the analog signal is recorded per second. The minimum sampling rate supported by our target platform, the IPAQ$^{\rm TM}$ (see Section 5.1), is $32$ kHz; i.e., $32,000$ samples are taken per second. Each sample is represented by a fixed number of bits. The more bits there are per sample, the better the resolution of the reconstructed signal, but the more storage is required for saving and processing the utterance. In our implementation, we represent the signal using $16$ bits per sample. Therefore, the amount of storage required per second of recorded speech is

\begin{displaymath}
32000 \frac{\rm samples}{\rm second} \times 2 \frac{\rm bytes}{\rm sample} = 64 \frac{\rm kilobytes}{\rm second}
\end{displaymath} (3)

Since the utterance of one of our tested passphrases can easily be $6$ seconds or more, the storage requirements for processing even a single utterance can be significant for a resource-constrained device: at $64$ kilobytes per second, a $6$-second utterance occupies roughly $384$ kilobytes. This is especially true since, as we have found by experience, writing to stable storage while the recording is ongoing can introduce noise into the recording. In our case, this posed a particular issue for our experimental evaluation, in which we needed to acquire many samples from the speaker; see Section 5.1.

To make subsequent processing on the device efficient, our implementation first makes several modifications to the recorded speech to reduce the number of samples. In particular, we down-sample the signal from $32,000$ samples per second to $8,000$ samples per second, effectively achieving an $8$ kHz sampling rate. For most voice-related applications, a sampling rate of $8$ kHz is sufficient to reconstruct the speech signal; in fact, nearly all phone companies in North America use a sampling rate of $8$ kHz [21].

Down-sampling must be performed with some care, however, due to the sampling theorem [20]. The sampling theorem tells us that the sampling rate must exceed twice the highest frequency present in the signal to guarantee an accurate and unique representation of the signal. Failure to obey this rule can result in an effect called aliasing, in which higher frequencies are falsely reconstructed as lower frequencies. Down-sampling to $8$ kHz therefore implies that only frequencies up to $4$ kHz can be accurately represented by the samples. Consequently, when down-sampling to $8$ kHz we first apply a low-pass digital filter with a cutoff at $4$ kHz, which passes only frequencies of $4$ kHz or less and strips higher frequencies from the signal before samples are discarded.
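
For concreteness, a sketch of this anti-aliased down-sampling step is shown below. It uses SciPy's decimation routine, which applies a low-pass filter before discarding samples; the function name and the assumption that the recording is already available as a NumPy array of samples are ours, not part of the implementation described in this paper.

\begin{verbatim}
import numpy as np
from scipy.signal import decimate

def downsample_32k_to_8k(samples_32k: np.ndarray) -> np.ndarray:
    """Reduce a 32 kHz recording to an effective 8 kHz sampling rate.

    decimate() applies a low-pass (anti-aliasing) filter before keeping
    every 4th sample, so frequencies above 4 kHz are attenuated rather
    than aliased into the 0-4 kHz band.
    """
    x = samples_32k.astype(np.float64)
    # Decimation factor 4: 32,000 samples/s -> 8,000 samples/s.
    return decimate(x, 4, ftype="fir", zero_phase=True)
\end{verbatim}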

After down-sampling, the signal is broken into $30$ millisecond (ms) windows, each overlapping the next by $10$ ms. Within each window are $240$ samples (since $8,000$ samples/second $\times$ $0.03$ seconds $=$ $240$ samples). Overlapping windows by $10$ ms avoids discontinuities from one window to the next, and additional smoothing is performed within each window to yield as smooth a signal as possible.
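
The windowing step might be sketched as follows. The text above does not specify the within-window smoothing, so the example assumes a standard Hamming window purely for illustration; with $30$ ms windows that overlap their successors by $10$ ms, consecutive windows start $20$ ms apart.

\begin{verbatim}
import numpy as np

def frame_signal(x: np.ndarray, rate: int = 8000,
                 win_ms: float = 30.0, overlap_ms: float = 10.0) -> np.ndarray:
    """Split a signal into overlapping, smoothed windows.

    At 8 kHz, a 30 ms window holds 240 samples, and a 10 ms overlap
    between consecutive windows gives a 20 ms (160-sample) hop.  The
    Hamming taper is one common smoothing choice (an assumption here).
    """
    win = int(rate * win_ms / 1000)                   # 240 samples
    hop = int(rate * (win_ms - overlap_ms) / 1000)    # 160 samples
    if len(x) < win:
        return np.empty((0, win))
    taper = np.hamming(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.empty((n_frames, win))
    for i in range(n_frames):
        frames[i] = x[i * hop : i * hop + win] * taper
    return frames
\end{verbatim}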

The goal of the next signal processing steps is to derive a frame from each window. A frame is a $12$-dimensional vector of real numbers called cepstral coefficients [20], which have been shown to be a very robust and reliable feature set for speech and speaker recognition. These cepstral coefficients are obtained using a technique called autocorrelation analysis. The basic premise behind this analysis is that each speech sample can be approximated as a linear combination of past speech samples. The extraction of a frame of cepstral coefficients using autocorrelation analysis involves highly specialized algorithms that we do not detail here, but that are standard in speech processing (linear predictive coding [10]).
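
To make the computation concrete, the following sketch derives $12$ cepstral coefficients from one window using the textbook autocorrelation (Levinson-Durbin) method followed by the usual LPC-to-cepstrum recursion. The predictor order of $12$ is an illustrative assumption; this is standard linear predictive coding as in [10], not necessarily the authors' exact implementation.

\begin{verbatim}
import numpy as np

def lpc_cepstrum(window: np.ndarray, order: int = 12,
                 n_ceps: int = 12) -> np.ndarray:
    """Compute LPC-derived cepstral coefficients for one window.

    1. Autocorrelation of the (smoothed) window samples.
    2. Levinson-Durbin recursion for the prediction coefficients
       a[1..p], modeling each sample as a linear combination of the
       previous p samples.
    3. Standard recursion converting the LPC coefficients into
       cepstral coefficients.
    The predictor order (12) is an assumed value, not taken from the paper.
    """
    # Step 1: autocorrelation r[0..p].
    r = np.array([np.dot(window[:len(window) - k], window[k:])
                  for k in range(order + 1)])
    if r[0] <= 0:                      # silent (all-zero) window
        return np.zeros(n_ceps)

    # Step 2: Levinson-Durbin recursion.
    a = np.zeros(order + 1)            # a[1..p] are the LPC coefficients
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:
            break
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        err *= (1.0 - k * k)

    # Step 3: LPC -> cepstrum recursion.
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= order else 0.0
        for k in range(max(1, n - order), n):
            acc += (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]
\end{verbatim}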

A side effect of generating frames is a calculation of the energy of the signal per window. The energy of a window grows with the average squared amplitude of its samples and is typically measured in decibels (dB). Energy can be used to remove frames representing silence (which has very low energy) from the frame sequence, via a process called endpoint detection [13]. The silence frames are then removed and the remaining voice frames concatenated.
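
A minimal version of this energy-based silence removal might look like the sketch below; the fixed decibel threshold is only an illustrative stand-in for the endpoint detection algorithm of [13].

\begin{verbatim}
import numpy as np

def remove_silence(frames: np.ndarray, windows: np.ndarray,
                   threshold_db: float = -40.0) -> np.ndarray:
    """Drop frames whose window energy falls below a threshold.

    `windows` holds the time-domain samples of each window and `frames`
    the corresponding cepstral vectors (one row per window).  Energy is
    the sum of squared sample amplitudes, expressed in dB relative to
    the loudest window.  The -40 dB cutoff is an assumed value, not the
    endpoint detection procedure cited in the paper.
    """
    energy = np.sum(windows ** 2, axis=1) + 1e-12        # per-window energy
    energy_db = 10.0 * np.log10(energy / energy.max())   # dB below the peak
    keep = energy_db > threshold_db
    # Concatenate the voice-only frames, preserving their order.
    return frames[keep]
\end{verbatim}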

Final modifications to the frame sequence are made via a technique called cepstral mean subtraction. In this technique, the component-wise mean over all frames is computed and subtracted from every frame in the sequence. Intuitively, if the mean vector is representative of the background noise or the channel characteristics in the recording environment, then subtracting that mean vector from all the frames yields a frame sequence that is more robust in representing the user's voice.
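
Cepstral mean subtraction itself reduces to a single vector operation over the frame sequence, as in this sketch:

\begin{verbatim}
import numpy as np

def cepstral_mean_subtraction(frames: np.ndarray) -> np.ndarray:
    """Subtract the component-wise mean over all frames from each frame.

    If the mean vector captures stationary background or channel
    characteristics, the residual frames represent the speaker's voice
    more robustly.
    """
    return frames - frames.mean(axis=0)
\end{verbatim}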

After this, the speech data is segmented and converted to a sequence of bits (a feature descriptor) as described in [16]. This feature descriptor is used to regenerate the secret key from the previously stored table of shares, as described in Section 2.

