Speex narrowband mode

This section looks at how Speex works for narrowband ($8\:\mathrm{kHz}$ sampling rate) operation. The frame size for this mode is $20\:\mathrm{ms}$, corresponding to 160 samples. Each frame is also subdivided into 4 sub-frames of 40 samples each.

Also many design decisions were based on the original goals and assumptions:


Whole-Frame Analysis

In narrowband, Speex frames are 20 ms long (160 samples) and are subdivided into 4 sub-frames of 5 ms each (40 samples). For most narrowband bit-rates (8 kbps and above), the only parameters encoded at the frame level are the Line Spectral Pairs (LSP) and a global excitation gain $g_{frame}$, as shown in Fig. 3. All other parameters are encoded at the sub-frame level.

Linear prediction analysis is performed once per frame using an asymmetric Hamming window centered on the fourth sub-frame. Because linear prediction coefficients (LPC) are not robust to quantization, they are first converted to line spectral pairs (LSP). The LSPs are considered to be associated with the $4^{th}$ sub-frame, and the LSPs associated with the first 3 sub-frames are linearly interpolated using the current and previous LSP coefficients. The quantized LSP coefficients are then converted back to the LPC filter $\hat{A}(z)$. The non-quantized interpolated filter is denoted $A(z)$ and can be used for the weighting filter $W(z)$ because it does not need to be available to the decoder.
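The interpolation step can be sketched as follows. This is only an illustration, not the reference implementation: the function name `interpolate_lsp` and the interpolation weights (1/4, 2/4, 3/4, 1) are assumptions chosen so that the $4^{th}$ sub-frame receives the current frame's LSPs unchanged, as described above.

```python
def interpolate_lsp(prev_lsp, curr_lsp, num_subframes=4):
    """Return one LSP vector per sub-frame, linearly interpolated
    between the previous frame's LSPs and the current frame's LSPs."""
    if len(prev_lsp) != len(curr_lsp):
        raise ValueError("LSP vectors must have the same order")
    per_subframe = []
    for k in range(num_subframes):
        # Assumed weight of the current LSPs: 1/4, 2/4, 3/4, then 1,
        # so the last sub-frame uses the current LSPs as they are.
        w = (k + 1) / num_subframes
        per_subframe.append(
            [(1.0 - w) * p + w * c for p, c in zip(prev_lsp, curr_lsp)]
        )
    return per_subframe


if __name__ == "__main__":
    for k, lsp in enumerate(interpolate_lsp([0.0, 1.0], [1.0, 3.0]), start=1):
        print("sub-frame", k, lsp)
```

The same routine would apply to both the quantized and the non-quantized LSPs, giving the per-sub-frame filters $\hat{A}(z)$ and $A(z)$ after conversion back to LPC.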

To make Speex more robust to packet loss, no prediction is applied on the LSP coefficients prior to quantization. The LSPs are encoded using vector quantization (VQ) with 30 bits for higher quality modes and 18 bits for lower quality.

Figure 3: Frame open-loop analysis

Sub-Frame Analysis-by-Synthesis

Figure 4: Analysis-by-synthesis closed-loop optimization on a sub-frame.

The analysis-by-synthesis (AbS) encoder loop is described in Fig. 4. There are three main aspects where Speex significantly differs from most other CELP codecs. First, while most recent CELP codecs make use of fractional pitch estimation with a single gain, Speex uses an integer to encode the pitch period, but uses a 3-tap predictor (3 gains). The adaptive codebook contribution $ e_{a}[n]$ can thus be expressed as:

$\displaystyle e_{a}[n]=g_{0}e[n-T-1]+g_{1}e[n-T]+g_{2}e[n-T+1]$ (2)

where $g_{0}$, $g_{1}$ and $g_{2}$ are the jointly quantized pitch gains and $e[n]$ is the codec excitation memory. It is worth noting that when the pitch period is smaller than the sub-frame size, the excitation is repeated with period $T$. For example, when $n-T+1\geq0$, we use $n-2T+1$ instead. In most modes, the pitch period is encoded with 7 bits in the $\left[17,144\right]$ range and the pitch gains $g_{i}$ are vector-quantized using 7 bits at higher bit-rates (15 kbps narrowband and above) and 5 bits at lower bit-rates (11 kbps narrowband and below).
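As a rough illustration of equation (2), the sketch below computes the adaptive codebook contribution for one sub-frame. The function name and the list-based excitation memory are assumptions made for this example. The repetition rule is generalized by subtracting $T$ until the index falls inside the memory, which is an assumption that covers short pitch periods where a single subtraction is not enough.

```python
def adaptive_codebook(exc_memory, pitch, gains, subframe_size=40):
    """3-tap pitch prediction over one sub-frame.

    exc_memory: past excitation samples, most recent sample last.
    pitch:      integer pitch period T.
    gains:      (g0, g1, g2), assumed to be already dequantized.
    """
    g0, g1, g2 = gains
    mem_len = len(exc_memory)
    if mem_len < pitch + 1:
        raise ValueError("excitation memory shorter than pitch + 1")

    def past(k):
        # k is relative to the start of the sub-frame; k >= 0 would point
        # at samples that do not exist yet, so step back by whole periods.
        while k >= 0:
            k -= pitch
        return exc_memory[mem_len + k]

    return [
        g0 * past(n - pitch - 1) + g1 * past(n - pitch) + g2 * past(n - pitch + 1)
        for n in range(subframe_size)
    ]


if __name__ == "__main__":
    memory = [float(i % 17) for i in range(150)]  # period-17 test signal
    print(adaptive_codebook(memory, 17, (0.0, 1.0, 0.0))[:5])
```

With gains $(0, 1, 0)$ and a perfectly periodic memory, the output simply continues the periodic signal, which is a convenient sanity check.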

The second difference concerns the fixed codebook gain. Many current CELP codecs use moving average (MA) prediction to encode it, which provides slightly better coding at the expense of introducing a dependency on previously encoded frames. Speex instead encodes the fixed codebook gain as the product of the global excitation gain $g_{frame}$ and a sub-frame gain correction $g_{subf}$. This increases robustness to packet loss by eliminating the inter-frame dependency. The sub-frame gain correction is encoded before the fixed codebook is searched (not closed-loop optimized) and uses between 0 and 3 bits per sub-frame, depending on the bit-rate.

The third difference is that Speex uses sub-vector quantization of the innovation (fixed codebook) signal instead of an algebraic codebook. Each sub-frame is divided into sub-vectors of lengths ranging between 5 and 20 samples. Each sub-vector is chosen from a bitrate-dependent codebook and all sub-vectors are concatenated to form a sub-frame. As an example, the 3.95 kbps mode uses a sub-vector size of 20 samples with 32 entries in the codebook (5 bits). This means that the innovation is encoded with 10 bits per sub-frame (2 sub-vectors of 5 bits), or 2000 bps at 200 sub-frames per second. On the other hand, the 18.2 kbps mode uses a sub-vector size of 5 samples with 256 entries in the codebook (8 bits), so the innovation uses 64 bits per sub-frame (8 sub-vectors of 8 bits), or 12800 bps.
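The arithmetic in the two examples above can be reproduced with a few lines. The helper name `innovation_bits` is made up for this sketch, and the constants come from the 20 ms frame and 40-sample sub-frame sizes given earlier.

```python
SUBFRAME_SIZE = 40        # samples per sub-frame
SUBFRAMES_PER_FRAME = 4
FRAMES_PER_SECOND = 50    # one frame every 20 ms


def innovation_bits(subvector_size, codebook_entries):
    """Return (bits per sub-frame, bits per second) spent on the innovation."""
    if SUBFRAME_SIZE % subvector_size != 0:
        raise ValueError("sub-vector size must divide the sub-frame size")
    # Bits needed to index one codebook entry (5 for 32 entries, 8 for 256).
    bits_per_subvector = (codebook_entries - 1).bit_length()
    subvectors = SUBFRAME_SIZE // subvector_size
    bits_per_subframe = subvectors * bits_per_subvector
    bps = bits_per_subframe * SUBFRAMES_PER_FRAME * FRAMES_PER_SECOND
    return bits_per_subframe, bps


if __name__ == "__main__":
    print(innovation_bits(20, 32))   # 3.95 kbps mode
    print(innovation_bits(5, 256))   # 18.2 kbps mode
```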

Bit allocation

Table 3 defines 9 narrowband modes (IDs 0 to 8) for Speex, with bit-rates ranging from 250 bps to 24.6 kbps, although the modes below 5.9 kbps should not be used for speech. The bit allocation for each mode is detailed in table 3. Each frame starts with the wideband bit, followed by the mode ID encoded with 4 bits, which allows a range from 0 to 15. Values 0 to 8 select a coding mode; the remaining values are reserved or have a special meaning (see table 4). The parameters are listed in the table in the order they are packed in the bit-stream. All frame-based parameters are packed before sub-frame parameters. The parameters for a given sub-frame are all packed before the following sub-frame is packed. Note that the ``OL'' in the parameter description means that the parameter is an open-loop estimate based on the whole frame.
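To make the packing order concrete, here is a small sketch that reads the first two fields of a frame in the order listed in table 3 (1 wideband bit, then the 4-bit mode ID). The helper names are hypothetical, and the most-significant-bit-first ordering within each byte is an assumption of this example rather than something stated in this section.

```python
def read_bits(data, bit_pos, count):
    """Read `count` bits from `data` starting at `bit_pos`.

    Assumes bits are packed most-significant bit first within each byte.
    Returns (value, new bit position).
    """
    value = 0
    for i in range(count):
        byte = data[(bit_pos + i) // 8]
        bit = (byte >> (7 - (bit_pos + i) % 8)) & 1
        value = (value << 1) | bit
    return value, bit_pos + count


def parse_frame_header(data):
    """Return (wideband bit, mode ID, bit position after the header)."""
    wideband, pos = read_bits(data, 0, 1)
    mode_id, pos = read_bits(data, pos, 4)
    return wideband, mode_id, pos


if __name__ == "__main__":
    # 0 (narrowband) followed by 0110 (mode 6), then padding bits.
    print(parse_frame_header(bytes([0b00110000])))
```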


Table 3: Bit allocation for narrowband modes
                             Mode ID
Parameter         Update rate    0    1    2    3    4    5    6    7    8
Wideband bit      frame          1    1    1    1    1    1    1    1    1
Mode ID           frame          4    4    4    4    4    4    4    4    4
LSP               frame          0   18   18   18   18   30   30   30   18
OL pitch          frame          0    7    7    0    0    0    0    0    7
OL pitch gain     frame          0    4    0    0    0    0    0    0    4
OL Exc gain       frame          0    5    5    5    5    5    5    5    5
Fine pitch        sub-frame      0    0    0    7    7    7    7    7    0
Pitch gain        sub-frame      0    0    5    5    5    7    7    7    0
Innovation gain   sub-frame      0    1    0    1    1    3    3    3    0
Innovation VQ     sub-frame      0    0   16   20   35   48   64   96   10
Total             frame          5   43  119  160  220  300  364  492   79
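The totals in table 3 can be checked by counting frame-level parameters once and sub-frame parameters four times. The sketch below does this with the values copied from the table; the data layout and function names are only illustrative. Multiplying by 50 frames per second gives the bit-rate of each mode.

```python
SUBFRAMES_PER_FRAME = 4
FRAMES_PER_SECOND = 50  # one frame every 20 ms

# (parameter, update rate, bits for modes 0..8), copied from table 3.
ALLOCATION = [
    ("Wideband bit",    "frame",     [1, 1, 1, 1, 1, 1, 1, 1, 1]),
    ("Mode ID",         "frame",     [4, 4, 4, 4, 4, 4, 4, 4, 4]),
    ("LSP",             "frame",     [0, 18, 18, 18, 18, 30, 30, 30, 18]),
    ("OL pitch",        "frame",     [0, 7, 7, 0, 0, 0, 0, 0, 7]),
    ("OL pitch gain",   "frame",     [0, 4, 0, 0, 0, 0, 0, 0, 4]),
    ("OL Exc gain",     "frame",     [0, 5, 5, 5, 5, 5, 5, 5, 5]),
    ("Fine pitch",      "sub-frame", [0, 0, 0, 7, 7, 7, 7, 7, 0]),
    ("Pitch gain",      "sub-frame", [0, 0, 5, 5, 5, 7, 7, 7, 0]),
    ("Innovation gain", "sub-frame", [0, 1, 0, 1, 1, 3, 3, 3, 0]),
    ("Innovation VQ",   "sub-frame", [0, 0, 16, 20, 35, 48, 64, 96, 10]),
]


def frame_bits(mode):
    """Total number of bits in one frame for the given mode ID."""
    total = 0
    for _name, rate, bits in ALLOCATION:
        repeat = SUBFRAMES_PER_FRAME if rate == "sub-frame" else 1
        total += bits[mode] * repeat
    return total


def bit_rate(mode):
    """Bit-rate in bits per second for the given mode ID."""
    return frame_bits(mode) * FRAMES_PER_SECOND


if __name__ == "__main__":
    for mode in range(9):
        print(mode, frame_bits(mode), bit_rate(mode))
```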


So far, no MOS (Mean Opinion Score) subjective evaluation has been performed for Speex. In order to give an idea of the quality achievable with it, table 4 presents my own subjective opinion on it. It should be noted that different people will perceive the quality differently and that the person who designed the codec often has a bias (one way or another) when it comes to subjective evaluation. Finally, it should be noted that for most codecs (including Speex) encoding quality sometimes varies depending on the input. Note that the complexity is only approximate (within 0.5 mflops and using the lowest complexity setting). Decoding requires approximately 0.5 mflops in most modes (1 mflops with perceptual enhancement).


Table 4: Quality versus bit-rate
Mode  Quality  Bit-rate (bps)  mflops  Quality/description
0     -        250             0       No transmission (DTX)
1     0        2,150           6       Vocoder (mostly for comfort noise)
2     2        5,950           9       Very noticeable artifacts/noise, good intelligibility
3     3-4      8,000           10      Artifacts/noise sometimes noticeable
4     5-6      11,000          14      Artifacts usually noticeable only with headphones
5     7-8      15,000          11      Need good headphones to tell the difference
6     9        18,200          17.5    Hard to tell the difference even with good headphones
7     10       24,600          14.5    Completely transparent for voice, good quality music
8     1        3,950           10.5    Very noticeable artifacts/noise, good intelligibility
9     -        -               -       reserved
10    -        -               -       reserved
11    -        -               -       reserved
12    -        -               -       reserved
13    -        -               -       Application-defined, interpreted by callback or skipped
14    -        -               -       Speex in-band signaling
15    -        -               -       Terminator code



Perceptual enhancement

This section was only valid for version 1.1.12 and earlier. It does not apply to version 1.2-beta1 (and later), for which the new perceptual enhancement is not yet documented.

This part of the codec only applies to the decoder and can even be changed without affecting inter-operability. For that reason, the implementation provided and described here should only be considered as a reference implementation. The enhancement system is divided into two parts. First, the synthesis filter $ S(z)=1/A(z)$ is replaced by an enhanced filter:

$\displaystyle S'(z)=\frac{A\left(z/a_{2}\right)A\left(z/a_{3}\right)}{A\left(z\right)A\left(z/a_{1}\right)}$

where $a_{1}$ and $a_{2}$ depend on the mode in use and $a_{3}=\frac{1}{r}\left(1-\frac{1-ra_{1}}{1-ra_{2}}\right)$ with $r=0.9$. The second part of the enhancement consists of using a comb filter to enhance the pitch in the excitation domain.
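For illustration, the sketch below evaluates $a_{3}$ and builds the coefficients of a bandwidth-expanded filter $A(z/a)$ from those of $A(z)$. The function names are made up for this example, and the values of $a_{1}$ and $a_{2}$ used in the demo are hypothetical placeholders, since the actual mode-dependent values are not given here. Rearranging the definition of $a_{3}$ gives $(1-ra_{3})(1-ra_{2})=1-ra_{1}$, which the demo prints as a check.

```python
def enhancement_a3(a1, a2, r=0.9):
    """a3 = (1/r) * (1 - (1 - r*a1) / (1 - r*a2))."""
    return (1.0 / r) * (1.0 - (1.0 - r * a1) / (1.0 - r * a2))


def bandwidth_expand(lpc, a):
    """Coefficients of A(z/a), given A(z) = sum_k lpc[k] * z^-k.

    Replacing z by z/a scales the k-th coefficient by a**k.
    """
    return [c * a ** k for k, c in enumerate(lpc)]


if __name__ == "__main__":
    r = 0.9
    a1, a2 = 0.5, 0.7          # hypothetical values, for illustration only
    a3 = enhancement_a3(a1, a2, r)
    print("a3 =", a3)
    # Both sides of (1 - r*a3)(1 - r*a2) = 1 - r*a1 should agree.
    print((1 - r * a3) * (1 - r * a2), 1 - r * a1)
    print(bandwidth_expand([1.0, -2.0, 4.0], 0.5))
```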

Jean-Marc Valin 2007-05-23