Programming with Speex (the libspeex API)

This section explains how to use the Speex API. Examples of code can also be found in Appendix B and the complete API documentation is included in the Documentation section of the Speex website (http://www.speex.org/).

Encoding

In order to encode speech using Speex, one first needs to:

: #include <speex/speex.h>

Then a Speex bit-packing struct must be declared as:

: SpeexBits bits;

along with a Speex encoder state

: void *enc_state;

The two are initialized by:

: speex_bits_init(&bits);
enc_state = speex_encoder_init(&speex_nb_mode);

For wideband coding, replace speex_nb_mode with speex_wb_mode. In most cases, you will need to know the frame size used by the mode you are using. You can get that value in the frame_size variable (expressed in samples, not bytes) with:

: speex_encoder_ctl(enc_state,SPEEX_GET_FRAME_SIZE,&frame_size);

In practice, frame_size will correspond to 20 ms when using 8, 16, or 32 kHz sampling rate. There are many parameters that can be set for the Speex encoder, but the most useful one is the quality parameter that controls the quality vs bit-rate tradeoff. This is set by:

: speex_encoder_ctl(enc_state,SPEEX_SET_QUALITY,&quality);

where quality is an integer value ranging from 0 to 10 (inclusive). The mapping between quality and bit-rate is described in Fig. 4 for narrowband.
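Since frame_size is expressed in samples, the 20 ms figure mentioned above translates directly into a sample count for a given sampling rate. The helper below is only an illustrative sketch: frame_samples() is a hypothetical name, not part of libspeex, and in real code the value should always be obtained with SPEEX_GET_FRAME_SIZE.

```c
#include <assert.h>

/* Hypothetical helper, not part of libspeex: number of samples contained
   in a frame of frame_ms milliseconds at sampling_rate Hz. */
int frame_samples(int sampling_rate, int frame_ms)
{
    assert(sampling_rate > 0 && frame_ms > 0);
    return (int)((long)sampling_rate * frame_ms / 1000);
}
```

With these assumptions, frame_samples(8000, 20) gives 160 samples and frame_samples(16000, 20) gives 320.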

Once the initialization is done, for every input frame:

: speex_bits_reset(&bits);
speex_encode_int(enc_state, input_frame, &bits);
nbBytes = speex_bits_write(&bits, byte_ptr, MAX_NB_BYTES);

where input_frame is a (short *) pointing to the beginning of a speech frame, byte_ptr is a (char *) where the encoded frame will be written, MAX_NB_BYTES is the maximum number of bytes that can be written to byte_ptr without causing an overflow and nbBytes is the number of bytes actually written to byte_ptr (the encoded size in bytes). Before calling speex_bits_write, it is possible to find the number of bytes that need to be written by calling speex_bits_nbytes(&bits).
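When choosing MAX_NB_BYTES, a rough estimate of the encoded frame size can be derived from the bit-rate and the frame duration. The sketch below assumes constant bit-rate operation; encoded_frame_bytes() is a hypothetical helper, and the authoritative size for a given frame remains the value returned by speex_bits_nbytes() or speex_bits_write().

```c
#include <assert.h>

/* Hypothetical helper, not part of libspeex: approximate number of bytes
   per encoded frame at a constant bit-rate, rounded up to a whole byte. */
int encoded_frame_bytes(int bitrate_bps, int frame_ms)
{
    long bits;
    assert(bitrate_bps > 0 && frame_ms > 0);
    bits = (long)bitrate_bps * frame_ms / 1000;  /* bits in one frame */
    return (int)((bits + 7) / 8);                /* round up to bytes */
}
```

For example, 8000 bps with 20 ms frames gives 160 bits, that is 20 bytes per frame. Leaving some headroom above this estimate is prudent, since it ignores any in-band data.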

It is still possible to use the speex_encode() function, which takes a (float *) for the audio. However, this would make an eventual port to an FPU-less platform (like ARM) more complicated. Internally, speex_encode() and speex_encode_int() are processed in the same way. Whether the encoder uses the fixed-point version is only decided by the compile-time flags, not at the API level.

After you're done with the encoding, free all resources with:

: speex_bits_destroy(&bits);
speex_encoder_destroy(enc_state);

That's about it for the encoder.

Decoding

In order to decode speech using Speex, you first need to:

: #include <speex/speex.h>

You also need to declare a Speex bit-packing struct

: SpeexBits bits;

and a Speex decoder state

: void *dec_state;

The two are initialized by:

: speex_bits_init(&bits);
dec_state = speex_decoder_init(&speex_nb_mode);

For wideband decoding, replace speex_nb_mode with speex_wb_mode. If you need to obtain the size of the frames that will be used by the decoder, you can get that value in the frame_size variable (expressed in samples, not bytes) with:

: speex_decoder_ctl(dec_state, SPEEX_GET_FRAME_SIZE, &frame_size);

There is also a parameter that can be set for the decoder: whether or not to use a perceptual enhancer. This can be set by:

: speex_decoder_ctl(dec_state, SPEEX_SET_ENH, &enh);

where enh is an int with value 0 to have the enhancer disabled and 1 to have it enabled. As of 1.2-beta1, the default is now to enable the enhancer.

Again, once the decoder initialization is done, for every input frame:

: speex_bits_read_from(&bits, input_bytes, nbBytes);
speex_decode_int(dec_state, &bits, output_frame);

where input_bytes is a (char *) containing the bit-stream data received for a frame, nbBytes is the size (in bytes) of that bit-stream, and output_frame is a (short *) that points to the area where the decoded speech frame will be written. Passing NULL as the second argument of speex_decode_int() (in place of &bits) indicates that we don't have the bits for the current frame, for example because the frame was lost. In that case, the Speex decoder will do its best to "guess" the correct signal.

As with the encoder, the speex_decode() function can still be used, with a (float *) as the output for the audio.

After you're done with the decoding, free all resources with:

: speex_bits_destroy(&bits);
speex_decoder_destroy(dec_state);

Preprocessor

In order to use the Speex preprocessor, you first need to:

: #include <speex/speex_preprocess.h>

Then, a preprocessor state can be created as:

: SpeexPreprocessState *preprocess_state = speex_preprocess_state_init(frame_size, sampling_rate);

It is recommended to use the same value for frame_size as is used by the encoder (20 ms).

For each input frame, you need to call:

: speex_preprocess_run(preprocess_state, audio_frame);

where audio_frame is used both as input and output.

In cases where the output audio is not useful for a certain frame, it is possible to use instead:

: speex_preprocess_estimate_update(preprocess_state, audio_frame);

This call will update all the preprocessor internal state variables without computing the output audio, thus saving some CPU cycles.

The behaviour of the preprocessor can be changed using:

: speex_preprocess_ctl(preprocess_state, request, ptr);

which is used in the same way as the encoder and decoder equivalent. Options are listed under "Preprocessor options".

The preprocessor state can be destroyed using:

: speex_preprocess_state_destroy(preprocess_state);

Echo Cancellation

The Speex library now includes an echo cancellation algorithm suitable for Acoustic Echo Cancellation (AEC). In order to use the echo canceller, you first need to:

: #include <speex/speex_echo.h>

Then, an echo canceller state can be created by:

: SpeexEchoState *echo_state = speex_echo_state_init(frame_size, filter_length);

where frame_size is the amount of data (in samples) you want to process at once and filter_length is the length (in samples) of the echo cancelling filter you want to use (also known as tail length). It is recommended to use a frame size on the order of 20 ms (or equal to the codec frame size) and to make sure it is easy to perform an FFT of that size (powers of two are better than prime sizes). The recommended tail length is approximately one third of the room reverberation time. For example, in a small room, reverberation time is on the order of 300 ms, so a tail length of 100 ms is a good choice (800 samples at 8000 Hz sampling rate).
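The rule of thumb above (tail length of about one third of the reverberation time) can be turned into a sample count for filter_length. This is a minimal sketch; tail_length_samples() is a hypothetical helper name, not part of libspeex, and the reverberation time is something you have to estimate for your own room.

```c
#include <assert.h>

/* Hypothetical helper, not part of libspeex: echo tail length in samples,
   taken as one third of the room reverberation time. */
int tail_length_samples(int reverb_time_ms, int sampling_rate)
{
    assert(reverb_time_ms > 0 && sampling_rate > 0);
    /* (reverb_time_ms / 3) ms converted to samples, in one division */
    return (int)((long)reverb_time_ms * sampling_rate / 3000);
}
```

For the small-room example, tail_length_samples(300, 8000) gives 800 samples, matching the figure quoted above.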

Once the echo canceller state is created, audio can be processed by:

: speex_echo_cancellation(echo_state, input_frame, echo_frame, output_frame);

where input_frame is the audio as captured by the microphone, echo_frame is the signal that was played in the speaker (and needs to be removed) and output_frame is the signal with echo removed.

One important thing to keep in mind is the relationship between input_frame and echo_frame. It is important that, at any time, any echo that is present in the input has already been sent to the echo canceller as echo_frame. In other words, the echo canceller cannot remove a signal that it hasn't yet received. On the other hand, the delay between the input signal and the echo signal must be small, because otherwise part of the echo cancellation filter is wasted. In the ideal case, your code would look like:

: write_to_soundcard(echo_frame, frame_size);
read_from_soundcard(input_frame, frame_size);
speex_echo_cancellation(echo_state, input_frame, echo_frame, output_frame);

If you wish to further reduce the echo present in the signal, you can do so by associating the echo canceller with the preprocessor (see Section 5.3). This is done by calling:

: speex_preprocess_ctl(preprocess_state, SPEEX_PREPROCESS_SET_ECHO_STATE, echo_state);

in the initialisation.

As of version 1.2-beta2, there is an alternative, simpler API that can be used instead of speex_echo_cancellation(). When audio capture and playback are handled asynchronously (e.g. in different threads or using the poll() or select() system call), it can be difficult to keep track of which input_frame goes with which echo_frame. Instead, the playback context/thread can simply call:

: speex_echo_playback(echo_state, echo_frame);

every time an audio frame is played. Then, the capture context/thread calls:

: speex_echo_capture(echo_state, input_frame, output_frame);

for every frame captured. Internally, speex_echo_playback() simply buffers the playback frame so it can be used by speex_echo_capture() to call speex_echo_cancellation(). A side effect of using this alternate API is that the playback audio is delayed by two frames, which is the normal delay caused by the soundcard. When capture and playback are already synchronised, speex_echo_cancellation() is preferable since it gives better control over the exact input/echo timing.
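The buffering idea behind the playback/capture pair can be illustrated with a small frame FIFO: the playback side pushes each frame it plays, and the capture side pops the oldest one to use as the echo reference. This is a simplified sketch, not the libspeex implementation; PlaybackFifo, fifo_init(), fifo_push() and fifo_pop() are hypothetical names, the frame size and capacity are assumed values, and a real multi-threaded program would also need locking around these calls.

```c
#include <assert.h>
#include <string.h>

#define FRAME_SIZE  160  /* samples per frame (assumed: 20 ms at 8 kHz) */
#define FIFO_FRAMES 4    /* capacity in frames (arbitrary for this sketch) */

/* Hypothetical FIFO of playback frames; NOT the libspeex implementation. */
typedef struct {
    short frames[FIFO_FRAMES][FRAME_SIZE];
    int head;   /* index of the oldest stored frame */
    int count;  /* number of frames currently stored */
} PlaybackFifo;

void fifo_init(PlaybackFifo *f)
{
    f->head = 0;
    f->count = 0;
}

/* Called from the playback side; returns -1 if the FIFO is full. */
int fifo_push(PlaybackFifo *f, const short *frame)
{
    int tail;
    assert(f != NULL && frame != NULL);
    if (f->count == FIFO_FRAMES)
        return -1;
    tail = (f->head + f->count) % FIFO_FRAMES;
    memcpy(f->frames[tail], frame, sizeof f->frames[tail]);
    f->count++;
    return 0;
}

/* Called from the capture side; yields silence and returns -1 if empty. */
int fifo_pop(PlaybackFifo *f, short *frame)
{
    assert(f != NULL && frame != NULL);
    if (f->count == 0) {
        memset(frame, 0, FRAME_SIZE * sizeof(short));
        return -1;
    }
    memcpy(frame, f->frames[f->head], sizeof f->frames[f->head]);
    f->head = (f->head + 1) % FIFO_FRAMES;
    f->count--;
    return 0;
}
```

In this sketch the capture side would pop a frame and pass it as echo_frame to the echo canceller together with the frame it just recorded.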

The echo cancellation state can be destroyed with:

: speex_echo_state_destroy(echo_state);

It is also possible to reset the state of the echo canceller so it can be reused without the need to create another state with:

: speex_echo_state_reset(echo_state);

Troubleshooting

There are several things that may prevent the echo canceller from working properly. One of them is a bug (or something suboptimal) in the code, but there are many others you should consider first:

Using different soundcards for capture and playback will *not* work, regardless of what you may think. The only exception to that is if the two cards can be made to have their sampling clock "locked" on the same clock source.
The delay between the record and playback signals must be minimal. Any signal played has to "appear" on the playback (far end) signal slightly before the echo canceller "sees" it in the near end signal, but excessive delay means that part of the filter length is wasted. In the worst situations, the delay is longer than the filter length, in which case no echo can be cancelled.
When it comes to echo tail length (filter length), longer is *not* better. Actually, the longer the tail length, the longer it takes for the filter to adapt. Of course, a tail length that is too short will not cancel enough echo, but the most common problem seen is that people set a very long tail length and then wonder why no echo is being cancelled.
Non-linear distortion cannot (by definition) be modeled by the linear adaptive filter used in the echo canceller and thus cannot be cancelled. Use good audio gear and avoid saturation/clipping.
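The point about delay can be made concrete: whatever delay separates the playback signal from its echo in the capture signal is subtracted from the part of the filter that can actually model the room. The following is a minimal sketch with a hypothetical helper, usable_tail_samples(), which is not part of libspeex; the delay itself has to be measured or estimated for your own audio path.

```c
#include <assert.h>

/* Hypothetical helper, not part of libspeex: number of filter taps (in
   samples) still available to model the echo once the delay between the
   playback and capture signals has been accounted for. */
int usable_tail_samples(int filter_length, int delay_samples)
{
    assert(filter_length > 0 && delay_samples >= 0);
    if (delay_samples >= filter_length)
        return 0;   /* delay covers the whole filter: nothing is cancelled */
    return filter_length - delay_samples;
}
```

With an 800-sample filter and a 200-sample delay, only 600 samples of tail remain; with a 900-sample delay the result is 0, which corresponds to the worst case described above.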

Also useful is reading Echo Cancellation Demystified by Alexey Frunze, which explains the fundamental principles of echo cancellation. The details of the algorithm described in the article are different, but the general ideas of echo cancellation through adaptive filters are the same.

As of version 1.2-beta2, a new echo_diagnostic.m tool is included in the source distribution. To use it, first define DUMP_ECHO_CANCEL_DATA during the build. This causes the echo canceller to automatically save the near-end, far-end and output signals to files (aec_rec.sw, aec_play.sw and aec_out.sw). These are exactly what the AEC receives and outputs. From there, it is necessary to start Octave and type:

: echo_diagnostic('aec_rec.sw', 'aec_play.sw', 'aec_diagnostic.sw', 1024);

The value of 1024 is the filter length and can be changed. There will be some (hopefully) useful messages printed and echo cancelled audio will be saved to aec_diagnostic.sw. If even that output is bad (almost no cancellation) then there is probably a problem with the playback or recording process.

Jitter Buffer

There are two jitter buffers. Both can be enabled by including:

: #include <speex/speex_jitter.h>

Generic Jitter Buffer

Speex Jitter Buffer

Resampler

As of version 1.2-beta2, Speex includes a resampling module. To make use of the resampler, it is necessary to include its header file:

: #include <speex/speex_resampler.h>

For each stream that is to be resampled, it is necessary to create a resampler state with:

: SpeexResamplerState *resampler;
resampler = speex_resampler_init(nb_channels, input_rate, output_rate, quality, &err);

where nb_channels is the number of channels that will be used (either interleaved or non-interleaved), input_rate is the sampling rate of the input stream, output_rate is the sampling rate of the output stream and quality is the requested quality setting (0 to 10). The quality parameter is useful for controlling the quality/complexity/latency tradeoff. Using a higher quality setting means less noise/aliasing, a higher complexity and a higher latency. Usually, a quality of 3 is acceptable for most desktop uses and quality 10 is mostly recommended for pro audio work. Quality 0 usually has a decent sound (certainly better than using linear interpolation resampling), but artifacts may be heard.

The actual resampling is performed using

: err = speex_resampler_process_int(resampler, channelID, in, &in_length, out, &out_length);

where channelID is the ID of the channel to be processed. For a mono stream, use 0. The in pointer points to the first sample of the input buffer for the selected channel and out points to the first sample of the output. The sizes of the input and output buffers are specified by in_length and out_length respectively. Upon completion, these values are replaced by the number of samples read and written by the resampler. Unless an error occurs, either all input samples will be read or all output samples will be written to (or both). For floating-point samples, the function speex_resampler_process_float() behaves similarly.
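To size the output buffer, the expected number of output samples can be estimated from the ratio of the two sampling rates. The helper below is a sketch with a hypothetical name, estimated_output_samples(), not part of the resampler API. It is only an estimate: the number of samples actually produced by a given call can differ slightly, so it is prudent to leave some slack and to rely on the values written back to in_length and out_length.

```c
#include <assert.h>

/* Hypothetical helper, not part of the Speex resampler API: estimated
   number of output samples for in_length input samples, rounded up. */
unsigned int estimated_output_samples(unsigned int in_length,
                                      unsigned int input_rate,
                                      unsigned int output_rate)
{
    unsigned long long num;
    assert(input_rate > 0 && output_rate > 0);
    num = (unsigned long long)in_length * output_rate;
    return (unsigned int)((num + input_rate - 1) / input_rate);
}
```

For example, 160 samples at 8000 Hz resampled to 16000 Hz correspond to about 320 output samples.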

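It is also possible to process multiple (interleaved) channels at once, using speex_resampler_process_interleaved_int() or speex_resampler_process_interleaved_float().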

Codec Options (speex_*_ctl)

Entities should not be multiplied beyond necessity - William of Ockham.

Just because there's an option doesn't mean you have to use it - me.

The Speex encoder and decoder support many options and requests that can be accessed through the speex_encoder_ctl and speex_decoder_ctl functions. Despite that, the defaults are good for many applications and optional settings should only be used when one understands them and knows that they are needed. A common error is to attempt to set many unnecessary settings. These functions are similar to the ioctl system call and their prototypes are:

: void speex_encoder_ctl(void *encoder, int request, void *ptr);
void speex_decoder_ctl(void *decoder, int request, void *ptr);

The different values of request allowed are (note that some only apply to the encoder or the decoder):

SPEEX_SET_ENH**: Set perceptual enhancer to on (1) or off (0) (integer)
SPEEX_GET_ENH**: Get perceptual enhancer status (integer)
SPEEX_GET_FRAME_SIZE: Get the number of samples per frame for the current mode (integer)
SPEEX_SET_QUALITY*: Set the encoder speech quality (integer 0 to 10)
SPEEX_GET_QUALITY*: Get the current encoder speech quality (integer 0 to 10)
SPEEX_SET_MODE* †: Use the source, Luke!
SPEEX_GET_MODE* †: Use the source, Luke!
SPEEX_SET_LOW_MODE* †: Use the source, Luke!
SPEEX_GET_LOW_MODE* †: Use the source, Luke!
SPEEX_SET_HIGH_MODE* †: Use the source, Luke!
SPEEX_GET_HIGH_MODE* †: Use the source, Luke!
SPEEX_SET_VBR*: Set variable bit-rate (VBR) to on (1) or off (0) (integer)
SPEEX_GET_VBR*: Get variable bit-rate (VBR) status (integer)
SPEEX_SET_VBR_QUALITY*: Set the encoder VBR speech quality (float 0 to 10)
SPEEX_GET_VBR_QUALITY*: Get the current encoder VBR speech quality (float 0 to 10)
SPEEX_SET_COMPLEXITY*: Set the CPU resources allowed for the encoder (integer 1 to 10)
SPEEX_GET_COMPLEXITY*: Get the CPU resources allowed for the encoder (integer 1 to 10)
SPEEX_SET_BITRATE*: Set the bit-rate to use to the closest value not exceeding the parameter (integer in bps)
SPEEX_GET_BITRATE: Get the current bit-rate in use (integer in bps)
SPEEX_SET_SAMPLING_RATE: Set real sampling rate (integer in Hz)
SPEEX_GET_SAMPLING_RATE: Get real sampling rate (integer in Hz)
SPEEX_RESET_STATE: Reset the encoder/decoder state to its original state (zeros all memories)
SPEEX_SET_VAD*: Set voice activity detection (VAD) to on (1) or off (0) (integer)
SPEEX_GET_VAD*: Get voice activity detection (VAD) status (integer)
SPEEX_SET_DTX*: Set discontinuous transmission (DTX) to on (1) or off (0) (integer)
SPEEX_GET_DTX*: Get discontinuous transmission (DTX) status (integer)
SPEEX_SET_ABR*: Set average bit-rate (ABR) to a value n in bits per second (integer in bps)
SPEEX_GET_ABR*: Get average bit-rate (ABR) setting (integer in bps)
SPEEX_SET_PLC_TUNING*: Tell the encoder to optimize encoding for a certain percentage of packet loss (integer in percent)
SPEEX_GET_PLC_TUNING*: Get the current tuning of the encoder for PLC (integer in percent)
*: applies only to the encoder
**: applies only to the decoder
†: normally only used internally

Mode queries

Speex modes have a query system similar to the speex_encoder_ctl and speex_decoder_ctl calls. Since modes are read-only, it is only possible to get information about a particular mode. The function used to do that is:

: void speex_mode_query(SpeexMode *mode, int request, void *ptr);

The admissible values for request are (unless otherwise noted, the values are returned through ptr):

SPEEX_MODE_FRAME_SIZE: Get the frame size (in samples) for the mode
SPEEX_SUBMODE_BITRATE: Get the bit-rate for a submode number specified through ptr (integer in bps).

Preprocessor options

SPEEX_PREPROCESS_SET_DENOISE: Turns denoising on (1) or off (0) (integer)
SPEEX_PREPROCESS_GET_DENOISE: Get denoising status (integer)
SPEEX_PREPROCESS_SET_AGC: Turns automatic gain control (AGC) on (1) or off (0) (integer)
SPEEX_PREPROCESS_GET_AGC: Get AGC status (integer)
SPEEX_PREPROCESS_SET_VAD: Turns voice activity detector (VAD) on (1) or off (0) (integer)
SPEEX_PREPROCESS_GET_VAD: Get VAD status (integer)
SPEEX_PREPROCESS_SET_AGC_LEVEL
SPEEX_PREPROCESS_GET_AGC_LEVEL
SPEEX_PREPROCESS_SET_DEREVERB: Turns reverberation removal on (1) or off (0) (integer)
SPEEX_PREPROCESS_GET_DEREVERB: Get reverberation removal status (integer)
SPEEX_PREPROCESS_SET_DEREVERB_LEVEL
SPEEX_PREPROCESS_GET_DEREVERB_LEVEL
SPEEX_PREPROCESS_SET_DEREVERB_DECAY
SPEEX_PREPROCESS_GET_DEREVERB_DECAY
SPEEX_PREPROCESS_SET_PROB_START
SPEEX_PREPROCESS_GET_PROB_START
SPEEX_PREPROCESS_SET_PROB_CONTINUE
SPEEX_PREPROCESS_GET_PROB_CONTINUE
SPEEX_PREPROCESS_SET_NOISE_SUPPRESS: Set maximum attenuation of the noise in dB (negative number)
SPEEX_PREPROCESS_GET_NOISE_SUPPRESS: Get maximum attenuation of the noise in dB (negative number)
SPEEX_PREPROCESS_SET_ECHO_SUPPRESS: Set maximum attenuation of the residual echo in dB (negative number)
SPEEX_PREPROCESS_GET_ECHO_SUPPRESS: Get maximum attenuation of the residual echo in dB (negative number)
SPEEX_PREPROCESS_SET_ECHO_SUPPRESS_ACTIVE: Set maximum attenuation of the echo in dB when near end is active (negative number)
SPEEX_PREPROCESS_GET_ECHO_SUPPRESS_ACTIVE: Get maximum attenuation of the echo in dB when near end is active (negative number)
SPEEX_PREPROCESS_SET_ECHO_STATE: Set the associated echo canceller for residual echo suppression (NULL for no residual echo suppression)
SPEEX_PREPROCESS_GET_ECHO_STATE: Get the associated echo canceller

Packing and in-band signalling

Sometimes it is desirable to pack more than one frame per packet (or other basic unit of storage). The proper way to do it is to call speex_encode N times before writing the stream with speex_bits_write. In cases where the number of frames is not determined by an out-of-band mechanism, it is possible to include a terminator code. That terminator consists of the code 15 (decimal) encoded with 5 bits, as shown in Table 4. Note that as of version 1.0.2, calling speex_bits_write automatically inserts the terminator so as to fill the last byte. This doesn't involve any overhead and makes sure Speex can always detect when there are no more frames in a packet.

It is also possible to send in-band "messages" to the other side. All these messages are encoded as "pseudo-frames" of mode 14 which contain a 4-bit message type code, followed by the message. Table 1 lists the available codes, their meaning and the size of the message that follows. Most of these messages are requests that are sent to the encoder or decoder on the other end, which is free to comply or ignore them. By default, all in-band messages are ignored.

Table 1: In-band signalling codes

Code	Size (bits)	Content
0	1	Asks decoder to set perceptual enhancement off (0) or on (1)
1	1	Asks (if 1) the encoder to be less "aggressive" due to high packet loss
2	4	Asks encoder to switch to mode N
3	4	Asks encoder to switch to mode N for low-band
4	4	Asks encoder to switch to mode N for high-band
5	4	Asks encoder to switch to quality N for VBR
6	4	Request acknowledge (0=no, 1=all, 2=only for in-band data)
7	4	Asks encoder to set CBR (0), VAD(1), DTX(3), VBR(5), VBR+DTX(7)
8	8	Transmit (8-bit) character to the other end
9	8	Intensity stereo information
10	16	Announce maximum bit-rate acceptable (N in bytes/second)
11	16	reserved
12	32	Acknowledge receiving packet N
13	32	reserved
14	64	reserved
15	64	reserved
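The layout of an in-band pseudo-frame described above can be sketched with a small bit writer. This is an illustration only: BitWriter, bw_init(), bw_put() and put_inband() are hypothetical names and not the SpeexBits API. The sketch assumes that the mode number (14) is written with 5 bits, like the terminator code, and that bits are packed most significant bit first. In a real application the same fields would be appended to the SpeexBits struct, for instance with speex_bits_pack().

```c
#include <assert.h>
#include <string.h>

/* Hypothetical MSB-first bit writer (not the SpeexBits API). */
typedef struct {
    unsigned char buf[32];
    int nbits;              /* number of bits written so far */
} BitWriter;

void bw_init(BitWriter *w)
{
    memset(w, 0, sizeof *w);
}

/* Append the nbits least-significant bits of value, MSB first. */
void bw_put(BitWriter *w, unsigned long long value, int nbits)
{
    int i;
    assert(nbits >= 0 && nbits <= 64);
    assert(w->nbits + nbits <= (int)sizeof(w->buf) * 8);
    for (i = nbits - 1; i >= 0; i--) {
        if ((value >> i) & 1ULL)
            w->buf[w->nbits >> 3] |= (unsigned char)(0x80 >> (w->nbits & 7));
        w->nbits++;
    }
}

/* Payload size in bits for each message code, copied from Table 1. */
static const int inband_payload_bits[16] = {
    1, 1, 4, 4, 4, 4, 4, 4, 8, 8, 16, 16, 32, 32, 64, 64
};

/* Write one in-band pseudo-frame: mode 14, 4-bit code, then the payload.
   Assumption: the mode number occupies 5 bits, like the terminator. */
void put_inband(BitWriter *w, int code, unsigned long long payload)
{
    assert(code >= 0 && code < 16);
    bw_put(w, 14, 5);
    bw_put(w, (unsigned long long)code, 4);
    bw_put(w, payload, inband_payload_bits[code]);
}
```

Under these assumptions, put_inband(&w, 0, 1) produces the 10 bits 01110 0000 1, which is a request asking the decoder to turn perceptual enhancement on.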

Finally, applications may define custom in-band messages using mode 13. The size of the message in bytes is encoded with 5 bits, so that the decoder can skip it if it doesn't know how to interpret it.

Jean-Marc Valin 2007-05-23