Programming with Speex (the libspeex API)
This section explains how to use the Speex API. Examples of code can
also be found in Appendix B and the complete
API documentation is included in the Documentation section of the
Speex website (http://www.speex.org/).
Encoding
In order to encode speech using Speex, one first needs to:
-
- #include <speex/speex.h>
Then a Speex bit-packing struct must be declared as:
-
- SpeexBits bits;
along with a Speex encoder state
-
- void *enc_state;
The two are initialized by:
-
- speex_bits_init(&bits);
enc_state = speex_encoder_init(&speex_nb_mode);
For wideband coding, speex_nb_mode will be replaced by speex_wb_mode.
In most cases, you will need to know the frame size used by the mode
you are using. You can get that value in the frame_size variable
(expressed in samples, not bytes) with:
-
- speex_encoder_ctl(enc_state,SPEEX_GET_FRAME_SIZE,&frame_size);
In practice, frame_size will correspond to 20 ms when using
8, 16, or 32 kHz sampling rate. There are many parameters that can
be set for the Speex encoder, but the most useful one is the quality
parameter that controls the quality vs bit-rate tradeoff. This is
set by:
-
- speex_encoder_ctl(enc_state,SPEEX_SET_QUALITY,&quality);
where quality is an integer value ranging from 0 to 10 (inclusive).
The mapping between quality and bit-rate is described in Fig. 4
for narrowband.
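As a sanity check on the 20 ms figure above, the frame size in samples follows directly from the sampling rate. The helper below is a hypothetical illustration (frame_samples is not a libspeex function); in real code the value should still be read with SPEEX_GET_FRAME_SIZE.

```c
#include <assert.h>

/* Hypothetical helper, not part of libspeex: number of samples in a frame
 * lasting frame_ms milliseconds at sampling_rate_hz. */
int frame_samples(int sampling_rate_hz, int frame_ms)
{
    assert(sampling_rate_hz > 0 && frame_ms > 0);
    return sampling_rate_hz * frame_ms / 1000;
}
```

With a 20 ms frame this gives 160, 320 and 640 samples at 8, 16 and 32 kHz respectively, which is the size input_frame needs to have.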
Once the initialization is done, for every input frame:
-
- speex_bits_reset(&bits);
speex_encode_int(enc_state, input_frame, &bits);
nbBytes = speex_bits_write(&bits, byte_ptr, MAX_NB_BYTES);
where input_frame is a (short *) pointing
to the beginning of a speech frame, byte_ptr is a (char
*) where the encoded frame will be written, MAX_NB_BYTES
is the maximum number of bytes that can be written to byte_ptr
without causing an overflow and nbBytes is the number of bytes
actually written to byte_ptr (the encoded size in bytes).
Before calling speex_bits_write, the number of bytes that will be
written can be obtained by calling speex_bits_nbytes(&bits).
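When choosing MAX_NB_BYTES, it can help to estimate the encoded size of one frame ahead of time. The sketch below assumes a constant bit-rate and a 20 ms frame; both helper names are hypothetical and the bit-rates in the usage note are only illustrative.

```c
#include <assert.h>

/* Hypothetical helper: bits produced for one frame at a constant bit-rate. */
int bits_per_frame(int bitrate_bps, int frame_ms)
{
    assert(bitrate_bps >= 0 && frame_ms >= 0);
    return bitrate_bps * frame_ms / 1000;
}

/* Round a bit count up to a whole number of bytes, since a byte buffer
 * cannot hold a partial byte. */
int bytes_for_bits(int nbits)
{
    assert(nbits >= 0);
    return (nbits + 7) / 8;
}
```

For example, 8000 bps over 20 ms is 160 bits, or exactly 20 bytes, while 15000 bps is 300 bits, which rounds up to 38 bytes.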
It is still possible to use the speex_encode() function, which
takes a (float *) for the audio. However, this would make
an eventual port to an FPU-less platform (like ARM) more complicated.
Internally, speex_encode() and speex_encode_int()
are processed in the same way. Whether the encoder uses the fixed-point
version is only decided by the compile-time flags, not at the API
level.
After you're done with the encoding, free all resources with:
-
- speex_bits_destroy(&bits);
speex_encoder_destroy(enc_state);
That's about it for the encoder.
Decoding
In order to decode speech using Speex, you first need to:
-
- #include <speex/speex.h>
You also need to declare a Speex bit-packing struct
-
- SpeexBits bits;
and a Speex decoder state
-
- void *dec_state;
The two are initialized by:
-
- speex_bits_init(&bits);
dec_state = speex_decoder_init(&speex_nb_mode);
For wideband decoding, speex_nb_mode will be replaced by
speex_wb_mode. If you need to obtain the size of the frames
that will be used by the decoder, you can get that value in the frame_size
variable (expressed in samples, not bytes) with:
-
- speex_decoder_ctl(dec_state, SPEEX_GET_FRAME_SIZE, &frame_size);
There is also a parameter that can be set for the decoder: whether
or not to use a perceptual enhancer. This can be set by:
-
- speex_decoder_ctl(dec_state, SPEEX_SET_ENH, &enh);
where enh is an int with value 0 to have the enhancer disabled
and 1 to have it enabled. As of 1.2-beta1, the default is now to enable
the enhancer.
Again, once the decoder initialization is done, for every input frame:
-
- speex_bits_read_from(&bits, input_bytes, nbBytes);
speex_decode_int(dec_state, &bits, output_frame);
where input_bytes is a (char *) containing the bit-stream
data received for a frame, nbBytes is the size (in bytes) of
that bit-stream, and output_frame is a (short *)
and points to the area where the decoded speech frame will be written.
Passing NULL as the second argument of speex_decode_int() (in place
of &bits) indicates that the bits for the current frame are not
available. When a frame is lost, the Speex decoder will do its best
to "guess" the correct signal.
As for the encoder, the speex_decode() function can still
be used, with a (float *) as the output for the audio.
After you're done with the decoding, free all resources with:
-
- speex_bits_destroy(&bits);
speex_decoder_destroy(dec_state);
Preprocessor
In order to use the Speex preprocessor, you first
need to:
-
- #include <speex/speex_preprocess.h>
Then, a preprocessor state can be created as:
-
- SpeexPreprocessState *preprocess_state = speex_preprocess_state_init(frame_size, sampling_rate);
It is recommended to use the same value for frame_size as
is used by the encoder (20 ms).
For each input frame, you need to call:
-
- speex_preprocess_run(preprocess_state, audio_frame);
where audio_frame is used both as input and output.
In cases where the output audio is not useful for a certain frame,
it is possible to use instead:
-
- speex_preprocess_estimate_update(preprocess_state, audio_frame);
This call will update all the preprocessor internal state variables
without computing the output audio, thus saving some CPU cycles.
The behaviour of the preprocessor can be changed using:
-
- speex_preprocess_ctl(preprocess_state, request, ptr);
which is used in the same way as the encoder and decoder equivalent.
Options are listed in the Preprocessor options section.
The preprocessor state can be destroyed using:
-
- speex_preprocess_state_destroy(preprocess_state);
Echo Cancellation
The Speex library now includes an echo cancellation
algorithm suitable for Acoustic Echo Cancellation
(AEC). In order to use the echo canceller, you first need to
-
- #include <speex/speex_echo.h>
Then, an echo canceller state can be created by:
-
- SpeexEchoState *echo_state = speex_echo_state_init(frame_size, filter_length);
where frame_size is the amount of data (in samples) you
want to process at once and filter_length is the length
(in samples) of the echo cancelling filter you want to use (also known
as tail length). It is recommended to
use a frame size on the order of 20 ms (or equal to the codec frame
size) and make sure it is easy to perform an FFT of that size (powers
of two are better than prime sizes). The recommended tail length is
approximately one third of the room reverberation time. For example,
in a small room, reverberation time is on the order of 300 ms, so
a tail length of 100 ms is a good choice (800 samples at 8000 Hz sampling
rate).
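The tail-length rule of thumb above can be turned into a small calculation. This is only a sketch: tail_length_samples is a hypothetical helper, and the reverberation time has to be measured or estimated by the application.

```c
#include <assert.h>

/* Hypothetical helper: filter_length (in samples) for a room with the
 * given reverberation time, using the "one third" rule of thumb. */
int tail_length_samples(int reverb_ms, int sampling_rate_hz)
{
    assert(reverb_ms > 0 && sampling_rate_hz > 0);
    /* (reverb_ms / 3) ms converted to samples, done in one division to
     * avoid truncating twice. */
    return reverb_ms * sampling_rate_hz / 3000;
}
```

For the small room in the example (300 ms at 8000 Hz), this yields 800 samples, which would be passed as filter_length to speex_echo_state_init().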
Once the echo canceller state is created, audio can be processed by:
-
- speex_echo_cancellation(echo_state, input_frame, echo_frame, output_frame);
where input_frame is the audio as captured by the microphone,
echo_frame is the signal that was played in the speaker
(and needs to be removed) and output_frame is the signal
with echo removed.
One important thing to keep in mind is the relationship between input_frame
and echo_frame. It is important that, at any time, any echo
that is present in the input has already been sent to the echo canceller
as echo_frame. In other words, the echo canceller cannot
remove a signal that it hasn't yet received. On the other hand, the
delay between the input signal and the echo signal must be small enough
because otherwise part of the echo cancellation filter is wasted.
In the ideal case, your code would look like:
-
- write_to_soundcard(echo_frame, frame_size);
read_from_soundcard(input_frame, frame_size);
speex_echo_cancellation(echo_state, input_frame, echo_frame, output_frame);
If you wish to further reduce the echo present in the signal, you
can do so by associating the echo canceller to the preprocessor
(see Section 5.3). This is done by calling:
-
- speex_preprocess_ctl(preprocess_state, SPEEX_PREPROCESS_SET_ECHO_STATE, echo_state);
in the initialisation.
As of version 1.2-beta2, there is an alternative, simpler API that
can be used instead of speex_echo_cancellation(). When audio
capture and playback are handled asynchronously (e.g. in different
threads or using the poll() or select() system call),
it can be difficult to keep track of what input_frame comes with
what echo_frame. Instead, the playback context/thread can simply
call:
-
- speex_echo_playback(echo_state, echo_frame);
every time an audio frame is played. Then, the capture context/thread
calls:
-
- speex_echo_capture(echo_state, input_frame, output_frame);
for every frame captured. Internally, speex_echo_playback()
simply buffers the playback frame so it can be used by speex_echo_capture()
to call speex_echo_cancel(). A side effect of using this
alternate API is that the playback audio is delayed by two frames,
which is the normal delay caused by the soundcard. When capture and
playback are already synchronised, speex_echo_cancellation()
is preferable since it gives better control on the exact input/echo
timing.
The echo cancellation state can be destroyed with:
-
- speex_echo_state_destroy(echo_state);
It is also possible to reset the state of the echo canceller so it
can be reused without the need to create another state with:
-
- speex_echo_state_reset(echo_state);
There are several things that may prevent the echo canceller from
working properly. One of them is a bug (or something suboptimal) in
the code, but there are many others you should consider first:
- Using a different soundcard to do the capture and playback will *not*
work, regardless of what you may think. The only exception to that
is if the two cards can be made to have their sampling clock "locked"
on the same clock source.
- The delay between the record and playback signals must be minimal.
Any signal played has to "appear" on the playback (far end)
signal slightly before the echo canceller "sees" it in the near
end signal, but excessive delay means that part of the filter length
is wasted. In the worst situations, the delay is such that it is longer
than the filter length, in which case, no echo can be cancelled.
- When it comes to echo tail length (filter length), longer is *not*
better. Actually, the longer the tail length, the longer it takes
for the filter to adapt. Of course, a tail length that is too short
will not cancel enough echo, but the most common problem seen is that
people set a very long tail length and then wonder why no echo is
being cancelled.
- Non-linear distortion cannot (by definition) be modeled by the linear
adaptive filter used in the echo canceller and thus cannot be cancelled.
Use good audio gear and avoid saturation/clipping.
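The interaction between delay and filter length described in the list above can be pictured with a deliberately simplified model: every sample of playback-to-capture delay uses up one tap of the filter. The function below is a hypothetical illustration of that reasoning, not something libspeex provides.

```c
#include <assert.h>

/* Hypothetical model: number of filter taps left to model the room echo
 * once the delay between the far-end (playback) and near-end (capture)
 * signals has been absorbed. A result of 0 means no echo can be cancelled. */
int effective_tail(int filter_length, int delay_samples)
{
    assert(filter_length > 0);
    if (delay_samples < 0)
        return 0;   /* echo seen before its reference was supplied */
    if (delay_samples >= filter_length)
        return 0;   /* delay covers the whole filter */
    return filter_length - delay_samples;
}
```

With an 800-sample filter, a delay of 160 samples leaves 640 taps for the actual echo path, while a delay of 800 samples or more leaves nothing.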
Also useful is reading Echo Cancellation Demystified by Alexey
Frunze, which explains the fundamental principles of echo cancellation.
The details of the algorithm described in the article are different,
but the general ideas of echo cancellation through adaptive filters
are the same.
As of version 1.2-beta2, a new echo_diagnostic.m tool is
included in the source distribution. The first step is to define DUMP_ECHO_CANCEL_DATA
during the build. This causes the echo canceller to automatically
save the near-end, far-end and output signals to files (aec_rec.sw,
aec_play.sw and aec_out.sw). These are exactly what the AEC receives
and outputs. From there, it is necessary to start Octave and type:
-
- echo_diagnostic('aec_rec.sw', 'aec_play.sw', 'aec_diagnostic.sw', 1024);
The value of 1024 is the filter length and can be changed. There will
be some (hopefully) useful messages printed and echo cancelled audio
will be saved to aec_diagnostic.sw. If even that output is bad (almost
no cancellation) then there is probably a problem with the playback
or recording process.
Jitter Buffer
There are two jitter buffers. Both can be enabled by including:
-
- #include <speex/speex_jitter.h>
Resampler
As of version 1.2-beta2, Speex includes a resampling module. To make
use of the resampler, it is necessary to include its header file:
-
- #include <speex/speex_resampler.h>
For each stream that is to be resampled, it is necessary to create
a resampler state with:
-
- SpeexResamplerState *resampler;
resampler = speex_resampler_init(nb_channels, input_rate, output_rate, quality, &err);
where nb_channels is the number of channels that will be used (either
interleaved or non-interleaved), input_rate is the sampling rate
of the input stream, output_rate is the sampling rate of the output
stream and quality is the requested quality setting (0 to 10). The
quality parameter is useful for controlling the quality/complexity/latency
tradeoff. Using a higher quality setting means less noise/aliasing,
a higher complexity and a higher latency. Usually, a quality of 3
is acceptable for most desktop uses and quality 10 is mostly recommended
for pro audio work. Quality 0 usually has a decent sound (certainly
better than using linear interpolation resampling), but artifacts
may be heard.
The actual resampling is performed using
-
- err = speex_resampler_process_int(resampler, channelID, in, &in_length, out, &out_length);
where channelID is the ID of the channel to be processed. For a mono
stream, use 0. The in pointer points to the first sample of
the input buffer for the selected channel and out points to
the first sample of the output. The sizes of the input and output buffers
are specified by in_length and out_length respectively.
Upon completion, these values are replaced by the number of samples
read and written by the resampler. Unless an error occurs, either
all input samples will be read or all output samples will be written
to (or both). For floating-point samples, the function speex_resampler_process_float()
behaves similarly.
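Because the resampler stops as soon as either buffer is exhausted, the output buffer should be large enough for the input to be consumed in one call. The sketch below estimates that size from the rate ratio; resampled_length is a hypothetical helper, and since the exact count per call can vary slightly with the resampler's internal state, leaving a few samples of margin is a reasonable precaution.

```c
#include <assert.h>

/* Hypothetical helper: approximate number of output samples produced for
 * in_len input samples, rounded up. */
unsigned int resampled_length(unsigned int in_len,
                              unsigned int input_rate,
                              unsigned int output_rate)
{
    unsigned long long n;
    assert(input_rate > 0);
    /* 64-bit intermediate so in_len * output_rate cannot overflow. */
    n = (unsigned long long)in_len * output_rate;
    return (unsigned int)((n + input_rate - 1) / input_rate);
}
```

For instance, 160 samples at 8000 Hz become about 320 samples at 16000 Hz, and 441 samples at 44100 Hz become about 480 samples at 48000 Hz.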
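It is also possible to process multiple interleaved channels at once,
using speex_resampler_process_interleaved_int() or
speex_resampler_process_interleaved_float().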
Codec Options (speex_*_ctl)
Entities should not be multiplied beyond necessity - William
of Ockham.
Just because there's an option doesn't mean you have to use
it - me.
The Speex encoder and decoder support many options and requests that
can be accessed through the speex_encoder_ctl and speex_decoder_ctl
functions. Despite that, the defaults are good for many applications
and optional settings should only be used when one understands
them and knows that they are needed. A common error is to attempt
to set many unnecessary settings. These functions are similar to the
ioctl system call and their prototypes are:
-
- void speex_encoder_ctl(void *encoder, int request, void *ptr);
void speex_decoder_ctl(void *decoder, int request, void *ptr);
The different values of request allowed are (note that some only apply
to the encoder or the decoder):
- SPEEX_SET_ENH**
- Set perceptual enhancer
to on (1) or off (0) (integer)
- SPEEX_GET_ENH**
- Get perceptual enhancer status (integer)
- SPEEX_GET_FRAME_SIZE
- Get the number of samples per frame for
the current mode (integer)
- SPEEX_SET_QUALITY*
- Set the encoder speech quality (integer
0 to 10)
- SPEEX_GET_QUALITY*
- Get the current encoder speech quality
(integer 0 to 10)
- SPEEX_SET_MODE*
- Use the source, Luke!
- SPEEX_GET_MODE*
- Use the source, Luke!
- SPEEX_SET_LOW_MODE*
- Use the source, Luke!
- SPEEX_GET_LOW_MODE*
- Use the source, Luke!
- SPEEX_SET_HIGH_MODE*
- Use the source, Luke!
- SPEEX_GET_HIGH_MODE*
- Use the source, Luke!
- SPEEX_SET_VBR*
- Set variable bit-rate (VBR) to on (1) or off
(0) (integer)
- SPEEX_GET_VBR*
- Get variable bit-rate
(VBR) status (integer)
- SPEEX_SET_VBR_QUALITY*
- Set the encoder VBR speech quality
(float 0 to 10)
- SPEEX_GET_VBR_QUALITY*
- Get the current encoder VBR speech
quality (float 0 to 10)
- SPEEX_SET_COMPLEXITY*
- Set the CPU resources allowed for the
encoder (integer 1 to 10)
- SPEEX_GET_COMPLEXITY*
- Get the CPU resources allowed for the
encoder (integer 1 to 10)
- SPEEX_SET_BITRATE*
- Set the bit-rate to use to the closest
value not exceeding the parameter (integer in bps)
- SPEEX_GET_BITRATE
- Get the current bit-rate in use (integer
in bps)
- SPEEX_SET_SAMPLING_RATE
- Set real sampling rate (integer in
Hz)
- SPEEX_GET_SAMPLING_RATE
- Get real sampling rate (integer in
Hz)
- SPEEX_RESET_STATE
- Reset the encoder/decoder state to its original
state (zeros all memories)
- SPEEX_SET_VAD*
- Set voice activity detection
(VAD) to on (1) or off (0) (integer)
- SPEEX_GET_VAD*
- Get voice activity detection (VAD) status
(integer)
- SPEEX_SET_DTX*
- Set discontinuous transmission
(DTX) to on (1) or off (0) (integer)
- SPEEX_GET_DTX*
- Get discontinuous transmission (DTX) status
(integer)
- SPEEX_SET_ABR*
- Set average bit-rate
(ABR) to a value n in bits per second (integer in bps)
- SPEEX_GET_ABR*
- Get average bit-rate (ABR) setting (integer
in bps)
- SPEEX_SET_PLC_TUNING*
- Tell the encoder to optimize encoding
for a certain percentage of packet loss (integer in percent)
- SPEEX_GET_PLC_TUNING*
- Get the current tuning of the encoder
for PLC (integer in percent)
- *
- applies only to the encoder
- **
- applies only to the decoder
-
- normally only used internally
Mode queries
Speex modes have a query system similar to the speex_encoder_ctl
and speex_decoder_ctl calls. Since modes are read-only, it is only
possible to get information about a particular mode. The function
used to do that is:
-
- void speex_mode_query(SpeexMode *mode, int request, void *ptr);
The admissible values for request are (unless otherwise noted, the
values are returned through ptr):
- SPEEX_MODE_FRAME_SIZE
- Get the frame size (in samples) for
the mode
- SPEEX_SUBMODE_BITRATE
- Get the bit-rate for a submode number
specified through ptr (integer in bps).
Preprocessor options
- SPEEX_PREPROCESS_SET_DENOISE
- Turns denoising on (1) or off (0)
(integer)
- SPEEX_PREPROCESS_GET_DENOISE
- Get denoising status (integer)
- SPEEX_PREPROCESS_SET_AGC
- Turns automatic gain control (AGC)
on (1) or off (0) (integer)
- SPEEX_PREPROCESS_GET_AGC
- Get AGC status (integer)
- SPEEX_PREPROCESS_SET_VAD
- Turns voice activity detector (VAD)
on (1) or off (0) (integer)
- SPEEX_PREPROCESS_GET_VAD
- Get VAD status (integer)
- SPEEX_PREPROCESS_SET_AGC_LEVEL
-
- SPEEX_PREPROCESS_GET_AGC_LEVEL
-
- SPEEX_PREPROCESS_SET_DEREVERB
- Turns reverberation removal
on (1) or off (0) (integer)
- SPEEX_PREPROCESS_GET_DEREVERB
- Get reverberation removal status
(integer)
- SPEEX_PREPROCESS_SET_DEREVERB_LEVEL
-
- SPEEX_PREPROCESS_GET_DEREVERB_LEVEL
-
- SPEEX_PREPROCESS_SET_DEREVERB_DECAY
-
- SPEEX_PREPROCESS_GET_DEREVERB_DECAY
-
- SPEEX_PREPROCESS_SET_PROB_START
-
- SPEEX_PREPROCESS_GET_PROB_START
-
- SPEEX_PREPROCESS_SET_PROB_CONTINUE
-
- SPEEX_PREPROCESS_GET_PROB_CONTINUE
-
- SPEEX_PREPROCESS_SET_NOISE_SUPPRESS
- Set maximum attenuation
of the noise in dB (negative number)
- SPEEX_PREPROCESS_GET_NOISE_SUPPRESS
- Get maximum attenuation
of the noise in dB (negative number)
- SPEEX_PREPROCESS_SET_ECHO_SUPPRESS
- Set maximum attenuation
of the residual echo in dB (negative number)
- SPEEX_PREPROCESS_GET_ECHO_SUPPRESS
- Get maximum attenuation
of the residual echo in dB (negative number)
- SPEEX_PREPROCESS_SET_ECHO_SUPPRESS_ACTIVE
- Set maximum attenuation
of the echo in dB when near end is active (negative number)
- SPEEX_PREPROCESS_GET_ECHO_SUPPRESS_ACTIVE
- Get maximum attenuation
of the echo in dB when near end is active (negative number)
- SPEEX_PREPROCESS_SET_ECHO_STATE
- Set the associated echo canceller
for residual echo suppression (NULL for no residual echo suppression)
- SPEEX_PREPROCESS_GET_ECHO_STATE
- Get the associated echo canceller
Packing and in-band signalling
Sometimes it is desirable to pack more than one frame per packet (or
other basic unit of storage). The proper way to do it is to call speex_encode
N times before writing the stream with speex_bits_write. In cases
where the number of frames is not determined by an out-of-band mechanism,
it is possible to include a terminator code. That terminator consists
of the code 15 (decimal) encoded with 5 bits, as shown in Table 4.
Note that as of version 1.0.2, calling speex_bits_write automatically
inserts the terminator so as to fill the last byte. This doesn't involve
any overhead and makes sure Speex can always detect when there are
no more frames in a packet.
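On the receiving side, detecting the terminator amounts to looking at the next 5 bits of the packet. The sketch below is a standalone illustration and does not use the SpeexBits API: the BitReader type and both functions are hypothetical, and it assumes bits are packed most-significant-bit first and that fewer than 5 remaining bits also means the packet is finished.

```c
#include <assert.h>
#include <stddef.h>

#define TERMINATOR_CODE 15   /* code 15 (decimal), encoded with 5 bits */

/* Hypothetical read cursor over a received packet. */
typedef struct {
    const unsigned char *data;
    size_t nbits;   /* total number of valid bits in data */
    size_t pos;     /* index of the next bit to read */
} BitReader;

/* Return the next n bits (MSB first) without consuming them,
 * or -1 if fewer than n bits remain. */
int peek_bits(const BitReader *r, int n)
{
    int value = 0;
    size_t i;
    assert(n > 0 && n <= 16);
    if (r->pos + (size_t)n > r->nbits)
        return -1;
    for (i = 0; i < (size_t)n; i++) {
        size_t bit = r->pos + i;
        int b = (r->data[bit >> 3] >> (7 - (bit & 7))) & 1;
        value = (value << 1) | b;
    }
    return value;
}

/* 1 if no further frame follows (terminator found or packet exhausted). */
int at_end_of_packet(const BitReader *r)
{
    int code = peek_bits(r, 5);
    return code < 0 || code == TERMINATOR_CODE;
}
```

A loop that decodes several frames per packet would check at_end_of_packet() before each frame and stop as soon as it returns 1.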
It is also possible to send in-band "messages" to the other
side. All these messages are encoded as "pseudo-frames" of mode
14 which contain a 4-bit message type code, followed by the message.
Table 1 lists the available codes,
their meaning and the size of the message that follows. Most of these
messages are requests that are sent to the encoder or decoder on the
other end, which is free to comply or ignore them. By default, all
in-band messages are ignored.
Table 1: In-band signalling codes
Code | Size (bits) | Content
0 | 1 | Asks decoder to set perceptual enhancement off (0) or on (1)
1 | 1 | Asks (if 1) the encoder to be less "aggressive" due to high packet loss
2 | 4 | Asks encoder to switch to mode N
3 | 4 | Asks encoder to switch to mode N for low-band
4 | 4 | Asks encoder to switch to mode N for high-band
5 | 4 | Asks encoder to switch to quality N for VBR
6 | 4 | Request acknowledge (0=no, 1=all, 2=only for in-band data)
7 | 4 | Asks encoder to set CBR (0), VAD (1), DTX (3), VBR (5), VBR+DTX (7)
8 | 8 | Transmit (8-bit) character to the other end
9 | 8 | Intensity stereo information
10 | 16 | Announce maximum bit-rate acceptable (N in bytes/second)
11 | 16 | reserved
12 | 32 | Acknowledge receiving packet N
13 | 32 | reserved
14 | 64 | reserved
15 | 64 | reserved
Finally, applications may define custom in-band messages using mode
13. The size of the message in bytes is encoded with 5 bits, so that
the decoder can skip it if it doesn't know how to interpret it.
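Since every in-band code has a fixed payload size, a receiver that does not understand a message can still skip over it. The sketch below encodes the sizes from Table 1 as a lookup table; the function names are hypothetical, and the bit counts exclude the mode identifier that introduces the pseudo-frame.

```c
#include <assert.h>

#define INBAND_MODE 14   /* pseudo-frame mode carrying in-band messages */
#define USER_MODE   13   /* pseudo-frame mode for custom messages */

/* Payload size in bits for each 4-bit message code (Table 1). */
static const int inband_size_bits[16] = {
    1, 1, 4, 4, 4, 4, 4, 4, 8, 8, 16, 16, 32, 32, 64, 64
};

/* Payload size in bits for a given in-band message code. */
int inband_payload_bits(int code)
{
    assert(code >= 0 && code < 16);
    return inband_size_bits[code];
}

/* Bits to skip for an in-band message: 4-bit code plus its payload. */
int inband_message_bits(int code)
{
    return 4 + inband_payload_bits(code);
}

/* Bits to skip for a custom message: 5-bit size field (in bytes)
 * plus the payload itself. */
int user_message_bits(int size_bytes)
{
    assert(size_bytes >= 0 && size_bytes < 32);
    return 5 + 8 * size_bytes;
}
```

For example, a character message (code 8) occupies 4 + 8 = 12 bits after the mode identifier, and a 3-byte custom message occupies 5 + 24 = 29 bits.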
Jean-Marc Valin
2007-05-23