The DUET system
The current system is an extension of the DUET system. To begin,
we consider the signal model used therein. This signal model
assumes $N$ sources in 2 channels (stereo input signals). It
assumes that the left channel contains each of the $N$ sources,
$s_j$, $j = 1, \ldots, N$, in their ``original'' forms, and that the
right channel contains delayed and scaled versions of these same
signals. Naming the left channel $x_1$ and the right channel $x_2$,
we may write this in the frequency domain as

$$X_1(\omega) = \sum_{j=1}^{N} S_j(\omega), \qquad X_2(\omega) = \sum_{j=1}^{N} a_j \, e^{-i\omega\delta_j} \, S_j(\omega),$$

where $a_j$ represents the scale parameter and $\delta_j$ represents
the delay parameter of source $j$, each measured from the left
channel to the right. We note that the term ``delay'' suggests
that the signal arrives in the left channel before the right. In
fact, we allow this parameter to be negative, in which case the
source signal arrives in the right channel before the left.
Similarly, the scale parameters may be greater than 1, implying
that the corresponding source signal is louder in the right
channel than in the left. We refer to $a_j$ and $\delta_j$ together
as the mixing parameters for a given source $j$.
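To make the mixing model concrete, the sketch below evaluates it at a single frequency. It is a minimal illustration, not part of DUET itself: the function name `mix_bin` is hypothetical, and we assume delays are measured in samples with $\omega$ in radians per sample.

```python
import cmath
from typing import Sequence, Tuple


def mix_bin(sources: Sequence[complex], scales: Sequence[float],
            delays: Sequence[float], omega: float) -> Tuple[complex, complex]:
    """Return (X1, X2) at one frequency under the DUET mixing model.

    sources: source spectra S_j(omega) at this frequency
    scales:  scale parameters a_j (left to right)
    delays:  delay parameters delta_j in samples (negative means the
             source reaches the right channel first)
    omega:   angular frequency in radians per sample (assumed convention)
    """
    # Left channel: every source in its original form.
    x1 = sum(sources, 0j)
    # Right channel: each source scaled by a_j and delayed by delta_j, which
    # in the frequency domain is multiplication by exp(-i * omega * delta_j).
    x2 = sum((a * cmath.exp(-1j * omega * d) * s
              for s, a, d in zip(sources, scales, delays)), 0j)
    return x1, x2


# Two sources: one quieter and later on the right, one louder and earlier.
x1, x2 = mix_bin([1 + 0j, 0.5j], scales=[0.5, 1.8], delays=[2.0, -1.0], omega=0.3)
```

Note that the left channel is simply the sum of the sources; all of the spatial information lives in the per-source factor $a_j e^{-i\omega\delta_j}$ applied to the right channel.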
To proceed to sound source separation, the DUET authors rely on an
assumption they refer to as W-disjoint orthogonality. This states
that at every point in time-frequency space, no more than one
source has positive energy. In practical terms, this means that
in a conventional frame-by-frame analysis system, each bin $k$ in
any given frame $m$ corresponds to no more than one source $j$,
that is, $S_j(m,k)\,S_l(m,k) = 0$ for all $j \neq l$. The authors
claim [3] that this assumption approximately holds for mixtures of
speech.
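Because the assumption is only approximate in practice, it can help to measure how closely a given set of source spectrograms satisfies it. The sketch below counts the fraction of time-frequency points at which at most one source is active. The function `wdo_fraction` and its energy threshold `tol` are hypothetical illustrations; the strict model corresponds to `tol = 0`.

```python
from typing import Sequence

# A spectrogram is indexed as spec[m][k]: frame m, frequency bin k.
Spectrogram = Sequence[Sequence[complex]]


def wdo_fraction(spectrograms: Sequence[Spectrogram], tol: float = 1e-12) -> float:
    """Fraction of time-frequency points where at most one source has energy.

    spectrograms: one STFT per source, all with the same shape.
    tol: energy threshold below which a bin counts as silent (hypothetical
         knob; the strict model uses tol = 0).
    Returns 1.0 when the sources are exactly W-disjoint orthogonal.
    """
    n_frames = len(spectrograms[0])
    n_bins = len(spectrograms[0][0])
    disjoint = 0
    for m in range(n_frames):
        for k in range(n_bins):
            # Count the sources whose energy |S_j(m, k)|^2 exceeds tol.
            active = sum(1 for spec in spectrograms if abs(spec[m][k]) ** 2 > tol)
            if active <= 1:
                disjoint += 1
    return disjoint / (n_frames * n_bins)


s1 = [[1, 0], [0, 0.5j]]
s2 = [[0, 2], [0, 0]]
s3 = [[1, 1], [0, 0]]
print(wdo_fraction([s1, s2]))  # → 1.0 (exactly disjoint)
print(wdo_fraction([s1, s3]))  # → 0.75 (both active at frame 0, bin 0)
```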
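Given this model and assumption, the DUET system estimates the
delay and scale parameters for each frequency bin $k$ in each
frame $m$ via:

$$\hat{a}(m,k) = \left| \frac{X_2(m,k)}{X_1(m,k)} \right|, \qquad \hat{\delta}(m,k) = -\frac{1}{\omega_k} \, \angle \frac{X_2(m,k)}{X_1(m,k)} \qquad (1)$$

where $X_1(m,k)$ and $X_2(m,k)$ are the short-time spectra of the
two channels, $\omega_k$ is the frequency of bin $k$, and $\angle$
denotes the phase.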
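Equation (1) can be sketched for a single bin as follows. This is a minimal illustration under stated assumptions: the name `estimate_params` is hypothetical, and we take $\omega_k$ in radians per sample (for example $\omega_k = 2\pi k / L$ for $L$-point frames), so the delay estimate comes out in samples.

```python
import cmath
from typing import Tuple


def estimate_params(x1: complex, x2: complex, omega: float) -> Tuple[float, float]:
    """Estimate (a_hat, delta_hat) for one time-frequency bin via equation (1).

    omega is the bin's angular frequency in radians per sample (assumed), so
    delta_hat is in samples. It must be nonzero, since the DC bin carries no
    delay information. The phase is only known modulo 2*pi, so delta_hat is
    unambiguous only when |omega * delta| < pi.
    """
    if x1 == 0 or omega == 0:
        raise ValueError("need nonzero X1 and a nonzero frequency")
    ratio = x2 / x1
    a_hat = abs(ratio)                       # scale: magnitude of X2 / X1
    delta_hat = -cmath.phase(ratio) / omega  # delay: -(1/omega) * angle(X2 / X1)
    return a_hat, delta_hat


# A bin containing only one source, with a = 0.7 and delta = 1.5 samples.
L, k = 256, 8
omega_k = 2 * cmath.pi * k / L
s = 0.3 - 1.2j
print(estimate_params(s, 0.7 * cmath.exp(-1j * omega_k * 1.5) * s, omega_k))
# approximately (0.7, 1.5)
```

Under W-disjoint orthogonality the unknown source spectrum cancels in the ratio $X_2/X_1$, which is why a single bin suffices to recover that source's mixing parameters.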
Having done this for the non-redundant frequencies in each of $M$
frames, each of length $L$, we now have $M(L/2 + 1)$ pairs of
mixing parameter estimates. Due to approximate W-disjoint
orthogonality, we may assume that most of the estimates correspond
to exactly one source, though we do not know which source. To
determine this, the DUET system creates a two-dimensional histogram
in normalized $(a, \delta)$ space and analyzes it for peaks. If
there are $N$ sources, we expect to see $N$ histogram peaks, since
the parameter estimates should cluster around the true mixing
parameter values. A variety of histogram bin sizes, smoothing
windows, and FFT bin weighting schemes may be used to make the
histograms more indicative of the actual parameters [3,1].
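The histogram step can be sketched as below. The function `param_histogram`, its default cell widths, and the use of a caller-supplied weight per bin are illustrative assumptions; one possible weighting is a power such as $|X_1 X_2|$, so that loud bins count more than near-silent ones. Smoothing and any normalization of the axes are omitted for brevity.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def param_histogram(estimates: Iterable[Tuple[float, float, float]],
                    a_step: float = 0.1, d_step: float = 0.5) -> Counter:
    """Accumulate a weighted 2-D histogram over (a, delta).

    estimates: (a_hat, delta_hat, weight) triples, one per time-frequency bin.
    a_step, d_step: cell widths along the scale and delay axes (hypothetical
        defaults; in practice they are tuned to the data).
    Returns a Counter mapping integer cell indices (i, j) to summed weight.
    """
    hist = Counter()
    for a, d, w in estimates:
        # Quantize each estimate to the cell that contains it.
        cell = (math.floor(a / a_step), math.floor(d / d_step))
        hist[cell] += w
    return hist


hist = param_histogram([(0.52, 1.1, 1.0), (0.55, 1.3, 2.0), (1.21, -0.8, 1.0)])
# Cell (5, 2) collects weight 3.0 from the first two estimates;
# cell (12, -2) collects weight 1.0 from the third.
```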
By picking peaks in the two-dimensional histogram, then, the DUET
system determines both the number of sources $N$ and their
corresponding mixing parameters. Once this has been done, the
system revisits the mixing parameter estimates obtained via
equation (1) and assigns each bin in each frame to the source
whose histogram-derived mixing parameters are the nearest neighbor
of that bin's estimates. By IFFT overlap-add synthesis of the bins
assigned to each source, the DUET system then reconstructs
estimates of the original sources.
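Peak picking, nearest-neighbor assignment, and masking can be sketched as follows. All three functions are hypothetical illustrations: peaks are taken as strict local maxima over the eight neighboring cells above a weight threshold, distances are plain Euclidean in raw $(a, \delta)$ units (in practice the axes may need rescaling, which is what a normalized space addresses), and we assume the masks are applied to the left channel, which under the model holds the sources in their original forms. The final IFFT overlap-add step is not shown.

```python
import math
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]        # histogram cell index (i, j)
Params = Tuple[float, float]  # mixing parameters (a, delta)


def find_peaks(hist: Dict[Cell, float], min_weight: float,
               a_step: float, d_step: float) -> List[Params]:
    """Return the (a, delta) cell centers of strict local maxima.

    A cell is a peak if its weight is at least min_weight and exceeds all
    eight neighbors; the number of peaks is the estimated source count N.
    """
    peaks = []
    for (i, j), w in hist.items():
        if w < min_weight:
            continue
        neighbors = (hist.get((i + di, j + dj), 0.0)
                     for di in (-1, 0, 1) for dj in (-1, 0, 1)
                     if (di, dj) != (0, 0))
        if all(w > n for n in neighbors):
            peaks.append(((i + 0.5) * a_step, (j + 0.5) * d_step))
    return peaks


def assign_bins(estimates: Sequence[Params], peaks: Sequence[Params]) -> List[int]:
    """Label each bin with the index of the nearest peak (Euclidean distance)."""
    return [min(range(len(peaks)), key=lambda j: math.dist(e, peaks[j]))
            for e in estimates]


def masked_spectra(x1_bins: Sequence[complex], labels: Sequence[int],
                   n_sources: int) -> List[List[complex]]:
    """Binary-mask the left channel: source j keeps only the bins labeled j."""
    return [[x if lab == j else 0j for x, lab in zip(x1_bins, labels)]
            for j in range(n_sources)]


hist = {(5, 2): 3.0, (5, 3): 1.0, (12, -2): 2.0, (20, 0): 0.1}
peaks = find_peaks(hist, min_weight=0.5, a_step=0.1, d_step=0.5)
labels = assign_bins([(0.5, 1.2), (1.3, -0.7), (0.6, 1.0)], peaks)
print(labels)  # → [0, 1, 0]
sources = masked_spectra([1 + 0j, 2 + 0j, 3 + 0j], labels, len(peaks))
# Source 0 keeps bins 0 and 2; source 1 keeps bin 1.
```

Each bin is given wholly to one source, a binary mask that mirrors the W-disjoint orthogonality assumption: if at most one source is active per bin, nothing is lost by doing so.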
Aaron S. Master
2003-03-27