Next: Problems with the Existing Up: SOUND SOURCE SEPARATION OF Previous: Introduction


The DUET system

The current system is an extension of the DUET system, so we begin with the signal model used therein. This model assumes $N$ sources in 2 channels (stereo input signals). It assumes that the left channel contains each of the $N$ sources, $S_1, S_2, \ldots, S_N$, in their ``original'' forms, and that the right channel contains delayed and scaled versions of these same signals. Naming the left channel $X_1$ and the right channel $X_2$, we may write this in the frequency domain as

\begin{eqnarray*}
X_1 &=& S_1 + S_2 + \cdots + S_N \\
X_2 &=& a_1 e^{-j\omega\delta_1}S_1 + a_2 e^{-j\omega\delta_2}S_2 + \cdots + a_N e^{-j\omega\delta_N}S_N
\end{eqnarray*}



where $a_i$ is the scale parameter and $\delta_i$ is the delay parameter for source $i$, each describing the path from the left channel to the right. The term ``delay'' suggests that the signal arrives in the left channel before the right. In fact, we allow this parameter to be negative, in which case the source signal arrives in the right channel before the left. Similarly, a scale parameter may be greater than 1, meaning that the corresponding source is louder in the right channel than in the left. We refer to $a_i$ and $\delta_i$ together as the mixing parameters for source $i$.

To proceed to sound source separation, the DUET authors rely on an assumption they call W-disjoint orthogonality: at every point in time-frequency space, no more than one source has nonzero energy. In practical terms, this means that in a conventional frame-by-frame analysis system, each bin in any given frame corresponds to no more than one source $S_i$. The authors claim [3] that this assumption holds approximately for mixtures of speech. Given this model and assumption, the DUET system estimates the scale and delay parameters for each frequency bin $\omega_k$ in each frame $\tau$ via
$\displaystyle (a_i,\delta_i)=\left(\frac{\vert X_2(\omega_k,\tau)\vert}{\vert X_1(\omega_k,\tau)\vert},\ \mathrm{Im}\left\{\log\left(\frac{X_1(\omega_k,\tau)}{X_2(\omega_k,\tau)}\right)\right\}/\omega_k\right).$     (1)
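
To see why equation 1 recovers the mixing parameters, consider a bin in which only source $i$ is active. The signal model then gives $X_2/X_1 = a_i e^{-j\omega_k\delta_i}$, so the magnitude ratio yields $a_i$, and the phase of $X_1/X_2$ divided by $\omega_k$ yields $\delta_i$. The following Python sketch checks this numerically for a single bin. The function name `estimate_mixing_params` and the numerical values are illustrative assumptions and are not taken from the DUET system itself. Note that the phase is wrapped, so the delay is only recovered unambiguously when $\vert\omega_k\delta_i\vert < \pi$.

```python
import cmath

def estimate_mixing_params(x1, x2, omega):
    """Per-bin estimate of (a, delta), following equation 1.

    x1, x2: complex spectral values of the left and right channels in one bin.
    omega:  angular frequency of the bin (radians per sample), must be nonzero.
    """
    a = abs(x2) / abs(x1)
    # Im{log(X1/X2)} is the phase of X1/X2, wrapped to (-pi, pi].
    delta = cmath.log(x1 / x2).imag / omega
    return a, delta

# Hypothetical single-source bin: the right channel is a scaled,
# delayed copy of the left channel (values chosen for illustration).
a_true, delta_true, omega = 0.8, 1.5, 0.6
s = complex(0.3, -1.1)                      # arbitrary source value in this bin
x1 = s
x2 = a_true * cmath.exp(-1j * omega * delta_true) * s

a_est, delta_est = estimate_mixing_params(x1, x2, omega)
print(a_est, delta_est)
```

A negative `delta_true` works the same way, which corresponds to the source arriving in the right channel first.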

Having done this for the non-redundant frequencies in each of $L$ frames, each of length $N_{fft}$, we have $L \cdot N_{fft}/2$ pairs of mixing parameter estimates. Because W-disjoint orthogonality holds approximately, we may assume that most of the estimates correspond to exactly one source, though we do not know which one. To determine this, the DUET system creates a two-dimensional histogram in normalized $(a,\delta)$ space and analyzes it for peaks. If there are $N$ sources, we expect to see $N$ histogram peaks, since the parameter estimates should cluster around the true mixing parameter values. A variety of histogram bin sizes, smoothing windows, and FFT bin weighting schemes may be used to make the histograms more indicative of the actual parameters [3,1]. By picking peaks in the two-dimensional histogram, the DUET system determines both the number of sources $N$ and their corresponding mixing parameters.

Once this has been done, the system makes a second pass through the per-bin mixing parameter estimates obtained via equation 1 and assigns each bin in each frame to the source whose histogram-peak mixing parameters are nearest to that bin's estimate. The DUET system then reconstructs estimates of the original sources by IFFT overlap-add synthesis.
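
The histogram and assignment steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DUET implementation: the histogram is unweighted and unsmoothed, a peak is taken to be any cell that strictly exceeds its eight neighbors and a hypothetical `min_count` threshold, and the nearest-neighbor step uses plain Euclidean distance on the raw $(a,\delta)$ values rather than any particular normalization. The function names, cell widths, and sample estimates are invented for the example, and the IFFT overlap-add resynthesis is omitted.

```python
from collections import Counter

def histogram_peaks(estimates, a_width, d_width, min_count):
    """Grid the (a, delta) estimates and return centers of local-maximum cells."""
    counts = Counter(
        (round(a / a_width), round(d / d_width)) for a, d in estimates
    )
    peaks = []
    for (i, j), c in counts.items():
        neighbours = [counts.get((i + di, j + dj), 0)
                      for di in (-1, 0, 1) for dj in (-1, 0, 1)
                      if (di, dj) != (0, 0)]
        # Assumed peak rule: enough votes and strictly above all neighbors.
        if c >= min_count and all(c > n for n in neighbours):
            peaks.append((i * a_width, j * d_width))
    return sorted(peaks)

def assign_to_sources(estimates, peaks):
    """Label each per-bin estimate with the index of the nearest peak."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return [min(range(len(peaks)), key=lambda k: dist2(e, peaks[k]))
            for e in estimates]

# Hypothetical per-bin estimates: two clusters plus one stray bin.
estimates = [
    (0.50, -1.00), (0.52, -1.02), (0.48, -0.98), (0.51, -0.99),
    (1.20, 2.00), (1.21, 2.03), (1.19, 1.98),
    (0.90, 0.40),
]
peaks = histogram_peaks(estimates, a_width=0.1, d_width=0.2, min_count=2)
labels = assign_to_sources(estimates, peaks)
print(len(peaks), labels)
```

In this sketch the stray bin does not form a peak of its own but is still assigned to the nearest source, which mirrors the nearest-neighbor assignment described above. The number of peaks found plays the role of the estimated source count $N$.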

Aaron S. Master 2003-03-27