Audio Source Separation with Discriminative Scattering Networks
Pablo Sprechmann$^1$, Joan Bruna$^2$, Yann LeCun$^{1,2}$, $^1$ NYU, Courant Institute of Mathematical Sciences, $^2$ Facebook AI Research.
Abstract
In this report we describe an ongoing line of research on solving single-channel source separation problems. Many monaural signal decomposition techniques proposed in the literature operate on a feature space consisting of a time-frequency representation of the input data. A challenge faced by these approaches is to effectively exploit the temporal dependencies of the signals at scales larger than the duration of a time frame. In this work we propose to tackle this problem by modeling the signals using a time-frequency representation with multiple temporal resolutions. The proposed representation consists of a pyramid of wavelet scattering operators, which generalizes Constant Q Transforms (CQT) with extra layers of convolution and complex modulus. We first show that learning standard models in this multi-resolution setting improves source separation results over fixed-resolution methods. As a study case, we use Non-negative Matrix Factorization (NMF), which has been widely considered in many audio applications. Then, we investigate the inclusion of the proposed multi-resolution setting into a discriminative training regime. We discuss several alternatives using different deep neural network architectures.
Introduction
Monaural source separation is a fundamental inverse problem in speech processing (Loizou (2007); Hänsler & Schmidt (2008)). Successful algorithms rely on models that capture signal regularity while preserving discrimination between different speakers. The decomposition of time-frequency representations, such as the power or magnitude spectrogram, in terms of elementary atoms of a dictionary has become a popular tool in audio processing. Non-negative matrix factorization (NMF) (Lee & Seung (1999)) has been widely adopted in various audio processing tasks, including in particular source separation; see Smaragdis et al. (2014) for a recent review. Many works follow this line in speech separation (Schmidt & Olsson (2006); Shashanka et al. (2007)) and enhancement (Duan et al. (2012); Mohammadiha et al. (2013)).
Although NMF applied to spectral features is highly efficient, it fails to model the long-range geometrical features that characterize speech signals. Increasing the temporal window is not the solution, since it significantly increases the dimensionality of the problem and reduces the discriminative power of the model. To overcome this limitation, many works have proposed regularized extensions of NMF that promote learned structure in the codes. Examples of these approaches include temporal smoothness of the activation coefficients (Févotte (2011)), co-occurrence statistics of the basis functions (Wilson et al. (2008)), learned temporal dynamics with Kalman-filtering-like techniques (Mysore & Smaragdis (2011); Han et al. (2012); Févotte et al. (2013)), and the integration of Recurrent Neural Networks (RNNs) into the NMF framework (Boulanger-Lewandowski et al. (2014)).
More recently, several works have observed that the efficiency of these methods can be improved with discriminative training. Discriminatively trained dictionary learning techniques (Mairal et al. (2012); Gregor & LeCun (2010); Sprechmann et al. (2014); Weninger et al. (2014a)) show the importance of adapting the model to be discriminative for the inverse problem at hand.
A number of works completely bypass the modeling aspect and approach inverse problems as non-linear regression problems using Deep Neural Networks (DNNs) (Sprechmann et al. (2013); Schuler et al. (2014); Dong et al. (2014)), with different levels of structure ranging from simple frame-by-frame regressors to more sophisticated RNNs. Applications include source separation in music (Sprechmann et al. (2012); Huang et al. (2014b)), speech separation (Huang et al. (2014a)), and speech enhancement (Weninger et al. (2014b)).
The goal of this work is to show that using a stable and robust multi-resolution representation of the data can benefit source separation algorithms in both discriminative and non-discriminative settings. Previous works have shown that the choice of input features plays a very important role in source separation (Weninger et al. (2014b)) and speech recognition (Mohamed et al. (2012)). This work takes this observation a step further, to the multi-resolution setting.
We consider a deep representation based on the wavelet scattering pyramid, which produces information at different temporal resolutions and defines a metric that is increasingly contracting. This representation can be thought of as a generalization of the CQT. Discriminative features with longer temporal context can be constructed with the scattering transform (Bruna & Mallat (2013b)) and have been successfully applied to audio signals by Andén & Mallat (2013). While these features have shown excellent performance in various classification tasks, in the context of source separation we require a representation that not only captures long-range temporal structure, but also preserves as much temporal discriminability as possible.
For the non-discriminative setting, we present an extension of the NMF framework to the pyramid representation. We learn NMF models at different levels of the hierarchy. While NMF dictionaries at the first level are very selective to temporally localized energy patterns, deeper layers provide additional modeling of the longer temporal dynamics (Bruna et al. (2014)). For the discriminative setting we discuss a number of baseline models based on neural networks. As a proof of concept, we evaluate both settings on a multi-speaker speech separation task. We observe that in both training regimes the multi-resolution setting leads to better performance with respect to the baselines. We also confirm with experiments the superiority of discriminative approaches.
The paper is organized as follows. In Section 2 we describe the general setting of source separation and review some baseline solutions for both training regimes. We present the proposed representation in Section 3 and show how it can be used in the context of source separation in Section 4. We present some initial experimental results in Section 5, and a discussion is given in Section 6.
Single-channel source separation
In this work we are interested in the family of algorithms that solve source separation in a feature space. This section describes different alternatives that fall in this category. We first introduce the general setting in Section 2.1. In Section 2.2 we describe the popular NMF framework and the different training regimes employed with it. Finally, we discuss purely discriminative approaches based on deep networks in Section 2.3.
Problem formulation
We consider the setting in which we observe a temporal signal y(t) that is the sum of two sources x_i(t), with i = 1, 2,
$$
y(t) = x_1(t) + x_2(t), \tag{1}
$$
and we aim at finding estimates x̂_i(t). We consider the supervised monaural source separation problem, in which the components x_i, i = 1, 2, come from sources for which we have representative training data. In this report we concentrate on the case of speech signals, but other alternatives, such as noise or music, could be considered.
Most recent techniques typically operate on a non-negative time-frequency representation. Let us denote by Φ(y) ∈ R^{m×n} the transformed version of y(t), comprising m frequency bins and n temporal frames. This transform can be thought of as a non-linear analysis operator and is typically defined as the magnitude (or power) of a time-frequency representation such as the Short-Time Fourier Transform (STFT). Other robust alternatives have also been explored (Huang et al. (2014a); Weninger et al. (2014b)). In all cases, the temporal resolution of the features is fixed and given by the frame duration.
Performing the separation in the non-linear representation is key to the success of these algorithms. The transformed domain is in general invariant to some irrelevant variability of the signals (such as local shifts), thus relieving the algorithms from having to learn it. This comes at the expense of inverting the unmixed estimates from the feature space, normally known as the phase recovery problem (Gerchberg & Saxton (1972)). Specifically, these algorithms take Φ(y) as input and produce estimates for each source, Φ(x̂_i) with i = 1, 2. The phase recovery problem corresponds to finding signals x̂'_i matching the obtained features Φ(x̂_i) and satisfying y = x̂'_1 + x̂'_2.
The most common choice is to use the magnitude (or power) STFT as the feature space. In this case, the phase recovery problem can be solved very efficiently using soft masks to filter the mixture signal (Schmidt et al. (2007)). This strategy resembles Wiener filtering and has demonstrated very good results in practice. Specifically, Φ(y) = |S{y}|, where S{y} ∈ C^{m×n} is the complex matrix corresponding to the STFT. The estimated unmixed signals are obtained by filtering the mixture,
$$
\hat{x}_i = \mathcal{S}^{-1}\left\{ M_i \circ \mathcal{S}\{y\} \right\}, \quad \text{with} \quad M_i = \frac{\Phi(\hat{x}_i)^p}{\sum_{l=1,2} \Phi(\hat{x}_l)^p}, \tag{2}
$$
where the multiplication (denoted ◦), division, and exponentiation are element-wise operations. The parameter p controls the smoothness of the mask; we use p = 2 in our experiments. Note that this solution automatically imposes the consistency constraint y = x̂'_1 + x̂'_2.
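The masking scheme above can be sketched in a few lines of numpy; the function name and toy magnitude estimates below are ours, and the sketch assumes the magnitude estimates Φ(x̂_i) have already been produced by some separation model:

```python
import numpy as np

def soft_mask_separation(S_mix, Phi_hat, p=2, eps=1e-12):
    """Wiener-like filtering of a mixture STFT: each source gets the mask
    M_i = Phi_i^p / sum_l Phi_l^p, applied element-wise to S_mix."""
    powers = [Phi ** p for Phi in Phi_hat]
    denom = sum(powers) + eps            # eps guards empty time-frequency bins
    return [S_mix * (P / denom) for P in powers]

# Toy check: the masks sum to one, so the estimates add back up to the mixture.
rng = np.random.default_rng(0)
S = rng.standard_normal((513, 100)) + 1j * rng.standard_normal((513, 100))
A, B = np.abs(rng.standard_normal((2, 513, 100)))
X1, X2 = soft_mask_separation(S, [A, B])
assert np.allclose(X1 + X2, S, atol=1e-4)
```

Inverting the masked STFTs with the usual overlap-add inverse transform then yields the time-domain estimates.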
Non-negative matrix factorization
Source separation methods based on matrix factorization have received a lot of attention in the literature in recent years. NMF-based source separation techniques attempt to find the non-negative activations Z_i ∈ R^{q×n}, i = 1, 2, that best represent the different speech components in two dictionaries D_i ∈ R^{m×q}. Ideally one would want to solve the problem,
$$
\min_{x'_i,\, Z_i \geq 0} \; \sum_{i=1,2} \mathcal{D}\big( \Phi(x'_i) \,|\, D_i Z_i \big) + \lambda\, \mathcal{R}(Z_i) \quad \text{s.t.} \quad y = x'_1 + x'_2, \tag{3}
$$
where the first term in the optimization objective measures the dissimilarity between the input data and the estimated channels in the feature space. Common choices of D are the squared Euclidean distance, the Kullback-Leibler divergence, and the Itakura-Saito divergence. The second term in the minimization objective is included to promote some desired structure in the activations. This is done using a designed regularization function R, whose relative importance is controlled by the parameter λ. In this work we use the reweighted squared Euclidean distance as D and the ℓ1 norm as the regularization function R.
Problem (3) could be minimized with alternating gradient descent between x'_i and Z_i. Note that fixing Z_i and minimizing with respect to x'_i requires locally inverting the transform Φ, which amounts to solving an overcomplete phase recovery problem. In practice, a greedy proxy of (3) is solved instead. First, a separation is obtained in the feature space by solving a classic NMF problem,
$$
\min_{Z_1, Z_2 \geq 0} \; \mathcal{D}\big( \Phi(y) \,|\, D_1 Z_1 + D_2 Z_2 \big) + \lambda \sum_{i=1,2} \mathcal{R}(Z_i), \tag{4}
$$
for which standard optimization algorithms exist; see for example Févotte & Idier (2011). Once the optimal activations are found, the spectral envelopes of the speech are estimated as Φ(x̂_i) = D_i Z_i, and the phase recovery is solved using (2).
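As an illustration, inference of the activations in this NMF step can be done with standard multiplicative updates; the sketch below uses the plain (unweighted) squared Euclidean fit with an ℓ1 penalty, and the function name is hypothetical:

```python
import numpy as np

def nmf_activations(V, D, lam=0.1, n_iter=200, eps=1e-9):
    """Minimize ||V - D @ Z||_F^2 + lam * ||Z||_1 over Z >= 0,
    with the dictionary D held fixed, via multiplicative updates."""
    rng = np.random.default_rng(0)
    Z = rng.random((D.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # classic multiplicative rule; lam in the denominator promotes sparsity
        Z *= (D.T @ V) / (D.T @ (D @ Z) + lam + eps)
    return Z
```

Running the updates on the concatenated dictionary [D_1, D_2] and splitting the resulting activations gives the per-source spectral estimates D_i Z_i.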
In this supervised setting, the dictionaries are obtained from training data. The classic approach is to build a model for each source independently and later use them together at testing time. Many works have observed that sparse coding inference algorithms can be improved in specific tasks by using discriminative training, i.e., by directly optimizing the parameters of the model on the evaluation cost function. Task-aware (or discriminative) sparse modeling is elegantly described by Mairal et al. (2012), who observe that one can back-propagate through the Lasso. These ideas have been used in the context of source separation and enhancement (Sprechmann et al. (2014); Weninger et al. (2014a)). The goal is to obtain dictionaries such that the solution of (4) also minimizes the reconstruction error given the ground truth separation,
$$
\min_{D_1, D_2} \; \mathcal{D}\big( \Phi(x_1) \,|\, D_1 Z^*_1 \big) + \alpha\, \mathcal{D}\big( \Phi(x_2) \,|\, D_2 Z^*_2 \big), \tag{5}
$$
where Z*_i are the solutions of (4) (and depend on the dictionaries) and α is a parameter controlling the relative importance of source recovery; typically, one would set α = 0 in a denoising application (where the second signal is noise), and α = 1 in a source separation application where both signals need to be recovered. When the phase recovery can be obtained using the masking scheme described in Section 2.1, it can be included in the objective in order to directly optimize the signal reconstruction in the time domain. While the discriminative setting is a better target, the estimation needs to be computed over the product set rather than over each training set independently, and generalization might be compromised when only small training sets are available. It is important to note that the level of supervision is very mild, as in the training of autoencoders: we artificially generate the mixtures, and consequently obtain the ground truth for free.
Standard NMF approaches treat different time frames independently, ignoring the temporal dynamics of the signals. As described in Section 1, many works modify the regularization function R in order to integrate several frames into the decomposition. Their analysis and description is outside the scope of this report.
Purely discriminative settings
With the mindset of discriminative learning, one is tempted to simply replace the inference step by a generic neural network architecture with enough capacity to perform non-linear regression. The systems are trained to minimize a measure of fitness between the ground truth separation and the output, as in (5), the most common being the Mean Squared Error (MSE). Note that this can be performed in the feature space or in the time domain (when the phase recovery is simple). Other alternatives studied in the literature consist of predicting the masks given in (2), as described by Huang et al. (2014a).
The most straightforward choice is to perform the estimation using a DNN at a fixed time scale. Using a short temporal context fails to model long-range temporal dependencies in the speech signals, while increasing the context renders the regression problem intractable. One could train a DNN on a set of several frames (Weninger et al. (2014b)). Recent works have explored neural network architectures that exploit temporal context, such as RNNs and Long Short-Term Memory (LSTM) networks (Huang et al. (2014a); Weninger et al. (2014b)).
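For the multi-frame variant, the per-frame inputs are simply concatenations of neighboring feature columns; a small sketch of this stacking (the function name is our own):

```python
import numpy as np

def frame_context(F, c=2):
    """Stack each column of an (m, n) feature matrix F with its c left and
    c right neighbors (edge-padded), giving (2c+1)*m inputs per frame."""
    m, n = F.shape
    Fp = np.pad(F, ((0, 0), (c, c)), mode="edge")
    return np.concatenate([Fp[:, i:i + n] for i in range(2 * c + 1)], axis=0)

# With c = 2 each frame sees a 5-frame context, as in a multi-frame DNN baseline.
```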
Pyramid Wavelet Scattering
In this section we briefly present the proposed wavelet scattering pyramid, which is conceptually similar to the standard scattering networks introduced by Mallat (2010), but creates features at different temporal resolutions at every layer.
Wavelet Filter Bank
A wavelet ψ(t) is a band-pass filter with good frequency and spatial localization. We consider a complex wavelet with quadrature phase, whose Fourier transform satisfies Fψ(ω) ≈ 0 for ω < 0. We assume that the center frequency of Fψ is 1 and that its bandwidth is of the order of Q^{-1}. Wavelet filters centered at the frequencies λ = 2^{j/Q} are computed by dilating ψ: ψ_λ(t) = λψ(λt), and hence Fψ_λ(ω) = Fψ(λ^{-1}ω). We denote by Λ the index set of λ = 2^{j/Q} over the signal frequency support, with j ≤ J_1. The resulting filter bank has a constant number Q of bands per octave and J_1 octaves. Let us define φ_1(t) as a low-pass filter with bandwidth 2^{-J_1}. The wavelet transform of a signal x(t) is
$$
W x = \left\{ x \ast \phi_1(t)\,,\; x \ast \psi_\lambda(t) \right\}_{\lambda \in \Lambda}.
$$
Since the bandwidth of all the filters is at most Q^{-1}, we can down-sample their outputs with a stride Q.
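A minimal frequency-domain sketch of such a constant-Q bank, using Gaussian (Morlet-like) transfer functions; the top center frequency f_max and the function name are our own choices:

```python
import numpy as np

def constant_q_filterbank(N, Q=8, J=4, f_max=0.4):
    """Return J*Q band-pass transfer functions of length N with center
    frequencies spaced geometrically (Q per octave over J octaves) and
    bandwidth proportional to the center frequency (constant Q)."""
    omega = np.fft.fftfreq(N)                  # normalized frequency grid
    filters = []
    for j in range(J * Q):
        fc = f_max * 2.0 ** (-j / Q)           # geometric center frequencies
        bw = fc / Q                            # constant-Q bandwidth
        filters.append(np.exp(-0.5 * ((omega - fc) / bw) ** 2))
    return np.array(filters)
```

Filtering is then a product in the Fourier domain followed by an inverse FFT; since each band has bandwidth proportional to its center frequency, its output can be down-sampled accordingly.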
Pyramid Scattering Transform
Instead of using a fixed-bandwidth smoothing kernel applied at all layers, we sample at the critical rate in order to preserve temporal locality as much as possible. We start by removing the complex phase of the wavelet coefficients in Wx with a complex modulus non-linearity. Then, we arrange these first-layer coefficients as nodes in the first level of a tree. Each node of this tree is down-sampled at the critical sampling rate Δ_1 of the layer, given by the reciprocal of the largest bandwidth present in the filter bank:
$$
x_1(t, \lambda) = \left| x \ast \psi_\lambda \right| (\Delta_1 t), \quad \lambda \in \Lambda.
$$
These first-layer coefficients give localized information both in time and frequency, with a trade-off dictated by the Q factor. They are however sensitive to local time-frequency warps, which are often uninformative. In order to increase the robustness of the representation, we transform each of the down-sampled signals with a new wavelet filter bank and take the complex modulus of the oscillatory component. For simplicity, we assume a dyadic transformation, which reduces the filter bank to a pair of conjugate mirror filters {φ_2, ψ_2} (Mallat (1999)), carrying respectively the low frequencies and high frequencies of the discrete signal from the level above in the tree:
$$
x_{j+1} = \left\{ x_j \ast \phi_2(2t)\,,\; \left| x_j \ast \psi_2 \right|(2t) \right\}.
$$
Every layer thus produces new feature maps at a lower temporal resolution. As shown in Bruna & Mallat (2013b), only coefficients having gone through m ≤ m_max non-linearities are computed in practice, since their energy decays quickly. We fix m_max = 2 in our experiments.
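A toy version of one dyadic pyramid step, using the Haar pair as the conjugate-mirror filters {φ_2, ψ_2} and an absolute value as the modulus; the function names are ours, and actual layers would use longer filters:

```python
import numpy as np

def pyramid_step(u):
    """One dyadic layer: Haar low-pass and |high-pass| branches,
    both critically down-sampled by 2."""
    u = u[: 2 * (len(u) // 2)]                      # force even length
    low = (u[0::2] + u[1::2]) / np.sqrt(2)          # phi_2 branch
    high = np.abs(u[0::2] - u[1::2]) / np.sqrt(2)   # |psi_2| branch
    return low, high

def scattering_tree(u, depth):
    """Expand every node with pyramid_step `depth` times and return the
    2**depth leaf signals, each at 1/2**depth of the input resolution."""
    nodes = [u]
    for _ in range(depth):
        nodes = [branch for n in nodes for branch in pyramid_step(n)]
    return nodes
```

On a constant signal the detail branch vanishes, illustrating how the modulus tree concentrates energy in a few low-frequency paths.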
We can reapply the same operator k times, until reaching a temporal context T = 2^k Δ_1. If the wavelet filters are chosen such that they define a non-expansive mapping (Bruna & Mallat (2013b)), then every layer defines a metric that is increasingly contracting:
$$
\| \Phi_{j+1}(x) - \Phi_{j+1}(x') \| \leq \| \Phi_j(x) - \Phi_j(x') \| \leq \| x - x' \|.
$$
In the end we obtain a tree of different representations, Φ_j(x) = |W_j| x with j = 1, ..., k.
Source Separation Algorithms
In this section we show a few examples of how the proposed pyramid scattering features could be used for solving the source separation problem. We present alternatives for both learning paradigms: non-discriminative and discriminative.
Non-Discriminative Training
In this setting, we try to find models for each speaker using the features of the wavelet scattering pyramid. Each layer of the transform produces information with a different stability/discriminability trade-off. Whereas in typical classification applications one is mostly interested in choosing the single layer that provides the best trade-off given the intrinsic variability of the dataset, in inverse problems we can leverage signal models at all levels. Let us suppose two different sources X_1 and X_2, and let us consider for simplicity the features Φ_j(x_i), j = 1, 2, i = 1, 2, x_i ∈ X_i, obtained by localizing the scattering features of two different resolutions at their corresponding sampling rates. Φ_1 therefore carries more discriminative and localized information than Φ_2.
In non-discriminative training, we train independent models for each source. Given training examples x^t_i from each source, we consider an NMF of each of the features Φ_j(x^t_i):
$$
\min_{D^j_i,\, Z^j_i \geq 0} \; \mathcal{D}\big( \Phi_j(x^t_i) \,|\, D^j_i Z^j_i \big) + \lambda^j_i\, \mathcal{R}(Z^j_i), \quad j = 1, 2,
$$
where the parameters λ^j_i control the sparsity-reconstruction trade-off in the sparse coding. In our experiments we used the same fixed value λ^j_i = λ for all of them. At test time, given y = x_1 + x_2, we estimate x̂_1, x̂_2 as the solution of
$$
\min_{\hat{x}_1, \hat{x}_2,\, Z^j_i \geq 0} \; \sum_{i=1,2} \sum_{j=1,2} \mathcal{D}\big( \Phi_j(\hat{x}_i) \,|\, D^j_i Z^j_i \big) + \lambda\, \mathcal{R}(Z^j_i) \quad \text{s.t.} \quad y = \hat{x}_1 + \hat{x}_2. \tag{6}
$$
Problem (6) is a coupled phase recovery problem under linear constraints. It can be solved using gradient descent as in Bruna & Mallat (2013a), but in our setting we use a greedy algorithm, which approximates the unknown complex phases using the phases of W_1 y and W_2 |W_1 y| respectively. Similarly to Weninger et al. (2014b), we simplify the inference by using a stronger version of the linear constraint y = x_1 + x_2, namely
$$
|W_1 y| \approx |W_1 \hat{x}_1| + |W_1 \hat{x}_2|,
$$
that is, we assume that destructive interference is negligible.
Discriminative Training
The pyramid scattering features can also be used to train end-to-end models. The simplest alternative is to train a DNN directly on features having the same temporal context as the second-layer scattering features. For simplicity, we replace the second layer of complex wavelets and modulus with a simple Haar transform:
$$
\left\{ \left( | x \ast \psi_\lambda | \ast h_k \right)(\Delta_1 t) \right\}_{\lambda \in \Lambda,\; k \leq J_2},
$$
where h_k is the Haar wavelet at scale 2^k, and we feed these features into a DNN with the same number of hidden units as before. We do not take the absolute value, as in standard scattering, in order to leave the DNN the chance to recombine coefficients before the first non-linearity. We report results for J_2 = 5, which corresponds to a temporal context of 130 ms. We refer to this alternative as DNN-multi. As a second example, we also consider a multi-resolution Convolutional Neural Network (CNN), constructed by creating contexts of three temporal frames at resolutions 2^j, j = 0, ..., J_2 = 5. We refer to this alternative as CNN-multi. This setting has the same temporal context as DNN-multi but, rather than imposing separable filters, leaves extra freedom. This architecture can access a relatively large temporal context with a small number of learnable parameters. Since the phase recovery problem cannot be approximated with soft masks as in (2), we use as the cost function the MSE of the reconstructed features at all resolutions.
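The Haar features above can be sketched as plain convolutions of a first-layer coefficient sequence with Haar wavelets at growing scales, keeping the sign; the function name and normalization are our assumptions:

```python
import numpy as np

def haar_features(x1, J2=5):
    """Convolve a 1-D first-layer coefficient sequence with Haar wavelets
    at scales 2^k, k = 1..J2; no modulus is taken, so a downstream network
    can recombine coefficients before its first non-linearity."""
    feats = []
    for k in range(1, J2 + 1):
        s = 2 ** k
        h = np.concatenate([np.ones(s // 2), -np.ones(s // 2)]) / s  # Haar at 2^k
        feats.append(np.convolve(x1, h, mode="same"))
    return np.stack(feats)      # shape (J2, len(x1))
```

Since each Haar filter is zero-mean, constant regions of the coefficient sequence map to zero, and only transients survive at each scale.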
Experiments
In this section we present an initial experimental evaluation in which we study the use of multi-resolution signal representations with both discriminative and non-discriminative training regimes. We compare the performance against some basic baseline settings.
As a proof of concept, we evaluated the different alternatives in a multi-speaker setting in which we aim at separating male and female speech. In each case, we trained two gender-specific models. The training data consists of recordings of a generic group of speakers per gender, none of whom were included in the test set. The experiments were carried out on the TIMIT corpus. We adopted the standard test-train division, using all the training recordings (containing 462 different speakers) for building the models and a subset of 12 different speakers (6 male and 6 female) for testing. For each speaker we randomly chose two clips and compared all female-male combinations (144 mixtures). All signals were mixed at 0 dB and resampled to 16 kHz. We used the source-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifact ratio (SAR) from the BSS-EVAL metrics (Vincent et al. (2006)). We report the average over both speakers, as the measures are not symmetric.
Non-discriminative settings: As a baseline for the non-discriminative setting we used standard NMF with an STFT of frame length 1024 samples and 50% overlap, leading to 513-dimensional feature vectors. The dictionaries were chosen with 200 and 400 atoms. We evaluated the proposed scattering features in combination with NMF (as described in Section 4.1) with one and two layers, referred to as scatt-NMF1 and scatt-NMF2 respectively. We use complex Morlet wavelets with Q_1 = 32 voices per octave in the first level, and dyadic Morlet wavelets (Q_2 = 1) for the second level; for a review on Morlet wavelets refer to Mallat (1999). The resulting representation had 175 coefficients for the first
Table 1: Source separation results in a multi-speaker setting. Average SDR, SIR and SAR (in dB) for different methods. The standard deviation of each result is shown in brackets.
level and around 2000 for the second layer. We used 400 atoms for scatt-NMF1 and 1000 atoms for scatt-NMF2. In all cases, the features were frame-wise normalized and we used λ = 0.1. Parameters were obtained using cross-validation on a few clips separated from the training data as a validation set.
Discriminative settings: We use single- and multi-frame DNNs as baselines for this training setting. The network architectures consist of two hidden layers using the outputs of the first layer of scattering, that is, the CQT coefficients at a given temporal position. They use ReLUs, as in the rest of the architectures, and the output is normalized so that it corresponds to the spectral mask discussed in (2). The multi-frame version takes the concatenation of 5 frames as input, matching the temporal context of the tested multi-resolution versions. We used 512 and 150 units for the single-frame DNN (referred to as CQT-DNN) and 1024 and 512 for the multi-frame one (referred to as CQT-DNN-5); increasing the number of parameters did not improve the results. We train the networks to minimize the MSE with respect to each of the sources. We also include the architectures DNN-multi and CNN-multi described in Section 4.2. In all cases the weights are randomly initialized and training is performed using stochastic gradient descent with momentum. We used the GPU-enabled package MatConvNet (Simonyan & Zisserman (2014)).
Table 1 shows the results obtained for the multi-speaker setting. In all cases we observe that the one-layer scattering transform outperforms the STFT in terms of SDR. Furthermore, there is a tangible gain in including a deeper representation; scatt-NMF2 always performs better than scatt-NMF1. While the gains in SDR and SAR are relatively small, the SIR is 3 dB higher. It is thus beneficial to consider a longer temporal context in order to perform the separation successfully.
On the other hand, as expected, discriminative training yields very significant improvements. The same reasons that produced the improvements in the non-discriminative setting also have an impact in the discriminative case. Adding enough temporal context to the neural regressors improves their performance. The multi-temporal representation plays a key role, as simply augmenting the number of frames does not lead to better performance (at least using baseline DNNs). It remains to be seen how these architectures compare with alternative RNN models.
Discussion
We have observed that the performance of baseline source separation algorithms can be improved by using a temporal multi-resolution representation. The representation is able to integrate information across longer temporal contexts while removing uninformative variability with a relatively low parameter budget. In line with recent findings in the literature, we have observed that including discriminative criteria in the training leads to significant improvements in the source separation performance. However, contrary to standard sparse modeling in which the resulting inference can be readily approximated with a neural network, it remains unclear whether phase-recovery type inference can also be efficiently approximated with neural network architectures. We believe there might still be a gap in performance that might be bridged with appropriate discriminative architectures.
While this report presents some promising initial results, several interesting comparisons remain to be made and are the subject of current research. We consider it an interesting problem to explore the best way of including long-term temporal consistency in the estimation. Recent studies have evaluated the use of deep RNNs for solving the source separation problem (Huang et al. (2014a); Weninger et al. (2014b)). While Huang et al. (2014a) do not observe significant improvements over standard DNNs in speech separation, Weninger et al. (2014b) obtain significant improvements using LSTM-DRNNs in speech enhancement. We are currently addressing the question of comparing different neural network architectures that exploit temporal dependencies and assessing whether the use of multi-resolution representations can play a role as in this initial study.
$$ y(t) = x_1(t) + x_2(t), $$ \tag{ssep}
$$ \hat{x}_i = \mathcal{S}^{-1} \left\{ M_i \circ \mathcal{S}\{y\} \right\}, \quad \textrm{with} \quad M_i = \frac{\Phi(\hat{x}_i)^p}{\sum_{l=1,2} \Phi(\hat{x}_l)^p}, $$ \tag{eq:rec}
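The masking in (eq:rec) reduces to an elementwise ratio of non-negative feature estimates; a small NumPy illustration with made-up magnitudes:

```python
import numpy as np

def soft_masks(phi1, phi2, p=1.0):
    """Masks M_i = Phi(x_i)^p / (Phi(x_1)^p + Phi(x_2)^p), computed
    elementwise; p = 1 gives ratio masks, p = 2 Wiener-like masks."""
    n1, n2 = phi1 ** p, phi2 ** p
    denom = n1 + n2 + np.finfo(float).eps  # avoid division by zero
    return n1 / denom, n2 / denom

# toy per-bin magnitudes for two sources (hypothetical values)
phi1 = np.array([1.0, 3.0, 0.0, 2.0])
phi2 = np.array([1.0, 1.0, 2.0, 0.0])
M1, M2 = soft_masks(phi1, phi2, p=2.0)
# the two masks are complementary: M1 + M2 == 1 at every bin
```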
$$
\min_{x'_i,\, Z_i \geq 0} \; \sum_{i=1,2} \mathcal{D}\!\left( \Phi(x'_i) \,|\, D_i Z_i \right) + \lambda \mathcal{R}(Z_i), \quad \textrm{s.t.}\; y = x'_1 + x'_2.
$$ \tag{ideal_model}
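For a concrete instance of the data-fitting term, the activations $Z_i$ can be inferred with the classical multiplicative updates for the KL divergence; a sketch assuming fixed dictionaries, and omitting the regularizer $\mathcal{R}$ and the mixing constraint for simplicity:

```python
import numpy as np

def nmf_kl_activations(V, D, n_iter=500, eps=1e-9):
    """Minimize D_KL(V | D Z) over Z >= 0 with the dictionary D fixed,
    using the standard multiplicative updates of Lee & Seung."""
    Z = np.ones((D.shape[1], V.shape[1]))
    col_sums = D.sum(axis=0)[:, None]  # normalizer in the update
    for _ in range(n_iter):
        R = D @ Z + eps                # current reconstruction
        Z *= (D.T @ (V / R)) / (col_sums + eps)
    return Z

rng = np.random.default_rng(1)
D = rng.random((64, 10))        # hypothetical dictionary: 64 bins, 10 atoms
Z_true = rng.random((10, 20))   # 20 frames of activations
V = D @ Z_true                  # exactly factorizable data
Z = nmf_kl_activations(V, D)
rel_err = np.abs(D @ Z - V).mean() / V.mean()
```

The updates are multiplicative, so non-negativity of Z is preserved automatically; this is the property the model above relies on.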
$$
W x = \left\{ x \ast \phi_1(t),\; x \ast \psi_\lambda(t) \right\}_{\lambda \in \Lambda}.
$$
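A crude stand-in for the operator $W$ can be built with Gaussian band-pass filters at dyadically spaced center frequencies, applied in the Fourier domain (the actual scattering transform uses analytic wavelets followed by complex modulus; the filter shapes and constants below are illustrative assumptions):

```python
import numpy as np

def filter_bank(x, n_filters=4):
    """First-layer decomposition {x * phi, x * psi_lambda}: one low-pass
    output plus band-pass outputs at geometrically spaced frequencies."""
    N = len(x)
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(N)
    # low-pass phi: Gaussian concentrated around zero frequency
    outputs = [np.fft.irfft(X * np.exp(-(freqs / 0.01) ** 2), N)]
    for j in range(n_filters):
        fc = 0.4 * 2.0 ** (-j)                         # dyadic centers
        H = np.exp(-(((freqs - fc) / (fc / 2)) ** 2))  # Gaussian band-pass psi
        outputs.append(np.fft.irfft(X * H, N))
    return np.stack(outputs)

x = np.sin(2 * np.pi * 0.1 * np.arange(256))  # pure tone, normalized freq 0.1
coeffs = filter_bank(x)
# the band whose center matches the tone captures most of the energy
```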
We have observed that the performance of baseline source separation algorithms can be improved by using a temporal multi-resolution representation. The representation is able to integrate information across longer temporal contexts while removing uninformative variability, with a relatively low parameter budget. In line with recent findings in the literature, we have observed that including discriminative criteria in the training leads to significant improvements in source separation performance. However, contrary to standard sparse modeling, in which the resulting inference can be readily approximated with a neural network, it remains unclear whether phase-recovery-type inference can also be efficiently approximated with neural network architectures. We believe there might still be a gap in performance that could be bridged with appropriate discriminative architectures.
While this report presents some promising initial results, several interesting comparisons remain to be made and are the subject of current research. We consider exploring the best way of incorporating long-term temporal consistency into the estimation an interesting open problem. Recent studies have evaluated the use of deep RNNs for solving the source separation problem (Huang et al. (2014a); Weninger et al. (2014b)). While Huang et al. (2014a) do not observe significant improvements over standard DNNs in speech separation, Weninger et al. (2014b) obtain significant improvements using LSTM-DRNNs in speech enhancement. We are currently comparing different neural network architectures that exploit temporal dependencies, and assessing whether the multi-resolution representation can play a role as in this initial study.
| Method | SDR | SIR | SAR |
|---|---|---|---|
| NMF | 6.1 [2.9] | 14.1 [3.8] | 7.4 [2.1] |
| scatt-NMF1 | 6.2 [2.8] | 13.5 [3.5] | 7.8 [2.2] |
| scatt-NMF2 | 6.9 [2.7] | 16.0 [3.5] | 7.9 [2.2] |
| CQT-DNN | 9.4 [3.0] | 17.7 [4.2] | 10.4 [2.6] |
| CQT-DNN-5 | 9.2 [2.8] | 17.4 [4.0] | 10.3 [2.4] |
| CQT-DNN-multi | 9.7 [3.0] | 19.6 [4.4] | 10.4 [2.7] |
| CQT-CNN-multi | 9.9 [3.1] | 19.8 [4.2] | 10.6 [2.8] |
Table 1: Source separation results in the multi-speaker setting. Average SDR, SIR and SAR (in dB) for different methods. The standard deviation of each result is shown in brackets.
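The metrics in Table 1 come from the BSS-eval framework of Vincent et al. (2006). As a rough illustration only, the following is a simplified SDR that lumps interference and artifacts into a single distortion term (the real toolkit decomposes the estimation error further to obtain SIR and SAR separately):

```python
import numpy as np

def sdr_db(s_est, s_ref):
    """Simplified signal-to-distortion ratio in dB: energy of the
    reference over energy of everything else in the estimate."""
    distortion = np.sum((s_est - s_ref) ** 2) + 1e-12
    return 10.0 * np.log10(np.sum(s_ref ** 2) / distortion)

t = np.arange(1000)
s = np.sin(2 * np.pi * 0.010 * t)              # reference source
est = s + 0.1 * np.sin(2 * np.pi * 0.070 * t)  # estimate with residual interference
# the residual carries 1% of the reference energy, so the SDR is about 20 dB
```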
$$ \left\| \,|W^{k}|x - |W^{k}|x'\, \right\| \leq \left\| \,|W^{k-1}|x - |W^{k-1}|x'\, \right\| \leq \left\| x - x' \right\|. $$ \tag{S3.Ex4}
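The non-expansiveness in (S3.Ex4) rests on the pointwise inequality | |a| - |b| | <= |a - b| for complex numbers; a quick numeric check in NumPy:

```python
import numpy as np

# the complex modulus is non-expansive: distances can only shrink,
# never grow, which is what keeps each extra scattering layer stable
rng = np.random.default_rng(2)
a = rng.standard_normal(100) + 1j * rng.standard_normal(100)
b = rng.standard_normal(100) + 1j * rng.standard_normal(100)
lhs = np.linalg.norm(np.abs(a) - np.abs(b))  # distance after modulus
rhs = np.linalg.norm(a - b)                  # distance before modulus
```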
$$ |W^{1}y|^{2} = |W^{1}x_{1}|^{2} + |W^{1}x_{2}|^{2}, $$ \tag{S4.Ex6}
$$ \min_{D_{1}\geq 0,\, D_{2}\geq 0} \; \mathcal{D}\!\left(\Phi(x_{1}) \,|\, D_{1} Z_{1}^{\ast}\right) + \alpha\, \mathcal{D}\!\left(\Phi(x_{2}) \,|\, D_{2} Z_{2}^{\ast}\right), $$
References
[deepscatt] Andén, J. and Mallat, S. Deep scattering spectrum. arXiv preprint arXiv:1304.6763, 2013.
[BL] Boulanger-Lewandowski, N., Mysore, G. J., and Hoffman, M. Exploiting long-term temporal dependencies in NMF using recurrent neural networks with application to source separation. In ICASSP, pp. 6969–6973, May 2014.
[icassp_sounds] Bruna, J. and Mallat, S. Audio texture synthesis with scattering moments. arXiv preprint arXiv:1311.0407, 2013a.
[pami] Bruna, J. and Mallat, S. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013b.
[icassp14] Bruna, J., Sprechmann, P., and LeCun, Y. Source separation with scattering non-negative matrix factorization. Submitted, 2014.
[superres] Dong, C., Loy, C. C., He, K., and Tang, X. Learning a deep convolutional network for image super-resolution. In Computer Vision – ECCV 2014, volume 8692 of Lecture Notes in Computer Science, pp. 184–199, 2014. doi: 10.1007/978-3-319-10593-2_13.
[DuanMS12] Duan, Z., Mysore, G. J., and Smaragdis, P. Online PLCA for real-time semi-supervised source separation. In LVA/ICA, pp. 34–41, 2012.
[fevotte2011majorization] Févotte, C. Majorization-minimization algorithm for smooth Itakura-Saito nonnegative matrix factorization. In ICASSP, pp. 1980–1983. IEEE, 2011.
[fevotte2011algorithms] Févotte, C. and Idier, J. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation, 23(9):2421–2456, 2011.
[icassp13a] Févotte, C., Le Roux, J., and Hershey, J. R. Non-negative dynamical system with application to speech and audio. In ICASSP, 2013.
[yonina] Gerchberg, R. W. and Saxton, W. O. A practical algorithm for the determination of the phase from image and diffraction plane pictures. Optik, 35:237–246, 1972.
[LecunNN] Gregor, K. and LeCun, Y. Learning fast approximations of sparse coding. In ICML, pp. 399–406, 2010.
[HanMP12] Han, J., Mysore, G. J., and Pardo, B. Audio imputation using the non-negative hidden Markov model. In LVA/ICA, pp. 347–355, 2012.
[hansler2008speech] Hänsler, E. and Schmidt, G. Speech and Audio Processing in Adverse Environments. Springer, 2008.
[Huang_DNN_Separation_ICASSP2014] Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. Deep learning for monaural speech separation. In ICASSP, pp. 1562–1566, 2014a.
[huang2014singing] Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. Singing-voice separation from monaural recordings using deep recurrent neural networks. In ISMIR, 2014b.
[NMF] Lee, D. D. and Seung, H. S. Learning parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
[loizou2007speech] Loizou, P. C. Speech Enhancement: Theory and Practice, volume 30. CRC, 2007.
[mairal2012task] Mairal, J., Bach, F., and Ponce, J. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791–804, 2012.
[wavelettour] Mallat, S. A Wavelet Tour of Signal Processing. Academic Press, 1999.
[eurispco] Mallat, S. Recursive interferometric representation. In Proc. EUSIPCO, Denmark, 2010.
[mohamed2012understanding] Mohamed, A., Hinton, G., and Penn, G. Understanding how deep belief networks perform acoustic modelling. In ICASSP, pp. 4273–4276. IEEE, 2012.
[mohammadiha2013supervised] Mohammadiha, N., Smaragdis, P., and Leijon, A. Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Transactions on Audio, Speech, and Language Processing, 21(10):2140–2151, 2013.
[MysoreS11] Mysore, G. J. and Smaragdis, P. A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics. In ICASSP, pp. 17–20, 2011.
[schmidt06speechseparation] Schmidt, M. N. and Olsson, R. K. Single-channel speech separation using sparse non-negative matrix factorization. In INTERSPEECH, Sep 2006.
[schmidt07mlsp] Schmidt, M. N., Larsen, J., and Hsiao, F.-T. Wind noise reduction using non-negative sparse coding. In MLSP, pp. 431–436, Aug 2007.
[deblur_mpi] Schuler, C., Hirsch, M., Harmeling, S., and Schölkopf, B. Learning to deblur. arXiv preprint arXiv:1406.7444, 2014.
[shashanka_icassp07] Shashanka, M. V. S., Raj, B., and Smaragdis, P. Sparse overcomplete decomposition for single channel speaker separation. In ICASSP, 2007.
[matconvnet] Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[smaragdis2014static] Smaragdis, P., Févotte, C., Mysore, G. J., Mohammadiha, N., and Hoffman, M. Static and dynamic source separation using nonnegative factorizations: A unified view. IEEE Signal Processing Magazine, 31(3):66–75, 2014.
[sprechmann2013learnable] Sprechmann, P., Bronstein, A., Bronstein, M., and Sapiro, G. Learnable low rank sparse models for speech denoising. In ICASSP, pp. 136–140, 2013.
[sprechmann2014supervised] Sprechmann, P., Bronstein, A. M., and Sapiro, G. Supervised non-Euclidean sparse NMF via bilevel optimization with applications to speech enhancement. In HSCMA, pp. 11–15. IEEE, 2014.
[sprechmann2012real] Sprechmann, P., Bronstein, A. M., and Sapiro, G. Real-time online singing voice separation from monaural recordings using robust low-rank modeling. In ISMIR, pp. 67–72, 2012.
[vincent2006performance] Vincent, E., Gribonval, R., and Févotte, C. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.
[weninger2014discriminative] Weninger, F., Le Roux, J., Hershey, J. R., and Watanabe, S. Discriminative NMF and its application to single-channel source separation. In Proc. ISCA Interspeech, 2014a.
[Weninger2014GlobalSIP12] Weninger, F., Le Roux, J., Hershey, J. R., and Schuller, B. Discriminatively trained recurrent neural networks for single-channel speech separation. In Proc. IEEE GlobalSIP Symposium on Machine Learning Applications in Speech Processing, 2014b.
[WilsonRSD08] Wilson, K. W., Raj, B., Smaragdis, P., and Divakaran, A. Speech denoising using nonnegative matrix factorization with priors. In ICASSP, pp. 4029–4032, 2008.