Results and Demos

Madhusudana Shashanka

This page lists results, example waveforms, and other details for some of the projects I have worked on. Click on one of the projects listed below.

Separating a Foreground Singer from Background Music
Sparse Overcomplete Decomposition for Speaker Separation
Latent Dirichlet Decomposition for Single Channel Speaker Separation (in ICASSP 2006)

Music Analysis

Below are examples of a lead vocalist or lead guitar separated from music clips using probabilistic latent decomposition. The attempts differ in the number of basis vectors used to model the foreground and the background.

Song clip - from the song "Raise my Rent" by David Gilmour.

Attempt 1    Lead Guitar    Background Music

Attempt 2    Lead Guitar    Background Music

Song clip - from the song "Bande" from the soundtrack of the Hindi movie "Black Friday".

Attempt 1    Lead Singer    Background Music

Attempt 2    Lead Singer    Background Music

Attempt 3    Lead Singer    Background Music

Song clip - from the song "Sayonee" by the band Junoon.

Attempt 1    Lead Singer    Background Music

Song clip - from the song "Sunrise" by Norah Jones.

Attempt 1    Lead Singer    Background Music

Song clip - from the song "Super Freak" by Rick James.

Attempt 1    Lead Singer    Background Music

Sparse Decomposition

Below are example reconstructions using a compact code of 100 basis functions (denoted CC) and a sparse-distributed code of 1000 basis functions (denoted SC). The two metrics in parentheses indicate the quality of separation: SNR is the signal-to-noise ratio improvement and SER is the speaker energy ratio (refer to the paper for details). Note the improvement obtained with the sparse code over the compact code.

 Mixture

CC    Speaker 1 (SNR = 5.2462 dB, SER = 3.9108 dB)     Speaker 2 (SNR = 5.3099 dB, SER = 3.8762 dB)

SC    Speaker 1 (SNR = 8.0237 dB, SER = 7.1185 dB)     Speaker 2 (SNR = 8.3280 dB, SER = 6.0294 dB)

 Mixture

CC    Speaker 1 (SNR = 4.5151 dB, SER = 3.1108 dB)     Speaker 2 (SNR = 4.5222 dB, SER = 3.7220 dB)

SC    Speaker 1 (SNR = 7.7060 dB, SER = 6.5171 dB)     Speaker 2 (SNR = 8.3466 dB, SER = 5.3747 dB)

 Mixture

CC    Speaker 1 (SNR = 3.8210 dB, SER = 2.4190 dB)     Speaker 2 (SNR = 3.8034 dB, SER = 3.1263 dB)

SC    Speaker 1 (SNR = 8.9056 dB, SER = 6.6871 dB)     Speaker 2 (SNR = 9.1632 dB, SER = 6.4651 dB)

 Mixture

CC    Speaker 1 (SNR = 2.6348 dB, SER = 0.8491 dB)     Speaker 2 (SNR = 2.6848 dB, SER = 0.8005 dB)

SC    Speaker 1 (SNR = 5.5570 dB, SER = 3.2883 dB)     Speaker 2 (SNR = 5.3339 dB, SER = 3.8060 dB)

 Mixture

CC    Speaker 1 (SNR = 3.5512 dB, SER = 1.5516 dB)     Speaker 2 (SNR = 3.2103 dB, SER = 1.6673 dB)

SC    Speaker 1 (SNR = 5.6195 dB, SER = 3.3031 dB)     Speaker 2 (SNR = 5.4566 dB, SER = 4.0793 dB)

 Mixture

CC    Speaker 1 (SNR = 2.4406 dB, SER = 1.2180 dB)     Speaker 2 (SNR = 2.6402 dB, SER = 0.8387 dB)

SC    Speaker 1 (SNR = 5.0141 dB, SER = 2.6416 dB)     Speaker 2 (SNR = 4.6933 dB, SER = 3.9433 dB)

 Mixture

CC    Speaker 1 (SNR = 2.2041 dB, SER = 0.9382 dB)     Speaker 2 (SNR = 2.4872 dB, SER = 0.5929 dB)

SC    Speaker 1 (SNR = 5.2073 dB, SER = 2.9452 dB)     Speaker 2 (SNR = 4.9007 dB, SER = 3.9773 dB)

 Mixture

CC    Speaker 1 (SNR = 2.0526 dB, SER = 0.3425 dB)     Speaker 2 (SNR = 1.9835 dB, SER = 0.3993 dB)

SC    Speaker 1 (SNR = 5.1836 dB, SER = 3.2031 dB)     Speaker 2 (SNR = 4.8766 dB, SER = 4.1478 dB)

 Mixture

CC    Speaker 1 (SNR = 2.2411 dB, SER = 0.9203 dB)     Speaker 2 (SNR = 2.4314 dB, SER = 0.6920 dB)

SC    Speaker 1 (SNR = 3.4108 dB, SER = 4.0684 dB)     Speaker 2 (SNR = 4.0371 dB, SER = 1.9665 dB)
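The comparison between the two codes can be summarized with a short calculation. The sketch below copies the SNR figures from the nine mixtures listed above (Speaker 1 then Speaker 2 for each mixture) and reports the average gain of the sparse code over the compact code. The variable names are illustrative and not taken from the paper.

```python
from statistics import mean

# SNR improvements in dB, copied from the listing above.
# Order: Speaker 1, Speaker 2 for each of the nine mixtures.
snr_cc = [5.2462, 5.3099, 4.5151, 4.5222, 3.8210, 3.8034,
          2.6348, 2.6848, 3.5512, 3.2103, 2.4406, 2.6402,
          2.2041, 2.4872, 2.0526, 1.9835, 2.2411, 2.4314]
snr_sc = [8.0237, 8.3280, 7.7060, 8.3466, 8.9056, 9.1632,
          5.5570, 5.3339, 5.6195, 5.4566, 5.0141, 4.6933,
          5.2073, 4.9007, 5.1836, 4.8766, 3.4108, 4.0371]

# Average gain of the sparse code (SC) over the compact code (CC).
gain = mean(snr_sc) - mean(snr_cc)

# Check whether SC beats CC for every individual reconstruction.
always_better = all(sc > cc for cc, sc in zip(snr_cc, snr_sc))

print(f"mean CC SNR: {mean(snr_cc):.2f} dB")
print(f"mean SC SNR: {mean(snr_sc):.2f} dB")
print(f"mean gain:   {gain:.2f} dB, SC better in every case: {always_better}")
```

The same calculation can be repeated for the SER figures by swapping in those values.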

Latent Dirichlet Decomposition for Single Channel Speaker Separation

A set of utterances from the TIMIT database comprising approximately 25 seconds of speech was used as training data for each speaker. All signals were normalized to zero mean and unit variance to ensure a uniform signal level. Signals were analyzed in 64 ms windows with 32 ms overlap between windows. Spectral vectors were modelled by a mixture of 25 multinomial distributions. The training samples are given below.

Speaker 1      Speaker 2      Speaker 3      Speaker 4      Speaker 5
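As a rough illustration of the analysis setup described above, the sketch below splits a signal into 64 ms windows with 32 ms overlap. It assumes a 16 kHz sampling rate (the rate of the TIMIT recordings), which gives 1024-sample windows advanced by 512 samples. The function name and parameters are hypothetical, and no window function or spectral transform is applied here.

```python
def frame_signal(samples, sample_rate=16000, window_ms=64, overlap_ms=32):
    """Split a signal into overlapping analysis windows.

    Assumes a 16 kHz sampling rate by default; the hop between windows
    is the window length minus the overlap.
    """
    window = sample_rate * window_ms // 1000                # 1024 samples at 16 kHz
    hop = sample_rate * (window_ms - overlap_ms) // 1000    # 512 samples at 16 kHz
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, hop)]


# One second of (silent) audio at 16 kHz.
one_second = [0.0] * 16000
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))  # 30 frames of 1024 samples
```

Trailing samples that do not fill a whole window are dropped in this sketch; padding the final window is an equally reasonable choice.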

Mixed signals were obtained by digitally adding the test signals of the two speakers. The length of the mixed signal was set to that of the shorter of the two signals. The component signals were all normalized to zero mean and unit variance prior to addition, resulting in mixed signals with 0 dB SNR for each speaker. The results of separating the mixed signals are given below. The average SNR improvement over the mixed signal is given next to each reconstruction.
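The mixing procedure described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function names are hypothetical, and truncating before normalizing is an assumption made here so that both components have exactly equal energy. The SNR is computed as the usual energy ratio in decibels, treating the other speaker as noise.

```python
import math


def normalize(samples):
    """Scale a signal to zero mean and unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    std = math.sqrt(sum(c * c for c in centered) / n)
    return [c / std for c in centered]


def mix(signal_a, signal_b):
    """Truncate to the shorter signal, normalize both, and add them.

    Assumption: truncation happens before normalization.
    Returns the mixture and the two normalized components.
    """
    length = min(len(signal_a), len(signal_b))
    a = normalize(signal_a[:length])
    b = normalize(signal_b[:length])
    mixed = [x + y for x, y in zip(a, b)]
    return mixed, a, b


def snr_db(target, interference):
    """Ratio of target energy to interference energy, in decibels."""
    target_energy = sum(t * t for t in target)
    interference_energy = sum(i * i for i in interference)
    return 10 * math.log10(target_energy / interference_energy)


# Two synthetic "speakers" of different lengths and levels.
speaker_1 = [3.0 * math.sin(0.05 * n) for n in range(5000)]
speaker_2 = [0.2 * math.sin(0.11 * n) + 1.0 for n in range(4000)]

mixed, s1, s2 = mix(speaker_1, speaker_2)
print(len(mixed), round(snr_db(s1, s2), 6))
```

Because both components have unit variance over the same number of samples, their energies match and the SNR of each speaker in the mixture is 0 dB up to floating-point error.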

NOTE: The SNR improvements reported in Figure 4 of the paper are erroneous due to a bug in the calculation. Please disregard them.

Mixture12      Reconstructed 1 5.0648 dB     Reconstructed 2  4.8282 dB

Mixture23      Reconstructed 2 6.1495 dB     Reconstructed 3  4.1943 dB

Mixture31      Reconstructed 3 5.3696 dB     Reconstructed 1  4.4613 dB

Mixture14      Reconstructed 1 5.2382 dB     Reconstructed 4  5.5066 dB

Mixture35      Reconstructed 3 4.7242 dB     Reconstructed 5  4.9549 dB

Mixture45      Reconstructed 4 4.1662 dB     Reconstructed 5  6.6353 dB

Mixture25      Reconstructed 2 4.6745 dB     Reconstructed 5  4.5936 dB