REAL-TIME TARGET SOUND EXTRACTION

Bandhav Veluri^◇, Justin Chan^◇, Malek Itani^◇, Tuochao Chen^◇, Takuya Yoshioka^●, Shyamnath Gollakota^◇

^◇ Paul G. Allen School of Computer Science & Engineering, University of Washington, USA
^● Microsoft, One Microsoft Way, Redmond, WA, USA

Abstract

We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions to cover large receptive fields in a computationally efficient manner, while retaining the performance benefits of transformer-based architectures. Our evaluations show improvements of as much as 2.2–3.3 dB in SI-SNRi over prior models for this task, with a 1.2–4x smaller model size and a 1.5–2x lower runtime.
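To make the receptive-field argument concrete, here is a minimal pure-Python sketch of a stack of dilated causal 1-D convolutions. It is illustrative only and is not the Waveformer implementation: the kernel size of 3, the dilations that double at each layer, and the function names `causal_dilated_conv`, `encoder_stack`, and `receptive_field` are all assumptions chosen for the example.

```python
def causal_dilated_conv(x, weights, dilation):
    """One causal dilated convolution over a list of samples.

    y[t] = sum_k weights[k] * x[t - k * dilation], treating samples
    before t = 0 as zero, so y[t] never depends on future input.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def encoder_stack(x, weights, dilations):
    """Apply one causal dilated convolution per entry in `dilations`."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence a single output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical configuration: 4 layers, kernel size 3, dilations 1, 2, 4, 8.
kernel = [1.0, 1.0, 1.0]
dilations = [2 ** i for i in range(4)]
rf = receptive_field(len(kernel), dilations)

# Feed a unit impulse: the response length equals the receptive field.
impulse = [1.0] + [0.0] * 39
response = encoder_stack(impulse, kernel, dilations)
print("receptive field:", rf)
print("nonzero response samples:", sum(1 for v in response if v != 0.0))
```

Because the dilation doubles at each layer, the receptive field grows exponentially with depth while the number of weights grows only linearly, which is the efficiency property the abstract refers to. Causality means each output sample can be emitted as soon as its input sample arrives, which is what makes streaming operation possible.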

[Paper] [Code] [Web App]
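For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how scale-invariant signal-to-noise ratio improvement (SI-SNRi) is commonly computed. It follows the standard textbook definition rather than the authors' evaluation code; the function names `si_snr` and `si_snr_improvement`, the `eps` constant, and the synthetic sine-wave signals are assumptions made for illustration.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    n = len(target)
    # Remove the mean from both signals.
    e_mean = sum(estimate) / n
    t_mean = sum(target) / n
    e = [v - e_mean for v in estimate]
    t = [v - t_mean for v in target]
    # Project the estimate onto the target to get the scaled target component.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    s_target = [dot / t_energy * b for b in t]
    # Everything left over counts as noise.
    noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(v * v for v in noise) + eps
    return 10.0 * math.log10(num / den + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the model output minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Synthetic example: the target and the interferer are orthogonal sinusoids.
N = 200
target = [math.sin(2 * math.pi * 5 * i / N) for i in range(N)]
interferer = [math.sin(2 * math.pi * 13 * i / N) for i in range(N)]
mixture = [t + v for t, v in zip(target, interferer)]
# Hypothetical model output that attenuates the interferer by a factor of 10.
estimate = [t + 0.1 * v for t, v in zip(target, interferer)]

improvement = si_snr_improvement(estimate, mixture, target)
print(f"SI-SNRi: {improvement:.2f} dB")
```

In this toy case the interferer's amplitude drops by a factor of 10, so its energy drops by a factor of 100 and the improvement is about 20 dB. Because the metric is scale-invariant, multiplying the estimate by any nonzero constant leaves the score unchanged.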

Single-target extraction samples

Inputs: ['Harmonica', 'Snare_drum', 'Violin_or_fiddle', 'Cowbell'] | Targets: ['Snare_drum']

Input mixture:

Ground-truth:

Output of Waveformer:

Output of Conv-TasNet:

Output of ReSepformer:

Inputs: ['Squeak', 'Fireworks', 'Bark', 'Microwave_oven', 'Computer_keyboard'] | Targets: ['Squeak']

Input mixture:

Ground-truth:

Output of Waveformer:

Output of Conv-TasNet:

Output of ReSepformer:

Two-target extraction samples

Inputs: ['Acoustic_guitar', 'Oboe', 'Bark', 'Writing', 'Finger_snapping'] | Targets: ['Bark', 'Writing']

Input mixture:

Ground-truth:

Output of Waveformer:

Output of Conv-TasNet:

Output of ReSepformer:

Inputs: ['Harmonica', 'Applause', 'Keys_jangling', 'Microwave_oven', 'Bass_drum'] | Targets: ['Bass_drum', 'Microwave_oven']

Input mixture:

Ground-truth:

Output of Waveformer:

Output of Conv-TasNet:

Output of ReSepformer:

Three-target extraction samples

Inputs: ['Shatter', 'Violin_or_fiddle', 'Knock', 'Bass_drum', 'Fireworks'] | Targets: ['Fireworks', 'Knock', 'Shatter']

Input mixture:

Ground-truth:

Output of Waveformer:

Output of Conv-TasNet:

Output of ReSepformer:

Inputs: ['Gunshot_or_gunfire', 'Drawer_open_or_close', 'Electric_piano', 'Cough', 'Burping_or_eructation'] | Targets: ['Drawer_open_or_close', 'Electric_piano', 'Gunshot_or_gunfire']

Input mixture:

Ground-truth:

Output of Waveformer:

Output of Conv-TasNet:

Output of ReSepformer: