A brief overview of selected research problems tackled by the Signal Processing Group is given below.
Sound source separation
Sound source separation refers to the problem of extracting the signals of individual sound sources from a microphone mixture. It is often performed in the so-called blind scenario, in which separation is carried out with limited or no additional information about the sources or the mixing process. The main applications of sound source separation include the recovery of individual instrument recordings from a professionally mixed music piece and the isolation of the speech of individual speakers from microphone recordings in the so-called cocktail party scenario with multiple simultaneously active speakers. An example of an under-determined scenario, in which we aim to extract the signals of four musical instruments from two microphone recordings, is shown.
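The under-determined setting can be illustrated with a toy instantaneous mixing model (a simplified sketch only; real recordings are convolutive, and all names and sizes below are illustrative):

```python
import numpy as np

# Toy instantaneous mixing model for the under-determined case:
# 4 sources captured by only 2 microphones.
rng = np.random.default_rng(0)
sources = rng.standard_normal((4, 16000))  # 4 source signals, 1 s at 16 kHz
A = rng.random((2, 4))                     # unknown 2 x 4 mixing matrix
mixture = A @ sources                      # the 2 observed microphone channels
```

Blind separation then amounts to recovering `sources` from `mixture` alone; with fewer microphones than sources the mixing matrix is not invertible, which is what makes the problem under-determined.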
In principle, separation may be carried out using beamforming, independent component analysis (ICA), non-negative matrix factorization (NMF), non-negative tensor factorization (NTF), or deep neural networks (DNNs). In the Signal Processing Group, research focuses primarily on the latter two approaches, which are popular in audio processing and machine learning. NTF is an unsupervised data decomposition technique that factorizes non-negative matrices containing the power spectra of an audio mixture into a sum of non-negative matrices representing source frequency profiles, time activations, and a mapping matrix. The method is widely used for source separation because it readily exploits pre-trained frequency profiles of the sources and can incorporate various types of priors: on the harmonic structure of the sources, on the smoothness of the factorized matrices, and on source localization. In particular, we have extended the so-called sub-source EM algorithm based on multiplicative update (MU) rules by incorporating smoothness constraints on the factorized matrices and including a localization prior. Current research efforts include DNN-based separation, with application to separating musical instruments from audio channels and to extracting individual speech signals from microphone recordings in cocktail party scenarios. Spectrograms and the non-negative matrices after decomposition are provided as an example.
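To make the factorization idea concrete, the following is a minimal sketch of plain NMF with the classical multiplicative update rules for the Euclidean cost, not the group's sub-source EM algorithm or its priors; the function name and toy data are illustrative:

```python
import numpy as np

def nmf_mu(V, n_components, n_iter=200, eps=1e-10, seed=0):
    """Factorize a non-negative spectrogram-like matrix V (frequency x time)
    into frequency profiles W and time activations H using the classical
    multiplicative update (MU) rules for the Euclidean cost."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, n_components)) + eps
    H = rng.random((n_components, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update time activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update frequency profiles
    return W, H

# Toy "power spectrogram": a rank-2 non-negative matrix built from
# two known frequency profiles and activations.
rng = np.random.default_rng(1)
V = rng.random((64, 2)) @ rng.random((2, 100))
W, H = nmf_mu(V, n_components=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The MU rules preserve non-negativity because each update multiplies the current factor by a non-negative ratio; priors such as smoothness enter in practice as additional terms in these ratios.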
Maximum Likelihood (ML) acoustic source localization in the spherical harmonic domain
The direction of arrival (DoA) of a source is one of the fundamental parameters of a multichannel audio recording and is commonly used both as an end result and as an input for further signal processing. When using circular or spherical microphone arrays, a change of the signal domain, e.g. to the spherical harmonic domain, may offer some advantages. The separation of the frequency-dependent and angle-dependent components in the signal model provides a straightforward way to extend classical narrowband localization algorithms, such as MUltiple SIgnal Classification (MUSIC), to the broadband case. In the group, we develop methods that are novel in the field of audio DoA estimation, such as Stochastic Maximum Likelihood (SML) and Deterministic Maximum Likelihood (DML). Their application has been shown to improve localization accuracy, to increase robustness in difficult acoustic conditions with reverberation and noise, and, contrary to most standard algorithms, to enable localization of correlated source signals. The following picture presents an example experimental setup, consisting of a spherical microphone array that records 4 sources emitting correlated signals, and the average angular localization error as a function of frequency for MUSIC, SML and DML.
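For reference, the narrowband MUSIC baseline mentioned above can be sketched as follows. This is a simplified sketch in the element domain for a uniform linear array, not the spherical harmonic formulation used by the group, and all array parameters are illustrative:

```python
import numpy as np

def music_doa(X, n_sources, mic_pos, freq, c=343.0):
    """Narrowband MUSIC pseudo-spectrum for a uniform linear array.
    X: (n_mics, n_snapshots) complex snapshots at one frequency bin."""
    n_mics = X.shape[0]
    R = X @ X.conj().T / X.shape[1]          # spatial covariance estimate
    _, vecs = np.linalg.eigh(R)              # eigenvalues in ascending order
    En = vecs[:, : n_mics - n_sources]       # noise-subspace eigenvectors
    k = 2 * np.pi * freq / c                 # wavenumber
    grid = np.linspace(-90.0, 90.0, 181)
    p = np.empty(grid.size)
    for i, theta in enumerate(np.deg2rad(grid)):
        a = np.exp(-1j * k * mic_pos * np.sin(theta))    # steering vector
        p[i] = 1.0 / np.linalg.norm(En.conj().T @ a) ** 2
    return grid, p

# Simulated 8-microphone array (4 cm spacing), one source at +30 degrees.
rng = np.random.default_rng(0)
mics = np.arange(8) * 0.04
freq = 1000.0
a0 = np.exp(-1j * (2 * np.pi * freq / 343.0) * mics * np.sin(np.deg2rad(30.0)))
s = rng.standard_normal(200) + 1j * rng.standard_normal(200)
noise = 0.1 * (rng.standard_normal((8, 200)) + 1j * rng.standard_normal((8, 200)))
X = np.outer(a0, s) + noise
grid, p = music_doa(X, n_sources=1, mic_pos=mics, freq=freq)
doa_est = grid[np.argmax(p)]   # peak of the pseudo-spectrum near the true DoA
```

MUSIC relies on the sample covariance having a well-defined noise subspace, which is exactly what breaks down for correlated (coherent) sources; this is the failure mode that the SML and DML estimators avoid.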
Sound source localization based on multichannel audio recorded with a microphone array integrated with a Micro Aerial Vehicle (MAV)
Acoustic sensing scenarios involving a microphone array mounted under a Micro Aerial Vehicle (MAV), commonly referred to as a drone, pose an emerging signal processing challenge and, as they gain importance, are becoming widely considered, for example in search and rescue operations. To overcome the heavy noise conditions encountered in this localization setup, which result from the MAV's propulsion system operating in close proximity to the microphone array, several solutions have been developed. The first, referred to as Spatio-Spectral Masking (SSM), estimates a set of frequency-dependent masks that are applied to the angular pseudo-spectra to attenuate the components corresponding to the MAV's ego-noise. SSM is based on Principal Component Analysis (PCA) of the angular pseudo-spectra computed from a database of noise-only recordings. The picture below presents example SSM masks: subfigures (a,d) present the principal components obtained for the first rotor, subfigures (b,e) show the SSM masks for the first rotor, and subfigures (c,f) show the final merged SSM masks. The upper row (a-c) and the bottom row (d-f) present the results for 1200 Hz and 2000 Hz, respectively. Crosses represent the positions of the rotors: a red cross denotes an active rotor and a black cross an inactive rotor. Black dots represent the positions of the microphones inside the array.
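The exact SSM mask construction is not detailed here, so the following is only a hypothetical sketch of the underlying PCA step: the leading components of noise-only angular pseudo-spectra indicate where ego-noise energy concentrates, and directions with high leverage in those components are attenuated. The function name, the variance threshold, and the mask formula are all illustrative assumptions:

```python
import numpy as np

def ssm_mask(noise_spectra, var_keep=0.9):
    """Hypothetical sketch: derive a per-direction mask from noise-only
    angular pseudo-spectra (n_frames x n_directions) at one frequency.
    Directions where the leading noise components concentrate their
    energy receive low mask values."""
    X = noise_spectra - noise_spectra.mean(axis=0)      # center per direction
    _, S, Vt = np.linalg.svd(X, full_matrices=False)    # PCA via SVD
    var = S**2 / (S**2).sum()
    r = int(np.searchsorted(np.cumsum(var), var_keep)) + 1  # components kept
    noise_energy = (Vt[:r] ** 2).sum(axis=0)            # per-direction weight
    return 1.0 - noise_energy / noise_energy.max()      # attenuate noisy DoAs

# Toy database of noise-only pseudo-spectra over 360 azimuth directions,
# with a simulated ego-noise hot spot around directions 90-110.
rng = np.random.default_rng(0)
spectra = rng.random((50, 360))
spectra[:, 90:110] += 5.0 * rng.random((50, 20))
mask = ssm_mask(spectra)
```

In this toy case the mask dips toward zero over the simulated hot spot, so multiplying a measured pseudo-spectrum by it suppresses the directions dominated by ego-noise.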
The second development, referred to as Time-Frequency Masking (TFM), is a method that detects which time-frequency bins carry the most useful information about the source location, allowing the ego-noise-contaminated narrowband angular pseudo-spectra to be excluded. An example of ideal and estimated masks is shown in the picture.
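The masking principle can be sketched with a simple per-frequency SNR threshold; the actual TFM estimator is not specified here, so the threshold rule, function name, and 6 dB margin below are illustrative assumptions:

```python
import numpy as np

def tf_mask(power_spec, noise_floor, snr_db=6.0):
    """Binary time-frequency mask: keep only bins whose power exceeds the
    estimated per-frequency ego-noise floor by at least snr_db. The
    narrowband pseudo-spectra of rejected bins would then be excluded
    from the broadband average."""
    thresh = noise_floor[:, None] * 10.0 ** (snr_db / 10.0)
    return power_spec > thresh

# Toy spectrogram (4 frequency bins x 3 frames): flat noise floor
# plus a single strong source bin at (1, 2).
power = np.ones((4, 3))
power[1, 2] = 10.0
mask = tf_mask(power, noise_floor=np.ones(4))
```

Only the bin that clears the noise floor by the chosen margin survives, so the subsequent broadband pseudo-spectrum is averaged over source-dominated bins rather than ego-noise.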
The last picture presents an example of a broadband spatial pseudo-spectrum without and with the application of TFM. The evaluation shows that the joint application of both masking techniques significantly improves the localization accuracy, in fact enabling successful localization.