Deep Extraction for Robust Speech Recognition (DERASR)

Project title: Deep Extraction for Robust Speech Recognition
Project title in Polish: Głęboka ekstrakcja w celu niezawodnego rozpoznawania mowy
Project type: Research project
Funding Institution: National Science Centre, Poland (NCN)
Program: SONATA BIS
Project No: 2021/42/E/ST7/00452
Value: 1 878 000 PLN (436 745 EUR)
Duration: 2023 – 2028

Project team members:
dr hab. inż. Konrad Kowalczyk, prof. AGH – Principal Investigator (PI)
dr inż. Stanisław Kacprzak – Postdoctoral Researcher
dr inż. Marcin Witkowski – Postdoctoral Researcher
mgr inż. Mateusz Barański – PhD Student
mgr inż. Jan Jasiński – PhD Student
mgr inż. Julitta Bartolewska – PhD Student
mgr inż. Mieszko Fraś – PhD Student

Motivation
In recent years we have seen the emergence of various voice-based applications, devices, and services in which human speech is recognized automatically by a machine. Although automatic speech recognition (ASR) has been an active research field for several decades, a real surge of new research opportunities, but also challenges, has accompanied the emergence of deep learning and artificial intelligence. The performance of recent deep learning based ASR systems for speech of a single speaker in good acoustic conditions is approaching that of humans. However, when speech is recorded in real acoustic environments with interfering sound sources and background noise, and especially when several speakers are active at the same time, recognition performance typically drops dramatically, often rendering the technology unusable in such difficult acoustic conditions.

Project goal
The main goal of the project is to make automatic speech recognition markedly more robust in acoustically challenging conditions than is possible today. The research goal is to study these challenges and subsequently develop cutting-edge deep learning based models that perform deep extraction of representations which enable recognition of the desired speech with increased accuracy and robustness. The project's ambition is to enable voice-based human-machine communication to operate robustly in the majority of real-life scenarios by designing fully deep models for this task. We will also incorporate additional information about the speaker or source localization in order to improve speech recognition from single- and multichannel recordings of speech in acoustically difficult conditions. The project also aims to advance and create ASR models for the Polish language. The project results will lead to more robust DNN-based ASR models that are tailored to more difficult but also more natural application scenarios.