Department of Architecture, Design and Media Technology
PhD Defence by Yang Xiang

SEMINAR ROOM: 4.521, RENDSBURGGADE 14, 9000 AALBORG, DENMARK.
10.02.2023 13:00 - 16:30
English
On location
Title
Data-driven Speech Enhancement: from Non-negative Matrix Factorization to Deep Representation Learning.
Program
13:00 – 13:05 Moderator Kamal Nasrollahi welcomes the guests
13:05 – 13:50 Presentation by Yang Xiang
13:50 – 14:05 Break
14:05 – 16:00 (at the latest) Questions
16:00 – 16:30 Assessment
16:30 Reception and announcement from the committee
Assessment committee
Associate Professor Cumhur Erkut
Department of Architecture, Design & Media Technology, Aalborg University, Denmark
Professor Wenwu Wang
Centre for Vision, Speech and Signal Processing, University of Surrey, England
Professor Nilesh Madhu
IDLab, Dept. of Electronics & Information Systems, Universiteit Gent – imec, Belgium
Supervisors
Professor Mads Græsbøll Christensen
Department of Architecture, Design & Media Technology, Aalborg University, Denmark
Doctor Morten Højfeldt Rasmussen
Capturi A/S, Denmark
Doctor Jesper Lisby Højvang
Capturi A/S, Denmark
Information
The defence will be conducted in person.
If you wish to participate in the reception, please sign up via Doodle.
Abstract
In natural listening environments, speech signals are easily distorted by various kinds of acoustic interference, which reduces speech quality and intelligibility for human listeners and degrades many speech-related applications, such as automatic speech recognition (ASR). Thus, many speech enhancement (SE) algorithms have been developed over the past decades. However, most current SE algorithms struggle to capture underlying speech information (e.g., phonemes) during the SE process. This makes it difficult to know what specific information is lost or interfered with during enhancement, which limits the use of the enhanced speech. For instance, some SE algorithms aimed at improving human listening can actually degrade ASR performance.
The objective of this dissertation is to develop SE algorithms that can capture various underlying speech representations (information) and improve the quality and intelligibility of noisy speech. The study starts by introducing the hidden Markov model (HMM) into the non-negative matrix factorization (NMF) model (NMF-HMM), because the HMM is a convenient way to uncover underlying speech information for better SE performance. The key idea is to apply the HMM to capture the underlying temporal dynamics of speech within the NMF model. Additionally, a computationally efficient method is proposed to ensure that the NMF-HMM model can achieve fast online SE.
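As a rough illustration of the dictionary-based decomposition that the NMF-HMM builds on, the sketch below implements plain NMF-based enhancement with a Wiener-like mask. It is not the dissertation's model (it omits the HMM state-dependent dictionaries and temporal dynamics), and the function names `nmf` and `enhance_magnitude` are purely illustrative.

```python
import numpy as np

def nmf(V, rank=None, W=None, n_iter=200, eps=1e-10):
    """Euclidean NMF with multiplicative updates.

    If W is given, the dictionary is kept fixed and only the
    activations H are estimated (the usual SE test-time setting).
    """
    rng = np.random.default_rng(0)
    learn_W = W is None
    if learn_W:
        W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        if learn_W:
            W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def enhance_magnitude(noisy_mag, W_speech, W_noise, eps=1e-10):
    """Wiener-like masking from fixed speech and noise dictionaries."""
    W = np.concatenate([W_speech, W_noise], axis=1)
    _, H = nmf(noisy_mag, W=W)           # infer activations only
    k = W_speech.shape[1]
    speech_hat = W_speech @ H[:k]
    noise_hat = W_noise @ H[k:]
    mask = speech_hat / (speech_hat + noise_hat + eps)
    return mask * noisy_mag

# Offline training (illustrative): W_speech, _ = nmf(clean_mag, rank=40)
#                                  W_noise, _  = nmf(noise_mag, rank=20)
```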
Although the NMF-HMM captures underlying speech information, it is difficult to explain what detailed information is obtained. In addition, the NMF-HMM cannot represent the underlying information in vector form, which makes information analysis difficult. To address these problems, we introduce deep representation learning (DRL) for SE. DRL can also improve the SE performance of DNN-based algorithms, since it yields a discriminative speech representation that reduces the requirements on the learning machine to perform a task successfully. Specifically, we propose a Bayesian permutation training variational autoencoder (PVAE) to analyze underlying speech information for SE, which can represent and disentangle the underlying information in noisy speech in vector form. The experimental results indicate that disentangled signal representations can also help current DNN-based SE algorithms achieve better SE performance. Additionally, based on this PVAE framework, we propose applying the β-VAE and generative adversarial networks to improve the PVAE's information disentanglement and signal restoration abilities, respectively.
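To make the representation-learning idea concrete, here is a minimal β-VAE over magnitude-spectrogram frames. It is a generic sketch under standard VAE assumptions, not the Bayesian PVAE of the thesis: it learns a single latent vector per frame and omits the permutation training that disentangles speech and noise representations. The class and parameter names (`SpectrogramVAE`, `latent_dim`, `beta`) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramVAE(nn.Module):
    """Minimal beta-VAE over magnitude-spectrogram frames (illustrative)."""

    def __init__(self, n_freq=257, latent_dim=32, beta=1.0):
        super().__init__()
        self.beta = beta                       # beta > 1 favours disentanglement
        self.encoder = nn.Sequential(nn.Linear(n_freq, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, n_freq), nn.Softplus())  # non-negative magnitudes

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        return self.decoder(z), mu, logvar

    def loss(self, x):
        recon, mu, logvar = self(x)
        rec = F.mse_loss(recon, x, reduction="sum")               # reconstruction term
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL term
        return rec + self.beta * kl
```

In this setting the latent vector `z` plays the role of the vector-form representation discussed above; the PVAE framework goes further by splitting and supervising such latents so that speech and noise components can be analysed separately.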