Department of Architecture, Design and Media Technology

PhD Defence by Yang Xiang

On Friday, February 10th, Yang Xiang will defend his PhD Thesis: "Data-driven Speech Enhancement: from Non-negative Matrix Factorization to Deep Representation Learning."

SEMINAR ROOM: 4.521, RENDSBURGGADE 14, 9000 AALBORG, DENMARK.

  • 10.02.2023 13:00 – 16:30

  • English

  • On location

Title 

Data-driven Speech Enhancement: from Non-negative Matrix Factorization to Deep Representation Learning.

Program

13:00 – 13:05 Moderator Kamal Nasrollahi welcomes the guests

13:05 – 13:50 Presentation by Yang Xiang

13:50 – 14:05 Break

14:05 – 16:00 (at the latest) Questions

16:00 – 16:30 Assessment

16:30 Reception and announcement of the committee's decision

Assessment committee

Associate Professor Cumhur Erkut
Department of Architecture, Design & Media Technology, Aalborg University, Denmark

Professor Wenwu Wang
Centre for Vision, Speech and Signal Processing, University of Surrey, England

Professor Nilesh Madhu
IDLab, Dept. of Electronics & Information Systems, Universiteit Gent – imec, Belgium

Supervisors

Professor Mads Græsbøll Christensen
Department of Architecture, Design & Media Technology, Aalborg University, Denmark

Doctor Morten Højfeldt Rasmussen
Capturi A/S, Denmark

Doctor Jesper Lisby Højvang
Capturi A/S, Denmark

Information

The defence will be conducted in person.

If you wish to participate in the reception, please sign up via Doodle.

Abstract 

In natural listening environments, speech signals are easily distorted by various types of acoustic interference, which reduces speech quality and intelligibility for human listeners and also degrades the performance of many speech-related applications, such as automatic speech recognition (ASR). Thus, many speech enhancement (SE) algorithms have been developed over the past decades. However, most current SE algorithms struggle to capture underlying speech information (e.g., phonemes) during the SE process. This makes it challenging to determine what specific information is lost or distorted during enhancement, which limits the application of enhanced speech. For instance, some SE algorithms aimed at improving human listening often degrade ASR performance.

The objective of this dissertation is to develop SE algorithms that can capture various underlying speech representations (information) and improve the quality and intelligibility of noisy speech. The study starts by introducing the hidden Markov model (HMM) into the non-negative matrix factorization (NMF) model (NMF-HMM), because the HMM offers a convenient way to uncover underlying speech information for better SE performance. The key idea is to apply the HMM to capture the underlying temporal dynamics of speech within the NMF model. Additionally, a computationally efficient method is proposed to ensure that the NMF-HMM model can achieve fast online SE.
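
As a rough illustration of the dictionary-based NMF step that the NMF-HMM builds on, the sketch below enhances a noisy magnitude spectrogram using pre-trained speech and noise dictionaries and a Wiener-like mask. It is a minimal sketch only: the HMM state machinery and the fast online inference from the thesis are omitted, and all names (nmf_activations, enhance, W_speech, W_noise) are hypothetical.

import numpy as np

def nmf_activations(V, W, n_iter=100, eps=1e-10):
    # Estimate activations H >= 0 so that V ~= W @ H, using the
    # standard multiplicative updates for the KL divergence.
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
    return H

def enhance(V_noisy, W_speech, W_noise, eps=1e-10):
    # Factor the noisy magnitude spectrogram with a concatenated
    # [speech | noise] dictionary, then keep the speech share of
    # the reconstruction via a Wiener-like mask.
    W = np.hstack([W_speech, W_noise])
    H = nmf_activations(V_noisy, W)
    k = W_speech.shape[1]
    S_hat = W_speech @ H[:k]   # speech estimate
    N_hat = W_noise @ H[k:]    # noise estimate
    mask = S_hat / (S_hat + N_hat + eps)
    return mask * V_noisy      # enhanced magnitude spectrogram

In the NMF-HMM, each HMM state carries its own dictionary and the state sequence models the temporal dynamics, so the activations are inferred jointly with the state posterior rather than frame by frame as above.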

Although the NMF-HMM captures underlying speech information, it is difficult to explain what detailed information is obtained. In addition, the NMF-HMM cannot represent the underlying information in vector form, which makes information analysis difficult. To address these problems, we introduce deep representation learning (DRL) for SE. DRL can also improve the SE performance of deep neural network (DNN)-based algorithms, since it obtains a discriminative speech representation that reduces the requirements on the learning machine to perform a task successfully. Specifically, we propose a Bayesian permutation training variational autoencoder (PVAE) to analyze underlying speech information for SE, which can represent and disentangle the underlying information in noisy speech in vector form. The experimental results indicate that disentangled signal representations can also help current DNN-based SE algorithms achieve better SE performance. Additionally, based on this PVAE framework, we propose applying β-VAE and generative adversarial networks to improve PVAE's information disentanglement and signal restoration abilities, respectively.
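
To make the representation-learning side concrete, the following sketch shows a generic β-VAE for spectrogram frames, in which a weight β > 1 on the KL term encourages a more disentangled latent code. This is only an illustration of the underlying idea, not the thesis's Bayesian permutation training PVAE; the class and parameter names (VAE, n_freq, n_latent, beta) are hypothetical, and PyTorch is assumed.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, n_freq=257, n_latent=32):
        super().__init__()
        self.enc = nn.Linear(n_freq, 128)
        self.mu = nn.Linear(128, n_latent)
        self.logvar = nn.Linear(128, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent, 128), nn.ReLU(),
                                 nn.Linear(128, n_freq))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    # beta > 1 trades reconstruction fidelity for a more factorized
    # (disentangled) latent representation.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

The latent vector z here is the vector-form representation referred to above; disentangling it is what allows the speech and noise components of the signal to be analyzed separately.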