Korean AI Association



Speakers and Abstracts (the 18th)
Prof. Ji-Hwan Kim (Sogang University)
Title: Introduction to Speech Recognition


This lecture aims to build an understanding of the basic concepts of speech recognition. The speech recognition problem is defined from a sequence-to-sequence (Seq2Seq) perspective. We survey the main deep-learning-based speech recognition techniques, focusing on DNN-WFST, CTC, RNN-T, Seq2Seq, and Transformer; language models are covered with a focus on n-grams, and decoding networks with a focus on WFSTs.

Intended audience: those encountering the engineering formulation and algorithms of speech recognition for the first time. (Familiarity with MLPs and back-propagation is assumed; prior exposure to the basic concepts of RNNs, Seq2Seq, attention, and Transformers will help in following the lecture.)
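To make the n-gram language modeling mentioned in the abstract concrete, the sketch below estimates add-one (Laplace) smoothed bigram probabilities from a toy corpus; the corpus, function names, and smoothing choice are illustrative, not taken from the lecture:

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s>/</s>."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
    return uni, bi

def bigram_prob(uni, bi, prev, word, vocab_size):
    """Add-one smoothed bigram probability P(word | prev)."""
    return (bi[(prev, word)] + 1) / (uni[prev] + vocab_size)

corpus = [["i", "am", "sam"], ["sam", "i", "am"]]
uni, bi = train_bigram(corpus)
V = len(uni)  # vocabulary size, including <s> and </s>
p = bigram_prob(uni, bi, "i", "am", V)  # P(am | i)
```

In a real recognizer such probabilities would be trained on large text corpora and compiled, together with the lexicon, into a WFST decoding graph.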



2007–present: Professor, Sogang University (Dept. of Computer Science and Engineering / Dept. of Artificial Intelligence)
2001–2007: Principal/Senior Research Engineer, LG Electronics
1998–2001: Ph.D., Dept. of Engineering, University of Cambridge (field: speech recognition)
1992–1998: B.S./M.S., Dept. of Computer Science, KAIST (field: speaker recognition)
1990–1992: Seoul Science High School




Chanwoo Kim, Executive Vice President (Samsung Research)
Title: Overview of Speech Recognition Systems in Industry


In this talk, we give an overview of the contemporary end-to-end Automatic Speech Recognition (ASR) algorithms deployed in industry. Conventional ASR systems consist of multiple handcrafted components, but the introduction of fully neural sequence-to-sequence technologies has greatly simplified their structure while significantly improving performance. We survey the important end-to-end ASR architectures, including a stack of neural network layers trained with a Connectionist Temporal Classification (CTC) loss, the Recurrent Neural Network Transducer (RNN-T), the Transformer Transducer, the Conformer Transducer (Conformer-T), and models based on Attention-based Encoder-Decoder (AED). We also discuss practical issues such as reducing latency, improving performance on named entities, and biasing ASR results for specific domains. Finally, we describe model compression techniques to reduce the memory footprint and computational cost.
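A small illustration of the CTC models mentioned above: CTC decoding maps a frame-level label path to an output sequence by first merging consecutive repeats and then removing blanks. A minimal sketch of that collapse rule, with an illustrative blank symbol and helper name:

```python
from itertools import groupby

BLANK = "-"  # CTC blank symbol (illustrative choice)

def ctc_collapse(path):
    """Map a frame-level CTC path to an output string:
    1) merge consecutive repeated labels, 2) drop blanks."""
    merged = [label for label, _ in groupby(path)]
    return "".join(label for label in merged if label != BLANK)

# Different frame alignments collapse to the same transcript:
print(ctc_collapse("hh-e-ll-lo"))  # hello
print(ctc_collapse("h-ee-l-lo-"))  # hello
```

Note that a genuine double letter (the "ll" in "hello") can only survive if a blank separates the two occurrences in the path; otherwise the repeat-merging step collapses them.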


Chanwoo Kim is a corporate executive vice president at Samsung Research, where he leads the Language and Voice Team. He joined Samsung Research as a corporate vice president heading the Speech Processing Lab in February 2018. At Samsung Research he has led research on end-to-end speech recognition, end-to-end text-to-speech (TTS), machine translation, natural language understanding (NLU), language modeling (LM), question answering (QA), speech enhancement, keyword spotting, and more; most of these research outcomes have been commercialized in Samsung products. He was a software engineer on the Google speech team between February 2013 and February 2018, where he worked on acoustic modeling for speech recognition systems and on enhancing noise robustness using deep learning techniques. While at Google, he contributed to data augmentation and acoustic modeling for Google's speech recognition systems and to the commercialization of various Google AI speakers. He was a speech scientist at Microsoft from January 2011 to January 2013. Dr. Kim received his Ph.D. from the Language Technologies Institute, School of Computer Science, Carnegie Mellon University, in December 2010, and his B.S. and M.S. degrees in Electrical Engineering from Seoul National University in 1998 and 2001, respectively. His doctoral research focused on enhancing the robustness of automatic speech recognition systems in noisy environments. Between 2003 and 2005 he was a Senior Research Engineer at LG Electronics, where he worked primarily on embedded signal processing and protocol stacks for multimedia systems. Prior to his employment at LG, he worked at EdumediaTek and SK Teletech as an R&D engineer.



Prof. Joon Son Chung (KAIST)
Title: Self-supervised learning of audio and speech representations


Supervised learning with deep neural networks has brought phenomenal advances to many fields of research, but the performance of such systems relies heavily on the quality and quantity of annotated databases tailored to the particular application. It can be prohibitively difficult to manually collect and annotate databases for every task, and a plethora of data on the internet goes unused in machine learning for lack of such annotations. Self-supervised learning allows a model to learn representations using properties inherent in the data itself, such as natural co-occurrence. In this talk, I will introduce recent works on self-supervised learning of audio and speech representations. Recent work demonstrates that phonetic and semantic representations of audio and speech can be learnt from unlabelled audio and video. The learnt representations can be used for downstream tasks such as automatic speech recognition, speaker recognition, face recognition and lip reading. Other noteworthy applications include localizing sound sources in images and separating simultaneous speech from video.
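One common objective behind such co-occurrence-based self-supervised learning is a contrastive (InfoNCE-style) loss, which pulls together embeddings of naturally co-occurring pairs (e.g. audio and video from the same clip) and pushes apart mismatched pairs. A minimal sketch with toy vectors; the function names, embeddings, and temperature value are all illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive's
    similarity against similarities to all candidates."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

# Toy embeddings: the positive co-occurs with the anchor, negatives do not.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.2]]
loss = info_nce(anchor, positive, negatives)  # small when pairing is easy
```

Minimizing this loss over many anchor/positive pairs drives co-occurring audio and visual embeddings toward agreement without any manual labels.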



Joon Son Chung is an assistant professor at the School of Electrical Engineering, KAIST, where he directs the Multimodal AI Lab. Previously, he was a research scientist at Naver Corporation, where he managed the development of speech recognition models for various applications including Clova Note. He received his BA and PhD from the University of Oxford, working with Prof. Andrew Zisserman. He has published in top journals including TPAMI and IJCV, and has received best paper awards at Interspeech and ACCV. His research interests include speaker recognition, cross-modal learning, visual speech synthesis and audio-visual speech recognition.






Prof. Kibok Lee (Yonsei University)
Title: Recent Advances in Self-Supervised Deep Visual Representation Learning


Representation learning is a fundamental task in machine learning, as the success of machine learning relies on the quality of representations. Over the last decade, visual representation learning has typically been done by supervised learning, using deep convolutional neural network (CNN) features trained on large-scale labeled image datasets. However, recent works in self-supervised learning (SSL) suggest that representation learning without labels is effective, often outperforming its supervised counterpart. In this talk, I will first summarize recent advances in self-supervised representation learning for visual recognition. Then, I will describe their limitations and discuss several ways to improve them.



Kibok Lee is an assistant professor in the Department of Applied Statistics / Statistics and Data Science at Yonsei University. Previously he worked as an Applied Scientist on the Rekognition Team at Amazon Web Services (AWS AI). He received his Ph.D. from the Department of Computer Science and Engineering at the University of Michigan in 2020, advised by Prof. Honglak Lee. His research focuses on machine learning and computer vision, spanning deep representation learning, out-of-distribution detection, continual lifelong learning, and few-shot learning.