

Korean AI Association


Domestic Conference

Speakers and Abstracts (18th)
 
Prof. Ji-Hwan Kim (Sogang University)
 
Title: Introduction to Speech Recognition
 

Abstract

À½¼ºÀνÄÀÇ ±âº» °³³äÀ» ÀÌÇØÇÏ´Â °ÍÀ» ¸ñÇ¥·Î ÇÑ´Ù. Seq2Seq °üÁ¡¿¡¼­ À½¼ºÀνÄÀÇ ¹®Á¦¸¦ Á¤ÀÇÇÑ´Ù. µö·¯´× ±â¹ÝÀÇ À½¼ºÀÎ½Ä ÁÖ¿ä ±â¼úÀ» DNN-WFST, CTC, RNN-T, Seq2Seq, Tramsformer Áß½ÉÀ¸·Î »ìÆ캸°í, ¾ð¾î¸ðµ¨Àº n-gram Áß½ÉÀ¸·Î, µðÄÚµù ³×Æ®¿öÅ©´Â WFST Áß½ÉÀ¸·Î °øºÎÇÑ´Ù.

Intended audience: those encountering the engineering formulation and algorithms of speech recognition for the first time (the lecture assumes familiarity with MLPs and back-propagation; prior exposure to the basic concepts of RNN, Seq-to-seq, attention, and Transformer will help in following the lecture content).
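As a concrete illustration of one of the techniques the lecture covers, the following is a minimal sketch of computing a CTC loss over a batch of acoustic frames using PyTorch's built-in `torch.nn.CTCLoss`. The vocabulary size, frame counts, and tensor shapes are arbitrary placeholders chosen for the example, not values from the lecture.

```python
import torch
import torch.nn as nn

# Minimal CTC sketch: an acoustic model emits per-frame log-probabilities over
# a label vocabulary plus a blank symbol, and CTC marginalizes over alignments.
vocab_size = 30          # placeholder: 29 labels + 1 blank (index 0)
num_frames = 100         # placeholder number of acoustic frames per utterance
batch_size = 4
max_target_len = 20

# Per-frame log-probabilities, shape (T, N, C) as expected by nn.CTCLoss.
log_probs = torch.randn(
    num_frames, batch_size, vocab_size, requires_grad=True
).log_softmax(dim=-1)

# Random integer label sequences (1..vocab_size-1; 0 is reserved for blank).
targets = torch.randint(1, vocab_size, (batch_size, max_target_len))
input_lengths = torch.full((batch_size,), num_frames, dtype=torch.long)
target_lengths = torch.randint(10, max_target_len + 1, (batch_size,))

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # in real training, gradients flow back into the acoustic model
print(loss.item())
```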

 

Bio

2007~present: Professor, Sogang University (Dept. of Computer Science and Engineering / Dept. of Artificial Intelligence)
2001~2007: Principal/Senior Research Engineer, LG Electronics
1998~2001: Ph.D., Dept. of Engineering, University of Cambridge (field: speech recognition)
1992~1998: B.S./M.S., Dept. of Computer Science, KAIST (field: speaker recognition)
1990~1992: Seoul Science High School

 

 

 


 
Chanwoo Kim, Executive Vice President (Samsung Research)
 
Title: Overview of Speech Recognition Systems in Industry
 

Abstract

In this talk, we give an overview of the contemporary end-to-end Automatic Speech Recognition (ASR) algorithms deployed in industry. Conventional ASR systems consist of multiple handcrafted components; the introduction of fully neural sequence-to-sequence technologies, however, has greatly simplified the structure while significantly improving the performance. We review the important end-to-end ASR structures, including a stack of neural network layers trained with a Connectionist Temporal Classification (CTC) loss, the Recurrent Neural Network Transducer (RNN-T), the Transformer Transducer, the Conformer Transducer (Conformer-T), and models based on the Attention-based Encoder-Decoder (AED) approach. We also discuss practical issues such as reducing latency, improving performance on named entities, and biasing ASR results toward specific domains. Finally, we describe model compression techniques that reduce the memory footprint and computational cost.
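The model-compression theme at the end of the abstract can be illustrated with a small, generic sketch (not any specific production pipeline): post-training dynamic quantization of the linear layers of a toy encoder with PyTorch, which stores weights in int8 and shrinks the memory footprint. The layer sizes below are made up for the example.

```python
import io
import torch
import torch.nn as nn

# Toy "encoder": a stack of linear layers standing in for an ASR model's
# feed-forward blocks. This only illustrates the compression step.
model = nn.Sequential(
    nn.Linear(512, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 512),
)

# Post-training dynamic quantization: weights of nn.Linear modules are stored
# in int8 and dequantized on the fly during matmul, reducing model size.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def serialized_bytes(m: nn.Module) -> int:
    # Serialized state_dict size is a simple proxy for the memory footprint.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes

print("fp32 bytes:", serialized_bytes(model))
print("int8 bytes:", serialized_bytes(quantized))

# The quantized model is used like the original one at inference time.
with torch.no_grad():
    out = quantized(torch.randn(1, 512))
print(out.shape)
```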
 

Bio

Chanwoo Kim has been a corporate executive vice president at Samsung Research, leading the Language and Voice Team. He joined Samsung Research as a corporate vice president heading the Speech Processing Lab in Feb. 2018. At Samsung Research, he has been leading research on end-to-end speech recognition, end-to-end text-to-speech (TTS), machine translation, Natural Language Understanding (NLU), Language Modeling (LM), Question Answering (QA), speech enhancement, keyword spotting, and more. Most of these research outcomes have been commercialized in Samsung products. He was a software engineer on the Google speech team between Feb. 2013 and Feb. 2018, where he worked on acoustic modeling for speech recognition systems and on enhancing noise robustness using deep learning techniques. While at Google, he contributed to data augmentation and acoustic modeling for Google speech recognition systems, and to the commercialization of various Google AI speakers and Google speech recognition systems. He was a speech scientist at Microsoft from Jan. 2011 to Jan. 2013. Dr. Kim received his Ph.D. from the Language Technologies Institute, School of Computer Science, Carnegie Mellon University in Dec. 2010. He received his B.S. and M.S. degrees in Electrical Engineering from Seoul National University in 1998 and 2001, respectively. His doctoral research focused on enhancing the robustness of automatic speech recognition systems in noisy environments. Between 2003 and 2005, Dr. Kim was a Senior Research Engineer at LG Electronics, where he worked primarily on embedded signal processing and protocol stacks for multimedia systems. Prior to his employment at LG, he worked for EdumediaTek and SK Teletech as an R&D engineer.

 


 

Prof. Joon Son Chung (KAIST)
 
Title: Self-Supervised Learning of Audio and Speech Representations
 

Abstract

Supervised learning with deep neural networks has brought phenomenal advances to many fields of research, but the performance of such systems relies heavily on the quality and quantity of annotated databases tailored to the particular application. It can be prohibitively difficult to manually collect and annotate databases for every task, and a plethora of data on the internet goes unused in machine learning for lack of such annotations. Self-supervised learning allows a model to learn representations using properties inherent in the data itself, such as natural co-occurrence. In this talk, I will introduce recent works on self-supervised learning of audio and speech representations. Recent works demonstrate that phonetic and semantic representations of audio and speech can be learnt from unlabelled audio and video. The learnt representations can be used for downstream tasks such as automatic speech recognition, speaker recognition, face recognition and lip reading. Other noteworthy applications include localizing sound sources in images and separating simultaneous speech from video.
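One common way to exploit the "natural co-occurrence" mentioned above is a contrastive (InfoNCE-style) objective that pulls together embeddings of the audio and the video from the same clip and pushes apart embeddings from different clips. The sketch below shows only the loss computation, with placeholder linear encoders and made-up feature and embedding sizes standing in for real audio and video networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Contrastive audio-visual sketch: the audio and the video from the same clip
# (same batch index) form a positive pair; all other pairs in the batch are
# negatives. Encoders here are placeholders (single linear projections).
batch_size, audio_dim, video_dim, embed_dim = 8, 40, 512, 128

audio_encoder = nn.Linear(audio_dim, embed_dim)   # stand-in for a real audio network
video_encoder = nn.Linear(video_dim, embed_dim)   # stand-in for a real video network

audio_feats = torch.randn(batch_size, audio_dim)  # e.g. pooled filterbank features
video_feats = torch.randn(batch_size, video_dim)  # e.g. pooled video frame features

a = F.normalize(audio_encoder(audio_feats), dim=-1)
v = F.normalize(video_encoder(video_feats), dim=-1)

temperature = 0.07
logits = a @ v.t() / temperature                  # (batch, batch) similarity matrix
labels = torch.arange(batch_size)                 # positives lie on the diagonal

# Symmetric InfoNCE: audio-to-video and video-to-audio retrieval directions.
loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
loss.backward()
print(loss.item())
```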

 

Bio

Joon Son Chung is an assistant professor at the School of Electrical Engineering, KAIST, where he directs the Multimodal AI Lab. Previously, he was a research scientist at Naver Corporation, where he managed the development of speech recognition models for various applications including Clova Note. He received his BA and PhD from the University of Oxford, working with Prof. Andrew Zisserman. He has published in top journals including TPAMI and IJCV, and has received best paper awards at Interspeech and ACCV. His research interests include speaker recognition, cross-modal learning, visual speech synthesis and audio-visual speech recognition.

 

 

 

 

 


 
 
Prof. Kibok Lee (Yonsei University)
 
Title: Recent Advances in Self-Supervised Deep Visual Representation Learning
 

Abstract

Representation learning is a fundamental task in machine learning, as the success of machine learning relies on the quality of the representation. Over the last decade, visual representation learning has usually been done by supervised learning, using deep convolutional neural network (CNN) features trained on large-scale labeled image datasets. However, recent works in self-supervised learning (SSL) suggest that representation learning without labels is effective, often outperforming its supervised counterpart. In this talk, I will first summarize recent advances in self-supervised representation learning for visual recognition. Then, I will describe their limitations and discuss several ways to improve them.
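As a toy illustration of learning visual representations without labels, here is a sketch of the classic rotation-prediction pretext task (one early SSL approach, used only as an example of the idea, not the methods in the talk). The tiny encoder and image sizes are placeholders: the network is trained to predict which of four rotations was applied to an unlabeled image, and the learned trunk can later be reused downstream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Rotation-prediction pretext task: rotate each unlabeled image by 0/90/180/270
# degrees and train the network to classify the rotation. The supervision
# signal is constructed from the data itself, with no human annotation.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
rotation_head = nn.Linear(64, 4)     # 4-way rotation classifier

images = torch.randn(16, 3, 32, 32)  # a batch of unlabeled images

# Build the self-supervised "labels" from the data itself.
rotated, labels = [], []
for k in range(4):
    rotated.append(torch.rot90(images, k, dims=(2, 3)))
    labels.append(torch.full((images.size(0),), k, dtype=torch.long))
x = torch.cat(rotated)
y = torch.cat(labels)

logits = rotation_head(encoder(x))
loss = F.cross_entropy(logits, y)
loss.backward()
print(loss.item())
```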

 

Bio

Kibok Lee is an assistant professor in the Department of Applied Statistics / Statistics and Data Science at Yonsei University. Previously, he worked as an Applied Scientist on the Amazon Web Services Rekognition Team (AWS AI). He received his Ph.D. from the Department of Computer Science and Engineering at the University of Michigan in 2020, advised by Prof. Honglak Lee. His research focuses on machine learning and computer vision, spanning deep representation learning, out-of-distribution detection, continual lifelong learning, and few-shot learning.