This tutorial aims to build an understanding of the fundamentals of speech recognition. The speech recognition problem is defined from a Seq2Seq perspective. We survey the main deep-learning-based speech recognition techniques, focusing on DNN-WFST, CTC, RNN-T, Seq2Seq, and Transformer models; language modeling is covered with a focus on n-grams, and decoding networks with a focus on WFSTs.
Intended audience: those encountering the engineering formulation and algorithms of speech recognition for the first time. (A working understanding of MLPs and back-propagation is assumed; prior exposure to the basic concepts of RNNs, Seq2Seq, attention, and Transformers will help in following the lectures.)
Supervised learning with deep neural networks has brought phenomenal advances to many fields of research, but the performance of such systems relies heavily on the quality and quantity of annotated databases tailored to the particular application. It can be prohibitively difficult to manually collect and annotate databases for every task, and a plethora of data on the internet goes unused in machine learning for lack of such annotations. Self-supervised learning allows a model to learn representations using properties inherent in the data itself, such as natural co-occurrence. In this talk, I will introduce recent works on self-supervised learning of audio and speech representations. Recent work demonstrates that phonetic and semantic representations of audio and speech can be learnt from unlabelled audio and video. The learnt representations can be used for downstream tasks such as automatic speech recognition, speaker recognition, face recognition, and lip reading. Other noteworthy applications include localizing sound sources in images and separating simultaneous speech from video.
Representation learning is a fundamental task in machine learning, as the success of machine learning relies on the quality of representation. Over the last decade, visual representation learning has usually been done by supervised learning, taking deep convolutional neural network (CNN) features trained on a labeled large-scale image dataset. However, recent works in self-supervised learning (SSL) suggest that representation learning without labels is effective, often outperforming its supervised counterpart. In this talk, I will first summarize recent advances in self-supervised representation learning for visual recognition. Then, I will describe their limitations and discuss several ways to improve them.