Human learning is inherently multimodal: we observe, listen, read, and communicate to understand and learn from our environment. Significant advances in the machine learning fields underlying these multimodal interactions, such as speech recognition and computer vision, have enabled computational modeling of this innate learning process, and multimodal commonsense reasoning over massive collections of web videos closely mirrors it. In this presentation, I will discuss my recent work on curating multimodal datasets and developing multimodal LLMs. Specifically, I will focus on foundation models that integrate training across diverse vision-and-language understanding tasks. I will then extend this work to multimodal commonsense reasoning, which requires not only perception but also explanation and communication grounded in video understanding. To this end, I will explore multimodal foundation models that incorporate self-judgment to improve video understanding and commonsense reasoning.
Bio:
Youngjae Yu is an Assistant Professor of Artificial Intelligence at Yonsei University, working on computer vision, natural language processing, and multimodal learning. Before joining Yonsei, he was a researcher at the Allen Institute for AI (AI2). He received his Ph.D. and B.S. in Computer Science and Engineering from Seoul National University. His research interests include video understanding and large language models, with a particular focus on large-scale video dataset curation for multimodal foundation models. His work has been recognized with the Best Paper Award at NAACL 2022 and Outstanding Paper Awards at EMNLP 2023 and ACL 2024.