The future of Artificial Intelligence demands a paradigm shift towards multimodal perception, enabling systems to interpret and fuse information from diverse sensory inputs. While humans perceive the world by looking, listening, touching, smelling, and tasting, traditional forms of machine intelligence have primarily focused on a single sensory modality, often vision. To truly understand the world around us, AI must learn to jointly interpret multimodal signals. This graduate-level seminar course explores computer vision from a multimodal perspective, focusing on learning algorithms that augment vision with other essential modalities, such as audio, touch, language, and more. The majority of the course will consist of student presentations, experiments, and paper discussions, through which we will delve into the latest research and advancements in multimodal perception.