This course covers the principles and key techniques for building and deploying large language models and multimodal foundation models (e.g., for images, audio, and video). Topics include Transformer architectures, multimodal representation learning, and diffusion-based generative models. The course also addresses post-training and alignment (e.g., supervised fine-tuning, reinforcement learning), grounding with retrieval and tools, rigorous evaluation, and efficient deployment.