Restriction: Must be in the Computer Science Master's or Doctoral program; or permission of instructor.
This course focuses on state-of-the-art methods for interpreting neural language models and understanding their learned behaviors. We will discuss approaches centered both on understanding models' internal mechanisms and representations and on attributing behaviors back to the training data. Behaviors of interest include hallucination, factuality, memorization, and explanation/reasoning elicitation. If time allows, we will discuss recent developments in ameliorating learned behaviors, such as model editing, unlearning, and steering. This is primarily a seminar course focused on paper readings and presentations.