Special Topics in Large Language Models

Tuesdays and Thursdays, 04:00 PM to 05:15 PM, Appleby Hall 3

Course Information

Course Description: This graduate-level special topics course examines emerging frontiers in large language models and their expanding roles across cognitive science, human-AI interaction, and the social sciences. Students will explore state-of-the-art research in areas such as cognitive architectures, reasoning and planning, compositionality, social cognition, and test-time scaling, as well as applications of language models in domains including law, medicine, journalism, and scientific discovery.

Each student or team will select a focused topic, conduct a comprehensive literature review, and lead a seminar-style lecture and discussion. The course culminates in a semester-long research or implementation project, presented as a final paper and an in-class presentation. Projects should emphasize novel failure modes, under-explored behaviors, or emerging risks rather than incremental performance gains.

Instructor: Dongyeop Kang (DK)

Meeting time: Tuesday and Thursday, 04:00 PM to 05:15 PM
Location: Appleby Hall 3
Homepage: dykang.github.io/classes/csci8980/S26/index.html
Canvas: canvas.umn.edu/courses/553602
Slack: csci8980s26.slack.com

Assessment

Student learning is assessed based on how effectively students identify, define, and solve research problems in their chosen topic.

  • Research Project (50%): originality, rigor, and clarity in identifying and addressing a novel research question
  • Topic Presentation (30%): analytical depth, synthesis of literature, and ability to communicate complex ideas effectively
  • Participation and Discussion (20%): quality of contributions to peer learning and critical engagement with readings

Participation and Discussion

Participation is evaluated based on the quality of contributions to seminar discussion and critical engagement with readings. Lecture recordings are provided to support class participation.

  • Come prepared to discuss the assigned readings
  • Ask precise questions and challenge assumptions constructively
  • Use Slack for clarifications and ongoing discussion

Course Schedule

We will cover a variety of machine-centered and human-centered topics. Based on the knowledge you gain during the class, your team will conduct a semester-long research project. Pay attention to the project due dates.

Date | Topic and Focus | Presenters | Reading for Presentation | Other Reading (Mandatory)
01-20 Class Overview
01-22 Roundtable Research Discussion
01-27 Test-Time Scaling and Self-Evolving Agents
Inference-time emergence, with a focus on dynamic compute allocation and self-improving agent systems
Subtopics: Test-time compute scaling laws; Adaptive inference and early exit; Self-evolving agents and organizations; Latency-cost-reliability tradeoffs
Research Directions: Dynamic inference controllers; Benchmarks for adaptive inference and agent self-improvement; When search beats learned reasoning; Failure modes of self-evolving agents
  • LLMs as Agents
  • A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence, 2025
  • A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive, ACL 2025
  • AgentEvolver: Towards Efficient Self-Evolving Agent System, 2025
01-29
  • Scaling Unverifiable Rewards: A Case Study on Visual Insights, 2025
  • s1: Simple test-time scaling
02-03 Expert AI and Workflow Modeling
How language models act in real domains over long time horizons with tools, memory, and complex environments
Subtopics: Vertical AI; Web agents with real traces; Multi-agent debate; Structured tool planning; Agent memory systems; Action- or process-level evaluation
Research Directions: Epistemic agent memory design; Process-based evaluation beyond task success; Reliability and safety of deployed agents
  • How Does Time Horizon Vary Across Domains? METR 2025
  • PaperBench: Evaluating AI's Ability to Replicate AI Research, OpenAI 2025
  • Virtuous Machines: Towards Artificial General Science 2025
  • A New Paradigm to Advance Physical Scientific Discovery, NeurIPS 2024
02-05
  • Towards an AI-Augmented Textbook, 2025
  • TheAgentCompany: Benchmarking LLM Agents on Realistic Professional Tasks, NeurIPS 2025
  • LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild, 2025
02-10 DK Office Hour with Project Ideas
02-12
02-17 Reasoning and Planning
How to enhance the reasoning capabilities of language models, and how fragile reasoning is under search and planning pressure
Subtopics: Chain-of-thought emergence and failure; Reasoning faithfulness; Decoding versus search; Planning as language generation; Reasoning-action tradeoffs
Research Directions: Taxonomies of reasoning failures; Defenses against reasoning manipulation; Search-augmented reasoning
  • Reasoning
  • Measuring Chain-of-Thought Faithfulness by Unlearning Reasoning Steps, EMNLP 2025
  • Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning, COLM 2025
02-19
  • Large Language Models Cannot Self Correct Reasoning Yet, ICLR 2024
02-24 Data
How data shapes model behavior long after training, including bias, collapse, and uncertainty
Subtopics: Pretraining mixture bias; Post-training data pathologies; Synthetic data feedback loops; Uncertainty calibration; Benchmark contamination
Research Directions: Adversarial dataset construction; Selective generation control; Detecting data-induced collapse; Synthetic workflow behavioral data
  • Evaluation+Data
  • Under the Surface: Tracking the Artifactuality of LLM Generated Data
02-26
  • Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation, EMNLP 2025
03-03 Evaluation
Rethinking evaluation once benchmarks saturate under realistic and dynamic conditions
Subtopics: Static versus dynamic benchmarks; LLM-as-judge bias; Distribution shift; Meta-evaluation; Benchmarking real-world, long-term tasks; Benchmark saturation
Research Directions: Dynamic benchmark design; Human versus machine evaluator disagreement; Long-horizon agent evaluation
  • Measuring the performance of our models on real world tasks, OpenAI 2025
  • Benchmarking Cognitive Biases in Large Language Models as Evaluators, ACL 2024
  • Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index, EMNLP 2025
03-05
  • The BiGGen Bench: A Principled Benchmark for Fine-Grained Evaluation of Language Models with Language Models, NAACL 2025
  • xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations, 2025
  • Humanity's Last Exam, 2025
  • EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees, COLM 2025
  • tinyBenchmarks: evaluating LLMs with fewer examples, ICML 2024
  • Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards? ACL 2021
03-10 Spring Break
03-12 Spring Break
03-17 Midterm Project Presentation
03-19 Midterm Project Presentation
03-24 No Class
Conference travel
03-26 No Class
Conference travel
03-31 Cognition of LLMs
Whether language models exhibit cognition-like properties analogous to human cognition
Subtopics: Belief formation under context; Narrative understanding; Self-improving reasoning behaviors; Foundations of intelligence
Research Directions: Human-like intelligence benchmarks; Human versus AI cognition comparison; Augmenting models with human cognition
  • Artificial Cognition
  • Accumulating Context Changes the Beliefs of Language Models, 2025
  • Do Language Models Agree with Human Perceptions of Suspense in Stories? COLM 2025
04-02
  • Strong Memory, Weak Control: An Empirical Study of Executive Functioning in LLMs, EACL 2026
  • Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation, 2025
04-07 LLMs in the World: Society and Pluralism
Interaction of language models with human values, institutions, cultures, and laws
Subtopics: Pluralistic alignment; Reward hacking; Cultural preference drift; Legal accountability; Philosophy of mind
Research Directions: Cross-cultural alignment evaluation; Legal risk analysis; Dynamic multi-value alignment
  • Alignment
  • Position: A Roadmap to Pluralistic Alignment, 2024
  • Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values? EMNLP 2025
04-09
  • Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs, ACL 2025
  • Beyond Demographics: Aligning Role-playing LLM-based Agents Using Human Belief Networks, EMNLP 2024
04-14 Human AI Collaboration
How language models reshape collaboration, creativity, productivity, and human cognition
Subtopics: Multi-agent coordination; Side effects of AI reliance; AI-mediated deliberation; Productivity tools; Cognitive effects of AI use
Research Directions: Empirical studies of collaboration; Agent role and coordination design
  • Human-AI Collaboration
  • Security Challenges in AI Agent Deployment: Insights from a Large-Scale Public Competition, 2025
  • Personalization of Large Language Models: A Survey, TMLR 2025
04-16
  • AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text
04-21 Beyond Transformers
Whether transformers are a bottleneck and what comes next architecturally
Subtopics: Recursive and recurrent computation; Liquid networks; State-space hybrids; Memory-augmented models; Mixture-of-experts routing failures; Diffusion language models
Research Directions: Routing collapse analysis; Long-context benchmarks beyond attention
  • Transformers
  • Pretraining and Scaling
  • Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free, 2025
  • Repeat After Me: Transformers are Better than State Space Models at Copying, ICML 2024
  • Overcoming Long Context Limitations of State Space Models, ICML 2025
04-23
  • Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training, NeurIPS 2025
  • Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions, ICLR 2025
  • Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality, ICML 2024
04-28 Final Project Presentation
04-30 Final Project Presentation

Reading

Topic Presentation (30-minute talk and 15-minute discussion)

Reading for Presentation is organized around topic blocks, with four papers assigned per block and typically two papers discussed in each class session. Each student is required to present two to three papers over the course of the term. Students may indicate their topic and paper preferences using the provided interest form. Presenters should select papers across perspectives, with at least one paper from the human-centered track and one from the machine-centered track.

  • In-depth survey and lecture: Each presenting student prepares a structured synthesis of the topic and leads the class through a lecture-style presentation and discussion.
  • Required components: Presentations must include a curated reading list, a summary of key technical ideas and takeaways, and a discussion of limitations or open questions where feasible.
  • Lecture format: Sessions follow a seminar-style format, with presenters responsible for moderating discussion and encouraging critical engagement from the class.
  • Bonus points: Fruitful discussions and insightful questions during presentations may be rewarded with bonus participation points at the instructor's discretion.

Class Projects

The course includes a semester-long research or implementation project. Projects should go beyond surface-level performance gains and instead focus on novel failure modes, under-explored behaviors, or emerging risks of large language models or agentic systems. Incremental leaderboard gains or prompt-only tweaks are discouraged. Projects may be completed individually or in small teams of at most two members. Each project must be confirmed by DK during the DK Office Hour week (Feb 10 and 12) and grounded in a clear research question, supported by relevant literature and empirical analysis.

Project Scope

Students are encouraged to produce one of the following research artifacts:

  • A benchmark or evaluation suite
  • A dataset capturing non-trivial behaviors or failure modes
  • A measurement or diagnostic framework
  • A failure taxonomy or behavioral analysis
  • A mitigation, repair, or control algorithm

Deliverables

Each team must submit a written report, reproducible code or data, and presentation materials via Canvas. Reports should follow a standard conference paper format (ACL style preferred), using templates available via Overleaf or GitHub.

Milestones and Timeline

The following milestones are aligned with the course schedule. Canvas submission links will be provided.

  • Team formation and project idea. Due: early February
  • Project proposal (problem statement, related work, plan). Due: mid-February (Canvas link TBD)
  • Midterm presentation. In class: March 17 and 19
  • Final presentation. In class: April 28 and 30
  • Final report and code submission. Due: May 8 (Friday) (Canvas link TBD)

Projects are expected to be reproducible, clearly scoped, and analytically grounded. Strong projects typically combine careful problem formulation with diagnostic evaluation, stress testing, or controlled ablation studies. Evaluation emphasizes originality, rigor, and insight into model behavior under realistic, adversarial, or long-horizon conditions rather than raw performance.

See the evaluation rubric for final reports.

Selected Past Projects

Reference examples. Links will be added when available.

  • Simulating Everyone's Voice: Exploring ChatGPT's Ability to Simulate Human Annotators (report and poster TBD)
  • Vision and Language guided Generalized Object Grasping (report and poster TBD)
  • Generating Controllable Long dialogue with Coherence (Published in AAAI 2024, link TBD)
  • Understanding Narrative Transportation in Fantasy Fanfiction (ACL Workshop on Narrative Understanding, link TBD)