Special Topics in Large Language Models

Tuesdays and Thursdays, 04:00 PM to 05:15 PM, Appleby Hall 3

Course Information

Course Description: This graduate-level special topics course examines emerging frontiers in large language models and their expanding roles across cognitive science, human-AI interaction, and the social sciences. Students will explore state-of-the-art research in areas such as cognitive architectures, reasoning and planning, compositionality, social cognition, and test-time scaling, as well as applications of language models in domains including law, medicine, journalism, and scientific discovery.

Each student or team will select a focused topic, conduct a comprehensive literature review, and lead a seminar-style lecture and discussion. The course culminates in a semester-long research or implementation project, presented as a final paper and an in-class presentation. Projects should emphasize novel failure modes, under-explored behaviors, or emerging risks rather than incremental performance gains.

Instructor: Dongyeop Kang (DK)

Meeting time: Tuesday and Thursday, 04:00 PM to 05:15 PM
Location: Appleby Hall 3
Homepage: dykang.github.io/classes/csci8980/S26/index.html
Canvas: canvas.umn.edu/courses/553602
Slack: csci8980s26.slack.com

Assessment

Student learning is assessed based on how effectively students identify, define, and solve research problems in their chosen topic.

  • Research Project (50%): originality, rigor, and clarity in identifying and addressing a novel research question
  • Topic Presentation (30%): analytical depth, synthesis of literature, and ability to communicate complex ideas effectively
  • Participation and Discussion (20%): quality of contributions to peer learning and critical engagement with readings

Participation and Discussion

Participation is evaluated based on the quality of contributions to seminar discussion and critical engagement with readings. Lecture recordings are provided to support class participation.

  • Come prepared to discuss the assigned readings
  • Ask precise questions and challenge assumptions constructively
  • Use Slack for clarifications and ongoing discussion

Course Schedule

We will cover a variety of machine-centered and human-centered topics. Based on the knowledge you gain during the class, your team will conduct a semester-long research project. Pay attention to the project due dates.

Date | Topic and Focus | Presenters | Reading for Presentation | Other Reading (Mandatory)
01-20 Class Overview
01-22 Roundtable Research Discussion
01-27 Test-Time Scaling and Self-Evolving Agents
Inference-time emergence, with a focus on dynamic compute allocation and self-improving agent systems
Subtopics: Test-time compute scaling laws; Adaptive inference and early exit; Self-evolving agents and organizations; Latency-cost-reliability tradeoffs
Research Directions: Dynamic inference controllers; Benchmarks for adaptive inference and agent self-improvement; When search beats learned reasoning; Failure modes of self-evolving agents
  • LLMs as Agents
  • A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence, 2025
  • A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive, ACL 2025
  • AgentEvolver: Towards Efficient Self-Evolving Agent System, 2025
01-29
  • Scaling Unverifiable Rewards: A Case Study on Visual Insights, 2025
  • s1: Simple test-time scaling
02-03 Expert AI and Workflow Modeling
How language models act in real domains over long time horizons with tools, memory, and complex environments
Subtopics: Vertical AI; Web agents with real traces; Multi-agent debate; Structured tool planning; Agent memory systems; Action- or process-level evaluation
Research Directions: Epistemic agent memory design; Process-based evaluation beyond task success; Reliability and safety of deployed agents
  • How Does Time Horizon Vary Across Domains? METR 2025
  • PaperBench: Evaluating AI's Ability to Replicate AI Research, OpenAI 2025
  • Virtuous Machines: Towards Artificial General Science 2025
  • A New Paradigm to Advance Physical Scientific Discovery, NeurIPS 2024
02-05
  • Towards an AI-Augmented Textbook, 2025
  • TheAgentCompany: Benchmarking LLM Agents on Realistic Professional Tasks, NeurIPS 2025
  • LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild, 2025
02-10 DK Office Hour with Project Ideas
02-12
02-17 Reasoning and Planning
How to enhance the reasoning capabilities of language models, and how fragile reasoning is under search and planning pressure
Subtopics: Chain-of-thought emergence and failure; Reasoning faithfulness; Decoding versus search; Planning as language generation; Reasoning-action tradeoffs
Research Directions: Taxonomies of reasoning failures; Defenses against reasoning manipulation; Search-augmented reasoning
  • Reasoning
  • Measuring Chain-of-Thought Faithfulness by Unlearning Reasoning Steps, EMNLP 2025
  • Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning, COLM 2025
02-19
  • Large Language Models Cannot Self Correct Reasoning Yet, ICLR 2024
02-24 Data
How data shapes model behavior long after training, including bias, collapse, and uncertainty
Subtopics: Pretraining mixture bias; Post-training data pathologies; Synthetic data feedback loops; Uncertainty calibration; Benchmark contamination
Research Directions: Adversarial dataset construction; Selective generation control; Detecting data-induced collapse; Synthetic workflow behavioral data
  • Evaluation+Data
  • Under the Surface: Tracking the Artifactuality of LLM Generated Data
02-26
  • Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation, EMNLP 2025
03-03 Evaluation
Rethinking evaluation once benchmarks saturate under realistic and dynamic conditions
Subtopics: Static versus dynamic benchmarks; LLM-as-judge bias; Distribution shift; Meta-evaluation; Benchmarking real-world, long-term tasks; Benchmark saturation
Research Directions: Dynamic benchmark design; Human versus machine evaluator disagreement; Long-horizon agent evaluation
  • Measuring the performance of our models on real world tasks, OpenAI 2025
  • Benchmarking Cognitive Biases in Large Language Models as Evaluators, ACL 2024
  • Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index, EMNLP 2025
03-05
  • The BiGGen Bench: A Principled Benchmark for Fine-Grained Evaluation of Language Models with Language Models, NAACL 2025
  • xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations, 2025
  • Humanity's Last Exam, 2025
  • EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees, COLM 2025
  • tinyBenchmarks: evaluating LLMs with fewer examples, ICML 2024
  • Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards? ACL 2021
03-10 Spring Break
03-12 Spring Break
03-17 Midterm Project Presentation
03-19 Midterm Project Presentation
03-24 No Class
Conference travel
03-26 No Class
Conference travel
03-31 Cognition of LLMs
Whether language models exhibit cognition-like properties analogous to human cognition
Subtopics: Belief formation under context; Narrative understanding; Self-improving reasoning behaviors; Foundations of intelligence
Research Directions: Human-like intelligence benchmarks; Human versus AI cognition comparison; Augmenting models with human cognition
  • Artificial Cognition
  • Accumulating Context Changes the Beliefs of Language Models, 2025
  • Do Language Models Agree with Human Perceptions of Suspense in Stories? COLM 2025
04-02
  • Strong Memory, Weak Control: An Empirical Study of Executive Functioning in LLMs, EACL 2026
  • Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation, 2025
04-07 LLMs in the World: Society and Pluralism
Interaction of language models with human values, institutions, cultures, and laws
Subtopics: Pluralistic alignment; Reward hacking; Cultural preference drift; Legal accountability; Philosophy of mind
Research Directions: Cross-cultural alignment evaluation; Legal risk analysis; Dynamic multi-value alignment
  • Alignment
  • Position: A Roadmap to Pluralistic Alignment, 2024
  • Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values? EMNLP 2025
04-09
  • Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs, ACL 2025
  • Beyond Demographics: Aligning Role-playing LLM-based Agents Using Human Belief Networks, EMNLP 2024
04-14 Human AI Collaboration
How language models reshape collaboration, creativity, productivity, and human cognition
Subtopics: Multi-agent coordination; Side effects of AI reliance; AI-mediated deliberation; Productivity tools; Cognitive effects of AI use
Research Directions: Empirical studies of collaboration; Agent role and coordination design
  • Human-AI Collaboration
  • Security Challenges in AI Agent Deployment: Insights from a Large-Scale Public Competition, 2025
  • Personalization of Large Language Models: A Survey, TMLR 2025
04-16
  • AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text
04-21 Beyond Transformers
Whether transformers are a bottleneck and what comes next architecturally
Subtopics: Recursive and recurrent computation; Liquid networks; State-space hybrids; Memory-augmented models; Mixture-of-experts routing failures; Diffusion language models
Research Directions: Routing collapse analysis; Long-context benchmarks beyond attention
  • Transformers
  • Pretraining and Scaling
  • Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free, 2025
  • Repeat After Me: Transformers are Better than State Space Models at Copying, ICML 2024
  • Overcoming Long Context Limitations of State Space Models, ICML 2025
04-23
  • Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training, NeurIPS 2025
  • Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions, ICLR 2025
  • Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality, ICML 2024
04-28 Final Project Presentation
04-30 Final Project Presentation

Reading

Topic Presentation (30-minute talk and 15-minute discussion)

Reading for Presentation is organized around topic blocks, with four papers assigned per block and typically two papers discussed in each class session. Each student is required to present two to three papers over the course of the term. Students may indicate their topic and paper preferences using the provided interest form. Presenters should select papers across perspectives, with at least one paper from the human-centered track and one from the machine-centered track.

  • In-depth survey and lecture: Each presenting student prepares a structured synthesis of the topic and leads the class through a lecture-style presentation and discussion.
  • Required components: Presentations must include a curated reading list, a summary of key technical ideas and takeaways, and a discussion of limitations or open questions where feasible.
  • Lecture format: Sessions follow a seminar-style format, with presenters responsible for moderating discussion and encouraging critical engagement from the class.
  • Bonus points: Fruitful discussions and insightful questions during presentations may be rewarded with bonus participation points at the instructor's discretion.

Class Projects

The course includes a semester-long research or implementation project. Projects should go beyond surface-level performance gains and instead focus on novel failure modes, under-explored behaviors, or emerging risks of large language models or agentic systems. Incremental leaderboard gains or prompt-only tweaks are discouraged. Projects may be completed individually or in small teams of at most two members. Each project must be confirmed by DK during the DK Office Hour week (Feb 10 and 12) and grounded in a clear research question, supported by relevant literature and empirical analysis.

Project Scope

Students are encouraged to produce one of the following research artifacts:

  • A benchmark or evaluation suite
  • A dataset capturing non-trivial behaviors or failure modes
  • A measurement or diagnostic framework
  • A failure taxonomy or behavioral analysis
  • A mitigation, repair, or control algorithm

Deliverables

Each team must submit a written report, reproducible code or data, and presentation materials via Canvas. Reports should follow a standard conference paper format (ACL style preferred), using templates available via Overleaf or GitHub.

Milestones and Timeline

The following milestones are aligned with the course schedule. Canvas submission links will be provided.

  • Team formation and project idea. Due: early February
  • Project proposal (problem statement, related work, plan). Due: mid-February (Canvas link TBD)
  • Midterm presentation. In class: March 17 and 19
  • Final presentation. In class: April 28 and 30
  • Final report and code submission. Due: May 8 (Friday) (Canvas link TBD)

Projects are expected to be reproducible, clearly scoped, and analytically grounded. Strong projects typically combine careful problem formulation with diagnostic evaluation, stress testing, or controlled ablation studies. Evaluation emphasizes originality, rigor, and insight into model behavior under realistic, adversarial, or long-horizon conditions rather than raw performance.

See the evaluation rubric for final reports.

Selected Past Projects

Reference examples. Links will be added when available.

  • Simulating Everyone's Voice: Exploring ChatGPT's Ability to Simulate Human Annotators (report and poster TBD)
  • Vision and Language guided Generalized Object Grasping (report and poster TBD)
  • Generating Controllable Long dialogue with Coherence (Published in AAAI 2024, link TBD)
  • Understanding Narrative Transportation in Fantasy Fanfiction (ACL Workshop on Narrative Understanding, link TBD)