CSCI 5541 NLP

Spring 2024, Tuesday and Thursday, 2:30PM - 3:45PM, Akerman Hall 319


Course Information


Summary The purpose of this course is to provide an overview of the computational techniques developed to enable computers to interpret and respond appropriately to ideas expressed using natural languages, rather than formal languages such as C++ or Python. This course will cover text classification, distributional representation methods of language, large language models, and advanced techniques in ChatGPT. The course will cover a wide range of topics related to NLP, including theories, computational models, and applications with their societal and ethical impacts. Prerequisite: Maturity in linear algebra, calculus, and basic probability. Familiarity with Python. 5521 (recommended) or grad.

Natural Language Processing (NLP) is an interdisciplinary field that is based on theories in linguistics, cognitive science, and social science. The main focus of NLP is building computational models for applications such as machine translation and dialogue systems that can then interact with real users. Research and development in NLP therefore also includes considering important issues related to real-world AI systems, such as bias, controllability, interpretability, and ethics. This course will cover a broad range of topics related to NLP, from theories to computational models and applications to data annotation and evaluation. Students will read papers on those topics, create an annotated dataset, and implement algorithms on applications they are interested in. There will be a semester-long class project where you collect your own dataset, ensure it is accurate, develop a model using existing computing tools, evaluate the system, and consider its ethical and societal impacts.

Grades will be based on the course project, participation, and programming and reading assignments. All class material will be posted on the class site. We will use Canvas for homework and project submissions and grading, and Slack for discussion and Q&A. Email inquiries will not be answered.

Instructors

Class meets
Tuesday and Thursday, 2:30PM to 3:45PM, Akerman Hall 319
Office hours
DK: Friday, 4pm - 4:30pm in Shepherd 259
Karin: Monday, 3:30pm - 4pm via Zoom
Zae: Wednesday, 3pm - 3:30pm via Zoom

Class page
https://dykang.github.io/classes/csci5541/S24
Slack
csci5541s24.slack.com/
Canvas
canvas.umn.edu/courses/413172

Grading and Late Policy

Grading

Late policy for deliverables

Each student is granted 2 late days to use on homework over the duration of the semester. After all free late days are used up, the penalty is 1 point for each additional late day. Late days and penalties apply to all team members for group homework and the project.
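As a rough illustration of the late policy above, the arithmetic can be sketched as follows. The function name `late_penalty` and the assumption that free late days are consumed automatically before any point penalty accrues are illustrative readings of the policy, not the official grading procedure.

```python
FREE_LATE_DAYS = 2    # from the policy: 2 free late days per semester
PENALTY_PER_DAY = 1   # from the policy: 1 point per additional late day


def late_penalty(days_late, free_days_left=FREE_LATE_DAYS):
    """Return (points_deducted, free_days_remaining) for one submission.

    Assumption (not stated in the policy): free late days are used up
    automatically before any point penalty is applied.
    """
    if days_late < 0 or free_days_left < 0:
        raise ValueError("day counts must be non-negative")
    used = min(days_late, free_days_left)   # free days spent on this submission
    extra = days_late - used                # late days beyond the free allowance
    return extra * PENALTY_PER_DAY, free_days_left - used
```

For example, a submission that is 3 days late with both free days still available would use both free days and lose 1 point under this reading.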

Schedule

We will cover basic NLP representations f(x) to build text classifiers P(y|f(x)) , language models P(f(x)), and large language models P(f(x)). Based on knowledge you gain during the class, your team will develop your own NLP systems during the semester-long project. Pay attention to due dates and homework/reading assignment release. Lecture slides and homework/project description will be available in .
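To make the notation above concrete, here is a minimal sketch of a representation f(x), a text classifier P(y|f(x)), and a language model P(f(x)). It assumes a bag-of-words representation, a naive Bayes classifier, and a unigram language model with add-one smoothing; these choices and the toy training data are invented for illustration and are not necessarily the models used in lectures or homework.

```python
import math
from collections import Counter


def f(text):
    """Representation f(x): a bag-of-words count vector."""
    return Counter(text.lower().split())


# Hypothetical toy training data (label, text), invented for illustration.
TRAIN = [
    ("pos", "great movie great acting"),
    ("pos", "loved the movie"),
    ("neg", "boring movie"),
    ("neg", "terrible acting boring plot"),
]

VOCAB = {w for _, text in TRAIN for w in f(text)}


def train_counts(train):
    """Collect per-label word counts and label counts."""
    word_counts = {}
    label_counts = Counter()
    for label, text in train:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(f(text))
    return word_counts, label_counts


WORD_COUNTS, LABEL_COUNTS = train_counts(TRAIN)


def log_likelihood(features, counts):
    """log-probability of the words in f(x) under a smoothed unigram model."""
    total = sum(counts.values()) + len(VOCAB) + 1  # +1 slot for unseen words
    return sum(n * math.log((counts[w] + 1) / total) for w, n in features.items())


def classify(text):
    """Text classifier: P(y | f(x)) via naive Bayes, normalized over labels."""
    feats = f(text)
    n = sum(LABEL_COUNTS.values())
    scores = {
        y: math.log(LABEL_COUNTS[y] / n) + log_likelihood(feats, WORD_COUNTS[y])
        for y in LABEL_COUNTS
    }
    z = max(scores.values())  # subtract the max for numerical stability
    exp = {y: math.exp(s - z) for y, s in scores.items()}
    norm = sum(exp.values())
    return {y: v / norm for y, v in exp.items()}


def lm_log_prob(text):
    """Language model: log P(f(x)) under a unigram model of all training text."""
    all_counts = Counter()
    for c in WORD_COUNTS.values():
        all_counts.update(c)
    return log_likelihood(f(text), all_counts)
```

Calling `classify("great movie")` returns a dictionary of label probabilities that sums to 1, and `lm_log_prob("great movie")` returns a log-probability that does not depend on any label.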

Date Lectures and Dues Readings
Jan 16 Class Overview
Jan 18 Intro to NLP
HW0 out
Jan 23 Text Classification (1)
Tutorial on Scikit-Learn (1) (Zae)
Jan 25 Text Classification (2)
Tutorial on PyTorch (2) (Zae)
HW0 due
Jan 30 Tutorial on Finetuning (Karin)
Tutorial on HuggingFace (Karin)
HW1 out
Feb 1 Lexical Semantics
Project description out
Feb 6 Distributional Semantics and Word Vectors
Feb 8 Language Models (1): Ngram LM, Neural LM
HW1 due
HW2 out
Feb 13 Language Models (2): RNNs, LSTMs and Sequence-to-Sequence
Feb 15 Project Guideline
Team formation due
Feb 20 No Class (AAAI)
Feb 27 Language Models (4): Evaluation and Applications
HW3 out
Feb 29 Contextualized Word Embeddings
Project brainstorming due
Mar 5 No Class (Spring Break)
Mar 7 No Class (Spring Break)
Mar 12 Project Proposal Pitch (1)
Slides Deck for Group A
Group A:
  • Lexical decryptors (Sandeep Bhuiya, Mudit Jantwal, Hemanth Kumar Tirupati, Shreya Yashodhar) (Mentor: DK)
  • NLPros (Benjamin Kosieradzki, Mitchell Kosieradzki) (Mentor: Karin)
  • pnlp fiction (Rakshithaa Kanakarajan Selvarathinam, Vishal Kancharla, Sri Krishna Vamsi Koneru, Devansh Mishra) (Mentor: Karin)
  • Syntax Errors (Clayton Carlson, Ryan Diaz, Charlie Rapheal, Sanjali Roy) (Mentor: DK)
  • Tattered-animals (Zheng Robert Jia, Brandon Nee, Andrei Solodin, Jack Swanberg) (Mentor: Zae)
  • WordWizards (Morgan Bozeman, Tyler Cook, Connor Holm, Derek Wong) (Mentor: Karin)
  • EdgeCaseWizards (Apekshik Panigrahi, Anna Terzian) (Mentor: DK)
  • Team bRockoLee (Jithendra Jagannatha Kagathi, Arjun Thonoor, Anirudh Vasudevan, Ya-Hui Yang) (Mentor: Zae)
  • Team SOTA (Evan Way, Jerry Yin, Zaifu Zhan, Zhongxing Zhang) (Mentor: DK)
Mar 14 Project Proposal Pitch (2)
Slides Deck for Group B
Group B:
  • Caught with N-grams (Michael Bronstein, Yichen Li, Lavanya Radhakrishnan, Wuhao Zhang) (Mentor: DK)
  • Cybertron (Gehna Jain, Ryan Langman, Trae Primm, Swapnil Puranik) (Mentor: Zae)
  • Fury GPT (Nikil Krishnakumar, Sujeendra Ramesh, Rammesh Adhav Saravanan) (Mentor: DK)
  • NLP Ninjas (Nirshal Chandra Sekar, Byeongchan Jeong, Fidan Mahmudova, Sheshasai Sairam) (Mentor: Karin)
  • SpotHRI (Adam Imdieke) (Mentor: Karin)
  • Transformers (Ritwick Banerjee, Madhan Mohan, Leyan Sayeh, Masha Volkova) (Mentor: Zae)
  • NLPitch (Dhondup Dolma, Jaeeun Lee, Yongtian Ou, Jiyoon Pyo) (Mentor: Zae)
  • PRWZ (Zaccheri Ciampone, Wyatt Kormick, Raymond Lyon, Preston Zhu) (Mentor: Karin)
  • Too Long; Didn’t Read (Ryan Johnsen, Dylan Paulson, Logan Schaaf, Tony Zhang) (Mentor: Zae)
Mar 21 Transformers (1)
Project proposal due
Mar 26 Transformers (2)
HW3 due
Mar 28 Pretraining and Scaling Laws
RA3 out
Apr 2 Prompting
HW4 out
Apr 4 Instructing and augmenting LLMs (Zae)
Apr 9 Ethics and Safety (Karin)
Project midterm office-hour due
Apr 11 Compute efficiency and engineering (James)
Apr 16 All about Data and Annotation
HW4 due (April 17)
Apr 18 Human-centric NLP
Concluding Remark
  • TBD
Apr 23 Final Project Poster (1)
Project poster due
Posters for Group B
  • Caught with N-grams (Michael Bronstein, Yichen Li, Lavanya Radhakrishnan, Wuhao Zhang) Automated Detection and Refutation of Climate-related Misinformation
  • Cybertron (Gehna Jain, Ryan Langman, Trae Primm, Swapnil Puranik) Evaluation of Knowledge Graphs in LLMs
  • Fury GPT (Nikil Krishnakumar, Sujeendra Ramesh, Rammesh Adhav Saravanan) User Friendly Drone Control: Fine-Tuning Language Models with ROS Commands for Real-World Application
  • NLP Ninjas (Nirshal Chandra Sekar, Byeongchan Jeong, Fidan Mahmudova, Sheshasai Sairam) Human - Robot Interaction using LLM
  • SpotHRI (Adam Imdieke) A Language Interface for the Spot Robot
  • Transformers (Ritwick Banerjee, Madhan Mohan, Leyan Sayeh, Masha Volkova) Disease Diagnosis using LLM
  • NLPitch (Dhondup Dolma, Jaeeun Lee, Yongtian Ou, Jiyoon Pyo) Transidiomation: Optimizing translation of idioms embedded in text
  • PRWZ (Zaccheri Ciampone, Wyatt Kormick, Raymond Lyon, Preston Zhu) Research Paper Simplification
  • Too Long; Didn’t Read (Ryan Johnsen, Dylan Paulson, Logan Schaaf, Tony Zhang) Cuisine Fusion Recipe Generator
Apr 25 Final Project Poster (2)
Project final report due (May 3, Friday)
Posters for Group A
  • Lexical decryptors (Sandeep Bhuiya, Mudit Jantwal, Hemanth Kumar Tirupati, Shreya Yashodhar) Scientific Text Simplification
  • NLPros (Benjamin Kosieradzki, Mitchell Kosieradzki) Evaluating the Boundary between In-Context Learning and Fine-Tuning
  • pnlp fiction (Rakshithaa Kanakarajan Selvarathinam, Vishal Kancharla, Sri Krishna Vamsi Koneru, Devansh Mishra) Identifying Bias in LLMs when using LRLs
  • Syntax Errors (Clayton Carlson, Ryan Diaz, Charlie Rapheal, Sanjali Roy) Leveraging Language Models for Temporal Political Bias Analysis
  • Tattered-animals (Zheng Robert Jia, Brandon Nee, Andrei Solodin, Jack Swanberg) LLM Prompt Recovery
  • WordWizards (Morgan Bozeman, Tyler Cook, Connor Holm, Derek Wong) Performance of LLMs in Various Styles
  • EdgeCaseWizards (Apekshik Panigrahi, Anna Terzian) Terraform - Automating Infrastructure as a Service
  • Team bRockoLee (Jithendra Jagannatha Kagathi, Arjun Thonoor, Anirudh Vasudevan, Ya-Hui Yang) Transpilation
  • Team SOTA (Evan Way, Jerry Yin, Zaifu Zhan, Zhongxing Zhang) PaperHelper: Knowledge-Based LLM QA Paper Reading Assistant

Homework Details (50%)

All questions regarding homework MUST be directed to the lead TA over the Slack homework channels (e.g., #hw1, #hw2) or in person during their office hours. Homework 1 and 2 should be done individually, while homework 3 and 4 are team-based (maximum of 4 people). Your team for homework 3 and 4 will be the same as your project team.

The use of outside resources (books, research papers, websites, etc.) or collaboration (students, professors, ChatGPT, etc.) must be explicitly acknowledged in your report. Check out the notes on academic integrity.

The deadline for all homework is Friday midnight (11:59PM) of the due date. Because our schedule is quite tight, there will be no deadline extensions, but you can still use your late days. For late team homework (HW3, HW4), late days will be used for every team member. Check out the homework descriptions and links to Canvas for submission:

  • HW0: Building MLP-based text classifier with PyTorch (0 points, Individual, due: Jan 26)
  • HW1: Finetuning text classifier using HuggingFace (15 points, Individual, due: Feb 9)
  • HW2: Authorship attribution using n-gram language models (LMs) (15 points, Individual, due: Feb 23)
  • HW3: Generating and evaluating text from pretrained LMs (10 points, Team, due: Mar 10)
  • HW4: Prompting with large language models (LLMs) (10 points, Team, due: Apr 17)

Project Details (30%)

First, carefully read the project description, as most project information, due dates, rubrics, and answers to your questions are in the description document. It is your responsibility not to miss any information regarding the project.

Your team (maximum of 4 people) should submit its report, a link to the code (or zipped code), and presentation slides/poster to Canvas before the deadline. Use official ACL style templates (Overleaf or links). Here are the project deliverables you have to submit (note that some due dates fall on weekdays):

  • Team formation (1 point, due: Feb 16)
  • Project brainstorming (1 point, due: Mar 1)
  • Proposal pitch (3 points, due: Mar 12 and 14) (Slides decks for Group A and Group B)
  • Proposal report (5 points, due: Mar 19)
  • Midterm office hour participation (5 points, due: Apr 5)
  • Poster presentation (5 points, due: Apr 23 and 25)
  • Final report (10 points, due: May 3) (evaluation rubric)

You can find some selected project reports and posters from previous years' NLP classes below. Some projects were extended and published at top-tier workshops and conferences:

  • [CSCI 5541 F23] Title Generation for Fictional Stories
  • [CSCI 5541 S23] Simulating Everyone's Voice: Exploring ChatGPTs Ability to Simulate Human Annotators
  • [CSCI 5541 S23] Vision & Language-guided Generalized Object Grasping
  • [CSCI 5541 S23] Generalizability of FLAN-T5 Model Using Composite Task Prompting
  • [CSCI 5541 S23] Comparing the Effectiveness of Fine-tuning vs. One-Shot Learning on the Kidz Bopification Task
  • [CSCI 5980 F22] Generating Controllable Long-dialogue with Coherence (published in AAAI 2024)
  • [CSCI 8980 S22] Understanding Narrative Transportation in Fantasy Fanfiction (published in Workshop on Narrative Understanding (WNU) @ACL 2023)

Reading Details (15%)

For each reading assignment, you will choose one paper from the reading list in the lectures before the deadline and submit a short (1-page) summary to Canvas, including the following information:

  • Paper title
  • An overview of the paper with novel contributions and major findings
  • Weakness of the proposed method
  • Ideas for potential improvements and general thoughts
For reading assignment #3, you should submit an essay on one of these questions.

The deadlines are as follows:

  • Reading assignment #1 (5 points, due: Feb 16)
  • Reading assignment #2 (5 points, due: Mar 22)
  • Reading assignment #3 (5 points, due: Apr 19)

Class Participation (5%)

Your class participation is thoroughly evaluated. Put your profile picture on Canvas and Slack so we can match you for the final evaluation. The following metrics will be used to grade your participation:
  • Participation and discussion in class
  • Discussion on Slack and during office hours with both the instructor and TAs
  • Discussion and QA during the presentation of the project proposal and poster
We explicitly count your offline and online contributions and (min/max) normalize the counts at the end of the class. Your participation score will be zero if you haven't participated in class, on Slack, or in other discussions.
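As a sketch of what min/max normalization of participation counts could look like, the snippet below rescales raw counts to the range [0, 1] using (x - min) / (max - min). The function name, the student names, the counts, and the handling of the case where everyone has the same count are all assumptions for illustration; the actual grading procedure may differ.

```python
def min_max_normalize(counts):
    """Rescale raw participation counts to [0, 1] via (x - min) / (max - min)."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # Assumption: with identical counts, give 0.0 if nobody participated
        # at all and 1.0 otherwise (the syllabus does not cover this case).
        flat = 0.0 if hi == 0 else 1.0
        return {name: flat for name in counts}
    return {name: (c - lo) / (hi - lo) for name, c in counts.items()}


# Hypothetical students and counts, invented for illustration.
raw = {"alice": 12, "bob": 3, "carol": 0}
scores = min_max_normalize(raw)
```

Here the most active student maps to 1.0 and a student with no recorded participation maps to 0.0, consistent with the zero-score rule above.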

Prerequisites

Required: CSCI 2041 Advanced Programming Principles

Recommended: CSCI 5521 Introduction to Machine Learning or any other course that covers fundamental machine learning algorithms.

Furthermore, this course assumes:

  • Good coding ability, corresponding to at least a third or fourth-year undergraduate CS major. Assignments will be in Python.
  • Background in basic probability, linear algebra, and calculus.

Notes to students

Academic Integrity

Assignments and project reports for the class must represent individual effort unless group work is explicitly allowed. Verbal collaboration on your assignments or class projects with your classmates and instructor is acceptable. However, everything you turn in must be your own work, and you must note the names of anyone you collaborated with on each problem and cite the resources that you used to learn about the problem. If you have any doubts about whether a particular action may be construed as cheating, ask the instructor for clarification before you do it. Cheating in this course will result in a grade of F for the course, and University policies will be followed.


Students with Disabilities

If you have a disability for which you are or may be requesting an accommodation, you are encouraged to contact both your instructor and the Disability Resource Center (DRC).


COVID-19

All students are expected to abide by campus policies regarding COVID-19 including masking and vaccination requirements. This is an in-person class with daily in-person activities, but we may consider a hybrid or online option. If you're feeling sick, stay at home and catch up with the course materials instead of coming to class!