CSCI 5541, NLP

Fall 2024, Tuesdays and Thursdays, 11:15am to 12:30pm, Keller Hall 3-125


Course Information


Summary: The purpose of this course is to provide an overview of the computational techniques developed to enable computers to interpret and respond appropriately to ideas expressed in natural languages, rather than formal languages such as C++ or Python. This course will cover text classification, distributional representation methods of language, large language models, and advanced techniques in ChatGPT. The course will cover a wide range of topics related to NLP, including theories, computational models, and applications with their societal and ethical impacts. Prerequisite: Maturity in linear algebra, calculus, and basic probability. Familiarity with Python. 5521 (recommended) or grad,

Natural Language Processing (NLP) is an interdisciplinary field that is based on theories in linguistics, cognitive science, and social science. The main focus of NLP is building computational models for applications such as machine translation and dialogue systems that can then interact with real users. Research and development in NLP therefore also includes considering important issues related to real-world AI systems, such as bias, controllability, interpretability, and ethics. This course will cover a broad range of topics related to NLP, from theories to computational models and applications to data annotation and evaluation. Students will read papers on those topics, create an annotated dataset, and implement algorithms on applications they are interested in. There will be a semester-long class project where you collect your own dataset, ensure it is accurate, develop a model using existing computing tools, evaluate the system, and consider its ethical and societal impacts.

Grades will be based on the course project, participation, and programming and reading assignments. All class material will be posted on the class site. We will use Canvas for homework and project submissions and grading, and Slack for discussion and Q&A. Email inquiries will not receive a reply.

Instructors

Dongyeop Kang (DK)
Instructor
Shirley Anugrah Hayati
Graduate TA
James Mooney
Graduate TA
Zheng Robert Jia
Undergraduate TA
Class meets
Tuesday and Thursday, 11:15AM to 12:30PM, Keller Hall 3-125
Office hours
DK: Friday 2:30pm - 3pm in Shepherd 259
Shirley: Monday 3:30pm - 4pm via Zoom
James: Wednesday 3pm - 4pm via Zoom
Robert: Tuesday 9-11am in Lind L103 Table 3
Class page
https://dykang.github.io/classes/csci5541/F24
Slack
csci5541f24.slack.com/
Canvas
canvas.umn.edu/courses/460609

Grading and Late Policy

Grading

Late policy for deliverables

Each student is granted 5 late days to use for homework over the duration of the semester. After all free late days are used up, the penalty is 1 point for each additional late day. Late days and penalties apply to all team members for group homework and the project.
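As a rough illustration of how the late-day budget and penalty interact, here is a minimal sketch. It assumes late days are counted in whole days and that free late days are consumed in submission order; neither detail is stated in the policy above, and the function name `late_penalty` is hypothetical.

```python
def late_penalty(days_late_per_hw, free_late_days=5, penalty_per_day=1):
    """Return (free late days used, total penalty points).

    days_late_per_hw: whole days late for each homework, in submission order.
    Assumption: free days are spent first-come, first-served.
    """
    remaining = free_late_days
    penalty = 0
    for days in days_late_per_hw:
        covered = min(days, remaining)  # days absorbed by the free budget
        remaining -= covered
        penalty += (days - covered) * penalty_per_day
    return free_late_days - remaining, penalty


# Example: 2 days late, on time, 4 days late, 1 day late.
used, penalty = late_penalty([2, 0, 4, 1])
print(used, penalty)
```

In this example the first five late days are free, and the remaining two each cost 1 point.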

Schedule


We will cover basic NLP representations g(x) to build text classifiers P_theta(y|g(x)), language models P_theta(g(x)), and large language models P_{theta is large}(g(x)). Based on the knowledge you gain during the class, your team will develop its own NLP system during the semester-long project. Pay attention to due dates and homework releases. Lecture slides and homework/project description will be available in .
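To make the notation concrete, here is a toy sketch of a representation g(x) and a classifier P_theta(y|g(x)). It is only an illustration, not the course's reference implementation: the vocabulary, labels, and weights are hypothetical and set by hand, whereas in practice theta is learned from data.

```python
import math
from collections import Counter

VOCAB = ["good", "bad", "movie"]    # hypothetical toy vocabulary
LABELS = ["positive", "negative"]   # hypothetical label set

# theta: one weight vector per label (hand-set here; normally learned)
THETA = {
    "positive": [1.0, -1.0, 0.0],
    "negative": [-1.0, 1.0, 0.0],
}


def g(text):
    """Bag-of-words representation: count of each vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def p_y_given_x(text):
    """P_theta(y | g(x)): softmax over linear scores theta_y . g(x)."""
    features = g(text)
    scores = [sum(w * f for w, f in zip(THETA[y], features)) for y in LABELS]
    return dict(zip(LABELS, softmax(scores)))


probs = p_y_given_x("good good movie")
print(probs)
```

A language model P_theta(g(x)) follows the same pattern, except that it assigns a probability to the text itself rather than to a label.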

Date Lectures and Dues Readings
Sep 3 Class Overview
Sep 5 Intro to NLP
HW1 out
Sep 9 Recitation on computing basics (Robert)
  • Env Setup
  • Colab+JupyterNotebook Tutorial
  • PyTorch Basics
Sep 10 Text Classification
Tutorial on Scikit-Learn and PyTorch (Shirley)
Sep 12 Text Classification (2)
Tutorial on Finetuning (James)
Sep 17 Distributional Semantics and Word Vectors
HW1 due
HW2 out
Project description out
Sep 19 Distributional Semantics and Word Vectors (2)
Project Team Formation due
Sep 24 Language Models (1): Ngram LM, Neural LM
HW2 due
Sep 26 Project Guideline
Language Models (2): RNNs, LSTMs and Sequence-to-Sequence
HW3 out
Oct 3 Language Models (4): Evaluation and Applications
HW4 out
Oct 8 Project Proposal Pitch (1)
HW3 due
Slide deck for Group A
Group A:
  • Visual Linguists (Koustav Banerjee, Hardik Gupta, Lin Xie) (mentor: Shirley, James)
  • Scale is all you need (Weiwen Chen) (mentor: James, DK)
  • LinguaTech (Yuxin Chen, Tong Liao, Jacob Sun) (mentor: James, Robert)
  • Mosaic Pineapple (Nhi Dang, Harrison Wallander, Riana Hoagland) (mentor: Shirley, James)
  • NextGenGenerative (Yi-ching Ho, Oran Frenstad, Weixuan Lin) (mentor: DK, Robert)
  • Netflix, Lazy, Procrastinate (Charles Hart, Isaac Blumhoefer, Lane Versteeg) (mentor: DK, Robert)
  • Rimika & Akansha (Rimika Dhara, Akansha Kamineni) (mentor: Shirley, Robert)
  • Backpropagation Nation (Thomas Knickerbocker, Owen Ratgen, Yashas Acharya) (mentor: Shirley, DK)
  • Tired Tokenizers (Nicholas Padilla, Jack LeGeault, Jacob Cadavez) (mentor: DK, James)
Oct 15 Contextualized Word Embeddings
Oct 17 Transformers

Oct 22 Pretraining and Scaling Laws
HW4 due
Oct 24 Prompting
Oct 29 Instructing and augmenting LLMs
HW5 out
Oct 31 LLMs as Agents (Zae)
Project midterm office-hour due
  • ReAct: Synergizing Reasoning and Acting in Language Models
  • MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
  • Generative Agents: Interactive Simulacra of Human Behavior
  • WebArena: A Realistic Web Environment for Building Autonomous Agents
Nov 5 Ethics and Explainability (Shirley)
Nov 7 All about Data and Annotation
HW6 out
Nov 12 No Class (EMNLP)
HW5 due
Nov 14 Efficient NLP (James)
Nov 19 Alignment (with Karin and Ryan)
Nov 21 Multimodal NLP (James)
HW6 due
  • Robust Speech Recognition via Large-Scale Weak Supervision
  • High-Resolution Image Synthesis with Latent Diffusion Models
  • Learning Transferable Visual Models From Natural Language Supervision
  • The Llama 3 Herd of Models
Nov 26 Human-centric NLP
Concluding Remarks
Nov 28 No Class (Thanksgiving)
Dec 3 Final Project Poster (1)
Project poster due
Posters for Group B
Dec 5 Final Project Poster (2)
Project final report due (Dec 12)
Posters for Group A

    Homework Details (60%)

    All questions regarding homework MUST be communicated to the lead TA over the Slack homework channels (e.g., #hw1, #hw2) or during their office hours. Homework 1, 2, 3, and 6 should be done individually, while homework 4 and 5 are team-based (maximum of 4 people). Your team for homework 4 and 5 should be the same as your project team. The use of outside resources (books, research papers, websites, etc.) or collaboration (students, professors, ChatGPT, etc.) must be explicitly acknowledged in your report. Check out the notes on academic integrity.

    The deadline for all homework is Friday midnight (11:59PM) of the due date. Due to a tight schedule, there will be no deadline extensions, but you can still use your late days. For late team homework, late days are counted for every team member. Check out the homework description and the link to Canvas for submission:

    Here are the homework assignments with due dates:

    • HW1: Building MLP-based text classifier with PyTorch (5 points, Individual, due: Sep 17 Tuesday, moved from Sep 13 Friday)
    • HW2: Finetuning text classifier using HuggingFace (10 points, Individual, due: Sep 24 Tuesday, moved from Sep 20 Friday)
    • HW3: Authorship attribution using language models (LMs) (10 points, Individual, due: Oct 8 Tuesday, moved from Oct 4 Friday)
    • HW4: Generating and evaluating text generated from pretrained LMs (15 points, Team, due: Oct 22 Tuesday, moved from Oct 18 Friday)
    • HW5: Prompting with large language models (LLMs) (15 points, Team, due: Nov 12 Tuesday)
    • HW6: Essay writing with ChatGPT (5 points, Individual, due: Nov 22 Friday)

    Project Details (30%)

    First, carefully read the project description, as most project information, due dates, the rubric, and answers to your questions are in the description document. It is your responsibility not to miss any information regarding the project. Your team (maximum of 4 people) should submit its report, link to code (or zipped code), and presentation slides/poster to Canvas before the deadline. Use the official ACL style templates (Overleaf or links). Here are the deliverables you have to submit for the project (note that some due dates fall on weekdays):

    • Team formation (1 point, due: Sep 19)
    • Project brainstorming (1 point, due: Oct 1)
    • Proposal pitch (3 points, due: Oct 8 and 10) (Slide decks for Group A and Group B)
    • Proposal report (5 points, due: Oct 13)
    • Midterm office hour participation (5 points, due: Oct 31)
    • Poster presentation (5 points, due: Dec 3 and 5)
    • Final report (10 points, due: Dec 12) (evaluation rubric)

    You can find selected project reports and posters from previous years' NLP classes below. Some projects were extended and published at top-tier workshops and conferences:

    • [CSCI 5541 S23] Simulating Everyone's Voice: Exploring ChatGPTs Ability to Simulate Human Annotators
    • [CSCI 5541 S23] Vision & Language-guided Generalized Object Grasping
    • [CSCI 5541 S23] Generalizability of FLAN-T5 Model Using Composite Task Prompting
    • [CSCI 5541 S23] Comparing the Effectiveness of Fine-tuning vs. One-Shot Learning on the Kidz Bopification Task
    • [CSCI 5980 F22] Generating Controllable Long-dialogue with Coherence (published in AAAI 2024)
    • [CSCI 8980 S22] Understanding Narrative Transportation in Fantasy Fanfiction (published in Workshop on Narrative Understanding (WNU) @ACL 2023)

    Class Participation (10%)

    Your class participation is thoroughly evaluated. Put your profile picture on Canvas and Slack so we can match you for the final evaluation. The following metrics will be used to grade your participation:
    • Participation and discussion in class
    • Discussion on Slack and during Office Hours for both instructor and TAs
    • Discussion and Q&A during the presentation of the project proposal and poster
    We explicitly count your offline and online contributions and (min/max) normalize the counts at the end of the class. Your participation score will be zero if you haven't participated in class, on Slack, or in other discussions.
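For reference, min/max normalization rescales each count to the range [0, 1] via (count - min) / (max - min). The sketch below is only an illustration of that formula; the student names and counts are made up, and the handling of the case where everyone has the same count is an assumption, since it is not specified here.

```python
def min_max_normalize(counts):
    """Rescale participation counts to [0, 1] using min/max normalization.

    counts: dict mapping a student to their number of contributions.
    """
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # Assumption: identical counts get full credit unless nobody participated.
        return {name: (1.0 if hi > 0 else 0.0) for name in counts}
    return {name: (c - lo) / (hi - lo) for name, c in counts.items()}


# Hypothetical counts for three students.
scores = min_max_normalize({"student_a": 0, "student_b": 5, "student_c": 10})
print(scores)
```

Note that under this formula the student with the lowest count maps to 0, even if that count is above zero.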

    Prerequisites

    Required: CSCI 2041 Advanced Programming Principles

    Recommended: CSCI 5521 Introduction to Machine Learning or any other course that covers fundamental machine learning algorithms.

    Furthermore, this course assumes:

    • Good coding ability, corresponding to at least a third or fourth-year undergraduate CS major. Assignments will be in Python.
    • Background in basic probability, linear algebra, and calculus.

    Notes to students

    Academic Integrity

    Assignments and project reports for the class must represent individual effort unless group work is explicitly allowed. Verbal collaboration on your assignments or class projects with your classmates and instructor is acceptable. However, everything you turn in must be your own work, and you must note the names of anyone you collaborated with on each problem and cite the resources that you used to learn about the problem. If you have any doubts about whether a particular action may be construed as cheating, ask the instructor for clarification before you do it. Cheating in this course will result in a grade of F for the course, and University policies will be followed.


    Students with Disabilities

    If you have a disability for which you are or may be requesting an accommodation, you are encouraged to contact both your instructor and the Disability Resource Center (DRC).


    COVID-19

    All students are expected to abide by campus policies regarding COVID-19 including masking and vaccination requirements. This is an in-person class with daily in-person activities, but we may consider a hybrid or online option. If you're feeling sick, stay at home and catch up with the course materials instead of coming to class!