CSCI 5541, NLP

Spring 2023, Monday and Wednesday, 4:00pm to 5:15pm, Mechanical Engineering 108

Course Information

Natural Language Processing (NLP) is an interdisciplinary field that draws on theories from linguistics, cognitive science, and social science. Its main focus is building computational models for applications, such as machine translation and dialogue systems, that interact with real users. Research and development in NLP therefore also involves issues that matter for real-world AI systems, such as bias, controllability, interpretability, and ethics. This course covers a broad range of NLP topics, from theories to computational models, and from applications to data annotation and evaluation, with in-depth class discussions. Students will read papers on these topics, create an annotated dataset, and implement algorithms for applications they are interested in. In a semester-long class project, you will collect your own dataset, verify its quality, develop a model using existing computing tools, evaluate the system, and consider its ethical and societal impacts. Grades are based on the course project, participation, and homework assignments.

8980 vs. 5980 vs. 5541: Some lectures are shared across the three classes, but each has a different focus. 5980 (NLP with Deep Learning) concentrates on the "processing" side of NLP, particularly with deep learning methods; students will get an introduction to cutting-edge deep learning techniques for NLP. 8980 (Intro to NLP Research) covers broad aspects of NLP research as an interdisciplinary problem, including theoretical grounding, data annotation, error analysis, and applications to different fields. 5541 (NLP, the current course) is an introductory class covering basic NLP techniques and applications such as question answering, dialogue, and machine translation.

All class material will be posted on the class page. We will use Canvas for homework submission and grading, and Slack for discussion and Q&A.

Instructor
Dongyeop Kang (a.k.a. DK)
Teaching Assistants
Debarati Das
Risako Owan
Class meets
Monday and Wednesday, 4:00pm to 5:15pm, Mechanical Engineering 108
Office hours
DK: Friday, 4pm to 5pm in Shepherd 259
TAs: Tuesday and Thursday, 3pm to 4pm via Zoom

Class page

Grading and Late Policy


  • 65% Homeworks (five graded homeworks; HW0 is ungraded)
  • 25% Project
  • 10% Class Participation
    • Active participation in class discussions and project presentations

Late policy for deliverables

Each student will be granted 3 free late days to use on homeworks over the duration of the semester. After all free late days are used up, the penalty is 25% for each additional late day. However, project deliverables submitted late after all late days have been used will receive no credit.
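As an illustrative sketch only (not official grading code), the homework portion of the policy above can be read as follows; the function name and interface are hypothetical:

```python
def late_penalty(days_late: int, free_days_remaining: int) -> float:
    """Illustrative sketch of the homework late policy (not official grading code).

    Students get 3 free late days per semester; once those are exhausted,
    each additional late day costs 25% of the homework score.
    Returns the fraction of credit retained.
    """
    covered = min(days_late, free_days_remaining)   # days absorbed by free late days
    penalized_days = days_late - covered            # days that incur the 25% penalty
    penalty = min(1.0, 0.25 * penalized_days)       # credit cannot go below zero
    return 1.0 - penalty

# Example: 2 days late with 3 free days left -> full credit (1.0)
# Example: 5 days late with 3 free days left -> 2 penalized days -> 0.5 credit
```

So a homework turned in two days past the deadline costs nothing while free late days remain, but each day beyond them costs a quarter of the score.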


We will cover basic NLP representations and applications, along with some advanced topics. Please pay attention to due dates and project presentations.

Schedule

Jan 16: No Class (Martin Luther King Jr. Day)
Jan 18: Class Overview (slides)
Jan 23: Intro to NLP (slides (updated))
  HW0 out (link)
Jan 25: Text Classification 1 (slides (updated))
  Tutorial on Python programming (notebook) (Debarati Das)
Jan 30: Text Classification 2
  Tutorial on PyTorch programming (notebook, slides) (Debarati Das)
  HW0 due
  • (optional) Conversations Gone Awry: Detecting Early Signs of Conversational Failure
  • (optional) Style is NOT a single variable: Case Studies for Cross-Style Language Understanding
  • Text classifier with NLTK and Scikit-Learn
Feb 1: Tutorial on HuggingFace library (notebook) (Risako Owan)
  HW1 out (link)
Feb 6: Lexical Semantics (slides (updated))
Feb 8: No Class (AAAI)
Feb 13: Distributional Semantics and word vectors 1 (slides (updated))
  • From Frequency to Meaning: Vector Space Models of Semantics
  • Efficient Estimation of Word Representations in Vector Space
  • Gensim's word2vec tutorial
Feb 15: Distributional Semantics and word vectors 2
  HW1 due
  HW2 out (link)
  • Linguistic Regularities in Continuous Space Word Representations
  • GloVe: Global Vectors for Word Representation
  • Retrofitting Word Vectors to Semantic Lexicons
Feb 20: Language Models 1: Ngram LM, Neural LM, RNNs (slides (updated))
  • Chapter 3 of Jurafsky and Martin
  • A Neural Probabilistic Language Model
  • Long Short-Term Memory
Feb 22: Project Guideline (slides, doc)
Feb 27: Language Models 2: Search Algorithms (slides (updated))
  • The Curious Case of Neural Text Degeneration
  • Mutual Information and Diverse Decoding Improve Neural Machine Translation
Mar 1: Language Models 3: Search in Training, Evaluation (slides)
  HW2 due
  HW3 out (link)
  • Sequence Level Training with Recurrent Neural Networks
  • An Actor-Critic Algorithm for Sequence Prediction
  • Training language models to follow instructions with human feedback
Mar 6: No class (Spring Break)
Mar 8: No class (Spring Break)
Mar 13: Contextualized Word Embeddings (slides (updated))
  • Deep contextualized word representations
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • A Primer in BERTology: What we know about how BERT works
Mar 15: Deep Dive on Transformer (slides (updated))
  Project proposal due (Mar 17)
  • Attention is All you Need
  • Tutorial on Illustrated Transformer
  • Language Models are Unsupervised Multitask Learners
  • Language Models are Few-Shot Learners
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Mar 20: Deep Dive on Transformer (continued)
Mar 22: Pretraining (slides (updated))
  HW3 due (Mar 24)
  HW4 out (link)
  • Scaling Laws for Neural Language Models
  • On the Opportunities and Risks of Foundation Models
  • On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
Mar 27: Prompting (1) (slides (updated))
  • Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
  • Calibrate Before Use: Improving Few-Shot Performance of Language Models
  • Prefix-Tuning: Optimizing Continuous Prompts for Generation
Mar 29: Project proposal showcase and Q&A
  Mentor: Risako Owan (zoom, slides)
  Mentor: Debarati Das (zoom, slides)
  Mentor: DK (in class and zoom, slides)
Apr 3: Prompting (2) (slides (updated))
Apr 5: Application: Data Annotation (slides)
  HW4 due (Apr 7)
  HW5 out (link)
Apr 10: Application: Data Annotation (continued)
Apr 12: Application: Machine Translation (slides)
  Project midterm discussion due (Apr 14)
Apr 17: Application: Question Answering (slides)
  HW5 due
  • SQuAD: 100,000+ Questions for Machine Comprehension of Text
  • HOTPOTQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
  • Learning to Compose Neural Networks for Question Answering
Apr 19: Application: Generation and Summarization (slides)
  Application: Dialogue (slides)
Apr 24: Final Project Poster Session A
Apr 26: Final Project Poster Session B
May 5: Final project report due

    Homework Details (65%)

    Collaboration is required (maximum of 4 people per group). Direct questions to the TAs, and please use the shared Slack channels (e.g., #hw1) so answers are visible to everyone. The use of outside resources (books, research papers, websites, etc.) or collaboration (students, professors, ChatGPT, etc.) must be explicitly acknowledged. Check out the notes to students.

    • HW0: Building a text classifier with PyTorch from scratch (0 points, due: Jan 30) (link)
    • HW1: Finetuning a text classifier using HuggingFace (15 points, due: Feb 15) (link)
    • HW2: Building ngram language models (LM) from scratch (15 points, due: Mar 1) (link)
    • HW3: Generating text from pretrained LMs (15 points, due: Mar 24) (link)
    • HW4: Prompting with large language models (LLMs) (10 points, due: Apr 7) (link)
    • HW5: Evaluating NLP systems (10 points, due: Apr 17) (link)

    Project Details (25%)

    Please carefully read the project description (link) first. Every group member (maximum of 4 people) should submit their report, a link to the code (or zipped code), and presentation slides/poster on Canvas before the deadline. Your project will be evaluated on the following criteria:

    • Proposal report (5 points, due: Mar 17)
    • Midterm office hour participation (5 points, due: Apr 10)
    • Final report and poster presentation (15 points, due: May 5)
    For both the proposal and final reports, please use the official ACL style templates (Overleaf or links). Your final project report will be evaluated based on this rubric. Note that your report and slides will be publicly shared. A course project may be one of the following types:
    • Critical analysis of an existing model/dataset (default project),
    • New research results judged suitable for acceptance to an NLP or ML workshop,
    • Collection of your own dataset on new problems, or adversarial datasets that can fool existing systems,
    • An in-depth literature survey on emerging topics,
    • Interactive demonstration (e.g., Chrome Extension, Flask) or visualization of existing systems,
    • New open-source repository or dataset with a high impact on the community.
    You can find some of the previous years' project reports below:


    Prerequisites

    Required: CSCI 2041 Advanced Programming Principles

    Recommended: CSCI 5521 Introduction to Machine Learning or any other course that covers fundamental machine learning algorithms.

    Furthermore, this course assumes:

    • Good coding ability, corresponding to at least a third or fourth-year undergraduate CS major. Assignments will be in Python.
    • Background in basic probability, linear algebra, and calculus.

    Notes to students

    Academic Integrity

    Assignments and project reports for the class must represent individual effort unless group work is explicitly allowed. Verbal collaboration on your assignments or class projects with your classmates and instructor is acceptable, but everything you turn in must be your own work, and you must note the names of anyone you collaborated with on each problem and cite the resources you used to learn about the problem. If you have any doubts about whether a particular action may be construed as cheating, ask the instructor for clarification before you do it. Cheating in this course will result in a grade of F for the course, and University policies will be followed.

    Students with Disabilities

    If you have a disability for which you are requesting, or may request, an accommodation, you are encouraged to contact both your instructor and the Disability Resources Center (DRC).


    COVID-19

    All students are expected to abide by campus policies regarding COVID-19, including masking and vaccination requirements. This is an in-person class with daily in-person activities, but we may consider a hybrid or online option. If you're feeling sick, stay home and catch up with the course materials instead of coming to class!


    Textbooks

    No textbook is required, but the following books are the primary references:
    • Jurafsky and Martin, Speech and Language Processing, 3rd edition [online]
    • Jacob Eisenstein. Natural Language Processing


    The course materials are inspired by the slides of Chris Manning at Stanford, David Bamman at UC Berkeley, and Graham Neubig at CMU.