CSCI 8980-06 Intro to NLP
Spring 2022, Tuesday and Thursday, 4:00pm to 5:15pm, Keller Hall 2-260
Course Information
Natural Language Processing (NLP) is an interdisciplinary field that is based on theories in linguistics and cognitive/social science. The main focus of NLP is building computational models for applications such as machine translation and dialogue systems that can then interact with real users. Research and development in NLP therefore also includes considering important issues related to real-world AI, such as bias, ethics, controllability, and interpretability. This course will cover a broad range of topics related to NLP, from theories to computational models to data annotation and evaluation, leading to in-depth discussions with students. Students will read papers on those topics, create linguistically annotated data, and implement algorithms on applications they are interested in. Note that I will teach "NLP with Deep Learning" in Fall 2022 for those who are interested in computational aspects of NLP.
There will be a semester-long class project in which you collect your own dataset, ensure it is accurate, develop a model using existing computing tools, evaluate the system, and consider its ethical and societal impacts. In each class, I will give a 30-minute lecture, and students will lead a discussion of the reading assignment for the remaining time. The grade will be based on the course project, participation, and assignments.
All class material will be posted on Canvas and on the class page. We will use Canvas for homework submission and grading, and Slack for discussion and Q&A. Please use the Slack channels rather than personal emails or messages to ask questions; this helps other students who may have the same question. Personal emails may not be answered. If you cannot make it to office hours, please use Slack to make an appointment.
- Instructor
- Dongyeop Kang (a.k.a DK)
- Class meets
- Tuesday and Thursday, 4:00pm to 5:15pm in Keller Hall 2-260
- Office hours
- Friday, 3:00pm to 4:30pm in Shepherd 259
- Class page
- dykang.github.io/classes/csci8980/Spring2021/
- Slack
- csci8980-06-s22.slack.com
- Canvas
- canvas.umn.edu/courses/302319
Schedule
We will cover basic models and representations, applications, and advanced topics.
Please pay attention to due dates and project presentations.
You can use DK's office hours for project discussion.
🍬 is an optional reading.
Date | Topic | Readings (schedule) |
W1: Jan 18 |
Class Overview [slides] HW1 out (Paper Presentation) |
|
W1: Jan 20 |
Text Classification [slides] |
|
W2: Jan 25 |
Topic Modeling [slides] HW2 out (Paper Replication) |
🍬Blei, D. M. (2012) Probabilistic topic models Communications of the ACM, 55(4), 77-84. 🍬K-Means Clustering with scikit-learn |
W2: Jan 27 |
Language Models [slides] Project consultation (Office Hour) |
|
W3: Feb 1 |
Lexical Semantics [slides] Project Description out [slides] |
|
W3: Feb 3 |
Distributional Semantics [slides] Project consultation (Office Hour) |
|
W4: Feb 8 |
Contextualized Word Embeddings [slides] |
🍬Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach 🍬Smith, N. A. (2019). Contextual Word Representations: A Contextual Introduction 🍬 Fine-tuning tutorial on HuggingFace |
W4: Feb 10 |
Discourse [slides] HW2 due, Feb 10 11:59pm Project consultation (Office Hour) |
|
W5: Feb 15 |
Machine Translation [slides] |
🍬 Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate |
W5: Feb 17 |
Question Answering and Reasoning [slides] HW3 out (Error Analysis) Project Proposal Due, Feb 17 11:59pm |
|
W6: Feb 22 |
Dialogue [slides] |
🍬 Lewis, M., Yarats, D., Dauphin, Y. N., Parikh, D., & Batra, D. (2017). Deal or No Deal? End-to-End Learning for Negotiation Dialogues 🍬 Kang, D., Balakrishnan, A., Shah, P., Crook, P., Boureau, Y. L., & Weston, J. (2019). Recommendation as a Communication Game: Self-Supervised Bot-Play for Goal-oriented Dialogue 🍬 Dialogue system development frameworks: ParlAI and ConvKit |
W6: Feb 24 |
Summarization [slides] |
|
W7: Mar 1 |
No class | |
W7: Mar 3 |
Styles [slides] |
|
Mar 8 |
No class: Spring Break |
|
Mar 10 |
No class: Spring Break HW3 Due, Mar 11 11:59pm (extended) |
|
W8: Mar 15 |
Mid-way Project Presentation (Group A) |
|
W8: Mar 17 |
Mid-way Project Presentation (Group B) |
|
W9: Mar 22 |
Generation [slides] Kathleen McKeown's keynote speech at ACL 2020, Rewriting the Past: Assessing the Field through the Lens of Language Generation |
|
W9: Mar 22 |
Coreference and IE |
🍬 NeuralCoref 4.0 |
W9: Mar 24 |
Dataset Annotation [slides] |
|
W10: Mar 29 |
Hypothesis testing and Evaluation [slides] Student discussion |
|
W10: Mar 31 |
Social NLP Guest lecture by Anjalie Field (CMU) [slides] [recording] |
|
W11: Apr 5 |
Biases and ethics Guest lecture by Dr. Jieyu Zhao (UMD) [slides] [recording] |
|
W11: Apr 7 |
Robust and Adversarial NLP Guest lecture by Eric Wallace (UC Berkeley) [recording] |
|
W12: Apr 12 |
Controllability Guest lecture by Dr. Sumanth Dathathri (Google DeepMind) [recording] HW4 out (Data annotation) |
|
W12: Apr 14 |
Data-centric NLP Guest lecture by Dr. Swabha Swayamdipta (AI2/USC) [recording will not be available] |
|
W13: Apr 19 |
Language Grounding to Vision and Robotics Guest lecture by Dr. Yonatan Bisk (CMU) [recording will not be available] |
|
W13: Apr 21 |
Final project presentation (A) |
|
W14: April 26 |
Final project presentation (B) |
|
W14: April 28 |
Final project presentation (C) HW4 Due, May 5, 11:59pm Project report Due, May 5 11:59pm |
|
Interpretability |
|
|
Human-in-the-loop and Interactive NLP |
|
Grading and Late Policy
Grading
- 40% Homeworks (total four homeworks)
- 50% Final Project
- 10% (potential bonus) Class Participation
- Active participation in class discussions and project presentations
Late policy for deliverables
Each student will be granted 3 late days to use for homeworks over the duration of the semester. After all free late days are used up, the penalty is 25% for each additional late day. Projects submitted late after all late days have been used will receive no credit.
Homework Details (40%)
HW1: Paper Presentation (10%)
Please check the list of papers in the Readings tab of the schedule and place your name on two papers in this sheet. Presenters are limited to two per class, so do not sign up if two presenters are already assigned, except for Jan 27 (the first two papers for Jan 27 will be presented on Jan 25).
You are responsible for presenting the papers in class and leading the discussion. In every class, two students each present for 20 minutes, including Q&A and discussion. First, give an overview of the paper (10 minutes), then prepare three discussion points (10 minutes), such as limitations of the proposed method, future directions, or links to similar papers.
Please upload your slides here before class. You may borrow slides from the authors, but you must have a deep understanding of the work and provide potential discussion points. The filename of your slides should be 0120_{Paper Title}_{Your Name}_{first,second}.{pptx,pdf}. Sometimes more than two comparative or incremental papers are assigned in one bullet point; in that case, you should compare them and will earn a bonus point (2%). In some classes, like Jan 25, there are no specific papers to read, so we discuss papers from the next lecture's topics.
HW2: Paper Replication (10%)
Due: Feb 10 11:59pm
You will get a taste of NLP leaderboard culture in this homework. Choose one of the following NLP tasks and replicate/reimplement the model. I strongly recommend using existing code written by the authors that appears on the Papers-with-Code leaderboard, or using basic Transformer models implemented in the HuggingFace libraries. Do not spend too much time replicating the code. Instead, run existing code on your target dataset, make sure you use the same evaluation metrics as the paper, and compare your results to the paper's.
Note that failing to correctly cite any tool or paper you consulted will be treated as cheating. This homework will serve as the foundation for homeworks 3 and 4, and possibly your project.
Choose one of the following models and datasets. If you would like to choose a different task or dataset, please talk to DK by Jan 28.
Tasks | Datasets |
Sentiment classification |
|
Natural Language Inference |
|
Commonsense Reasoning |
|
Dialogue, Summarization, and Style Transfer |
|
QA and Visual QA |
|
Semantic Evaluation |
Please upload your code and report to Canvas by Feb 10 11:59pm.
- Code: a zipped file containing your training/inference scripts.
- Report: 2 - 3 pages, including model description with references, link to the original codes you referred to, evaluation metrics, performance comparison with other models in the leaderboard, training/inference time, sweeping of hyperparameters (e.g., learning parameter, dropout rate), and other details of the experiment.
- Papers-with-Code for leaderboards and state-of-the-art models
- Google Colab for model training
- ParlAI and ConvKit for dialogue
- HuggingFace Datasets and HuggingFace Models for Transformer models
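Whichever task you choose, the comparison to the paper only holds up if you score your replication with the same metrics the paper reports. As a minimal sketch (with hypothetical gold labels and predictions standing in for your model's actual output), scikit-learn's metric functions can be used like this:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical gold labels and model predictions on a binary sentiment test set.
gold = [1, 0, 1, 1, 0, 1, 0, 0]
preds = [1, 0, 1, 0, 0, 1, 1, 0]

accuracy = accuracy_score(gold, preds)  # fraction of exact matches
f1 = f1_score(gold, preds)              # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f}, F1={f1:.2f}")
```

Report both metrics only if the paper does; otherwise match its choice exactly (e.g., accuracy for SST-2, macro-F1 where class balance matters).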
HW3: Error Analysis (10%)
Due: Mar 11 11:59pm
You will now analyze the errors of the model you implemented in the previous homework. HW3 consists of four steps, each with a bonus point, so please be creative and try other analysis techniques.
- Step #1: collect and featurize errors. First, store the samples your HW2 system predicted incorrectly (no more than 500) in a Google spreadsheet. For each sample, store the following information in separate columns:
- Input text
- Ground-truth label
- Predicted label with confidence score (i.e., softmax output from your classifier to the ground-truth label)
- (bonus points) other metadata or linguistic features using Spacy or other tools
E.g., length of the sentence, POS tags, named entities, sentiment score, etc.
- Step #2: label error types and fixes. Go through each row and manually label it with the following categories:
- Types/Causes of errors, e.g., incorrect annotation and over-generalization
- Potential solutions to fix the cause, e.g., more training samples and some rules
- Step #3: visualize errors. Visualize the errors and correctly predicted samples in a 2-dimensional semantic space and explore an overall view of how they are projected. Semantic space:
- Take vector representations of correct and incorrect samples from the classifier’s output (HuggingFace's model output class)
- Project them in reduced dimensions using PCA or t-SNE (paper, code) (i.e., 768 dimension -> 2 dimension)
- Project the samples in dataset map space (paper1 and paper2)
- where the x-axis is the confidence score for the ground-truth label and the y-axis is the variance of the classifier's predictions over the training epochs
- Step #4: analyze. Summarize the important findings from the previous analysis steps. Discuss the limitations of the model used in your HW2 and potential future directions to address the errors.
- e.g., apply movie review to sentiment classifier trained on SST2
- e.g., apply medical text to entailment classifier trained on MNLI
(Bonus point) Be creative in thinking of new error types and potential solutions.
(Bonus point) Try a different out-of-distribution dataset on the same task.
Please upload your annotated spreadsheet and report to Canvas by Mar 11 11:59pm.
- Spreadsheet: maximum 500 error samples with annotated errors, extracted features, and labeled types/fixes.
- Report: maximum 4 pages, including the distribution of features/types/fixes, visualizations, and in-depth analysis and discussion.
- Find some examples from the lecture slides: S2-14
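The Step #3 projection can be sketched in a few lines with scikit-learn's PCA; here the 768-dimensional vectors are random placeholders standing in for your classifier's actual output representations:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder 768-dim representations: 10 correctly and 10 incorrectly
# predicted samples (in HW3 these come from your HuggingFace model's output).
correct = rng.normal(0.0, 1.0, size=(10, 768))
errors = rng.normal(0.5, 1.0, size=(10, 768))
features = np.vstack([correct, errors])

# Reduce 768 dimensions to 2 for plotting; t-SNE is used the same way
# via sklearn.manifold.TSNE.
coords = PCA(n_components=2).fit_transform(features)
print(coords.shape)  # one (x, y) point per sample
```

You can then scatter-plot `coords`, coloring correct and incorrect samples differently, to see whether the errors cluster in a particular region of the space.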
HW4: Data Annotation (10%)
Due: May 5, 11:59pm
In this assignment, you will learn how data annotation works in NLP research and how important it is in NLP model development. You will form a group of three or four people, collect 300 adversarial samples on a target task that can fool the system you built in homework 2, have each team member annotate them, calculate inter-annotator agreement (IAA), and write a short report on your experience.
Please read this description carefully.
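For the IAA step, Cohen's kappa is a common choice for a pair of annotators (with three or four annotators, you would compute it pairwise or use a generalization such as Fleiss' kappa). A minimal sketch with made-up labels, using scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Made-up labels from two annotators on the same five samples.
annotator_a = ["pos", "neg", "pos", "pos", "neg"]
annotator_b = ["pos", "neg", "neg", "pos", "neg"]

# Cohen's kappa corrects raw agreement for agreement expected by chance:
# here raw agreement is 4/5 = 0.8, but kappa is lower (~0.62).
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(round(kappa, 3))
```

Low kappa on your adversarial samples usually means the annotation guidelines are ambiguous; refine the guidelines and re-annotate before reporting.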
Project Details (50%)
The class project is meant for a group of 2~3 students to experience the full pipeline of NLP research: from data annotation to model development, experiments and error analysis, visualization, and discussion of limitations and ethical issues. Please read the project description slides.
A course project would be one of the following types:
- New research results judged suitable for acceptance to a top NLP or ML conference like ACL/EMNLP/NeurIPS/ICLR,
- Evaluation and critical analysis of existing work on a new dataset,
- An in-depth literature survey, or
- New open-source repository or dataset with a high impact on the community
Your project will be evaluated in the following criteria:
- Proposal and literature review (10%), Due: Feb 17, 11:59pm
- Maximum 3 pages
- Midterm presentation (10%), Mar 15 and 17
- 10-min presentation and 5-min QA
- Check out the presentation schedule
- Upload your slides here before the class
- Expected content to be presented:
- Specific feedback you would like to get from the audience
- Motivation
- Problem definition
- Novel contribution compared to prior work
- Proposed methods
- Initial results
- Plan for the second half of the semester
- Final presentation (10%), April 21, 26, and 28
- 15-min presentation and 10-min QA
- Check out the final presentation schedule
- Upload your slides here before your presentation
- Expected content to be presented:
- Motivation, problem definition, and novel contribution compared to prior work
- Proposed methods with "motivational examples"
- Experimental setups and final results
- Discussion on limitations, ethical issues, etc
- Conclusion and future directions
- Final report and code (20%), Due: May 5, 11:59pm
- Maximum 8 pages
- Rubric for evaluation
Every group member should submit their report, link to code, and presentation slides on Canvas before the deadline. For both the proposal and the final report, please use the official ACL style templates (Overleaf or links). Note that your report and slides will be publicly shared on this page.
Prerequisites
CSCI 5521 Introduction to Machine Learning or any other course that covers fundamental machine learning algorithms.
Furthermore, this course assumes:
- Good coding ability, corresponding to at least a third or fourth-year undergraduate CS major. Assignments will be in Python.
- Background in basic probability, linear algebra, and calculus.
Notes to students
Academic Integrity
Assignments and project reports for the class must represent individual effort unless group work is explicitly allowed. Verbal collaboration on your assignments or class projects with your classmates and the instructor is acceptable. However, everything you turn in must be your own work, and you must note the names of anyone you collaborated with on each problem and cite the resources you used to learn about the problem. If you have any doubts about whether a particular action may be construed as cheating, ask the instructor for clarification before you do it. Cheating in this course will result in a grade of F for the course, and University policies will be followed.
Students with Disabilities
If you have a disability for which you are or may be requesting an accommodation, you are encouraged to contact both your instructor and Disability Resources Center (DRC).
COVID-19
All students are expected to abide by campus policies regarding COVID-19 including masking and vaccination requirements. This is an in-person class with daily in-person activities, but we may consider a hybrid or online option. If you're feeling sick, stay at home and catch up with the course materials instead of coming to class!
Textbook / Related Classes / Online Resources
Book
A textbook is not required, but the following books are the primary references:
- Jurafsky and Martin, Speech and Language Processing, 3rd edition [online]
- Jacob Eisenstein. Natural Language Processing
Resources
- From Languages to Information, Dan Jurafsky, Stanford
- Natural Language Understanding, Christopher Potts, Stanford, Spring 2021
- Natural Language Processing, David Bamman, UC Berkeley
- Algorithms for NLP, Yulia Tsvetkov and David Mortensen, CMU