Contents
Linguistic information processing (natural language processing, computational linguistics) deals with the processing of natural language using computers. Well-known applications include machine translation systems and the extraction of information from large amounts of text. In this lecture, we cover the theoretical and practical foundations of language processing. In addition to the special features of language and linguistic description categories, we take a quantitative look at language and at the various ways of automatically recognizing and processing linguistic phenomena, with a particular focus on machine learning methods.
Coursework and Module Exam
The coursework consists of completing and submitting around five homework assignments, which are set over the course of the semester. The module examination takes the form of a written exam.
Agenda
- 10.10. Cancelled
- 17.10. Introduction, overview, ambiguity, linguistic levels (Slides)
- Literature: Tsujii (2021), Mitkov (2022)
- Assignment 1 (until 30.10.): Read Tsujii (2021). Post one question and/or comment in Ilias. We will discuss the questions and comments on Oct. 31.
- 24.10. Linguistic levels (Slides)
- Literature: Mitkov (2022)
- 31.10. Discussion of assignment 1 (submitted questions), text corpora (Slides)
- Literature: Manning/Schütze (1999)
- 07.11. A quantitative look at words, Zipf's law, type-token ratio; automatic prediction of linguistic properties, evaluation, task types (Slides)
- Assignment 2 (until 21.11.): Go to https://opendiscourse.de/ and obtain speeches from (at least) two different politicians (preferably from different parties) so that you have a total of more than 10,000 words per person. Then write a program in a programming language of your choice that calculates the type-token ratio for both. Document and interpret the result and upload it to Ilias! (A sketch of a possible type-token ratio computation is given in the code sketches after the agenda.)
- 14.11. Annotation (Slides)
- Literature: Hovy/Lavid (2010), Reiter (2020)
- Assignment 3 (until 28.11.): Perform one linguistic and one non-linguistic annotation task. You'll find details in Ilias.
- 21.11. Discussion of assignment 2, machine learning 1: Naive Bayes (Slides)
- Literature: Jurafsky/Martin (2023, Chapter 4)
- 28.11. Discussion of assignment 3, evaluation of machine learning systems (Slides)
- Literature: Skansi (2018)
- 05.12. Logistic regression, gradient descent (today only until 12:55; Slides)
- Literature: Skansi (2018)
- Assignment 4 (until 19.12.): Train a logistic regression model to recognize handwritten digits. The digits were written by hand, scanned in black and white, and the images were then provided as 28x28 matrices with grayscale values. Only the digits zero and one are involved, so this is a binary classification task. You can find the training and test data in Ilias, as well as a Python script with a function to read in the data. Use the scikit-learn library for the actual training (and have a look around to see what else the library has to offer). (A possible starting point is given in the code sketches after the agenda.)
- 12.12. Neural networks (Slides)
- Literature: Mikolov et al. (2013), Pilehvar/Camacho-Collados (2020)
- Links:
- Tensorflow-Playground: train and visualize a neural network in your browser
- Keras: Python library for deep learning
- Other relevant Python libraries: numpy, scikit-learn, pandas
- 19.12. Word embeddings, multi-class classification (Slides, script 1, script 2, script 3)
- Literature: Jurafsky/Martin (2023, draft)
- Assignment 5 (until 16.01.): see slides
- 09.01. Transformer models, BERT
- Literature: Roberts et al. (2021), Skansi (2018)
- 16.01. Discussion of assignment 5, generative models and large language models (LLMs)
- Literature: TBA
- 23.01. General exam questions
- 30.01. Exam
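Code Sketches
The following sketches are illustrations, not part of the assignment specifications. For Assignment 2, the sketch below shows one possible way to compute the type-token ratio in Python; the file names, the whitespace tokenization, and the lowercasing are assumptions and should be adapted to the speeches you download from opendiscourse.de.

```python
# Sketch for Assignment 2: type-token ratio (TTR) of a text.
# File names, tokenization (whitespace split), and lowercasing are assumptions.

def type_token_ratio(text: str) -> float:
    """Number of distinct word forms (types) divided by the number of running words (tokens)."""
    tokens = text.lower().split()   # very simple whitespace tokenization
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

if __name__ == "__main__":
    # Hypothetical file names: one plain-text file with the speeches of each politician.
    for filename in ["politician_a.txt", "politician_b.txt"]:
        with open(filename, encoding="utf-8") as f:
            print(f"{filename}: TTR = {type_token_ratio(f.read()):.4f}")
```

Note that the TTR depends strongly on text length, which is one reason the assignment asks for a comparable amount of text (more than 10,000 words) per person.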
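For Assignment 4, the sketch below shows the scikit-learn part of the task, assuming the images have already been read in and flattened to 784-dimensional vectors. The random placeholder data and the variable names are assumptions; in the assignment, the arrays come from the reader script provided in Ilias.

```python
# Sketch for Assignment 4: binary classification of handwritten 0s and 1s with
# logistic regression in scikit-learn. The data below is a random placeholder;
# replace it with the output of the reader script provided in Ilias.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train = rng.random((200, 28 * 28))   # flattened 28x28 grayscale images
y_train = rng.integers(0, 2, 200)      # labels: digit 0 or digit 1
X_test = rng.random((50, 28 * 28))
y_test = rng.integers(0, 2, 50)

model = LogisticRegression(max_iter=1000)  # plain logistic regression classifier
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Accuracy on the test set:", accuracy_score(y_test, predictions))
```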