Contents

Linguistic information processing (natural language processing, computational linguistics) is the processing of natural language by computers. Well-known applications include machine translation and the extraction of information from large amounts of text. This lecture covers the theoretical and practical basics of language processing. Besides the special features of language and the categories used to describe it linguistically, we take a quantitative look at language and at the ways in which linguistic phenomena can be recognized and processed automatically. The focus here is on machine learning methods.

Course Work and Module Exam

The coursework consists of completing and submitting around five homework assignments, which are set during the semester. The module examination is a written exam.

Agenda

  • 10.10. Cancelled
  • 17.10. Introduction, overview, ambiguity, linguistic levels (Mitkov 2022)
    • Literature: Tsujii (2021), Mitkov (2022)
    • Assignment 1 (until 30.10.): Read Tsujii (2021) and post at least one question or comment in Ilias. We will discuss the questions and comments on 31.10.
  • 24.10. Linguistic levels
    • Literature: Mitkov (2022)
  • 31.10. Discussion of assignment 1, text corpora
    • Literature: Manning/Schütze (1999)
  • 07.11. Looking at words quantitatively: Zipf, type-token ratio; automatic prediction of linguistic properties, evaluation, task types
    • Assignment 2 (until 14.11.): Go to https://opendiscourse.de/ and obtain speeches by at least two different politicians (preferably from different parties), so that you have more than 10,000 words in total per person. Then write a program in a programming language of your choice that calculates the type-token ratio for each of them. Document and interpret the result and upload it to Ilias.
  • 14.11. Discussion of assignment 2, annotation
    • Literature: Hovy/Lavid (2010), Reiter (2020)
    • Assignment 3 (until 28.11.): Perform one linguistic and one non-linguistic annotation task. You will find the details in Ilias.
  • 21.11. Machine learning 1: Naive Bayes
    • Literature: Jurafsky/Martin (2023, Chapter 4)
  • 28.11. Discussion of assignment 3, evaluation of machine learning systems
    • Literature: Skansi (2018)
  • 05.12. Logistic regression, gradient descent
    • Literature: Skansi (2018)
    • Assignment 4 (until 19.12.): Train a logistic regression model to recognize handwritten digits. The digits were written by hand, scanned in black and white, and provided as 28x28 matrices of grayscale values. The data contains only the digits 0 and 1, so this is a binary classification task. You can find the training and test data in Ilias, along with a Python script containing a function that reads in the data. Use the scikit-learn library for the actual training (and have a look around to see what else the library has to offer).
  • 12.12. Neural networks
    • Literature: Mikolov et al. (2013), Pilehvar/Camacho-Collados (2020)
  • 19.12. Word embeddings, overfitting
    • Literature: Jurafsky/Martin (2023, draft)
    • Assignment 5 (until 16.01.): TBA
  • 09.01. Transformer models, BERT
    • Literature: Roberts et al. (2021), Skansi (2018)
  • 16.01. Discussion of assignment 5, generative models and large language models (LLMs)
    • Literature: TBA
  • 23.01. General exam questions
  • 30.01. Exam
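
As a starting point for Assignment 2, the type-token ratio (number of distinct word forms divided by the number of running words) can be sketched in Python as follows. The tokenization used here (lowercasing, treating runs of letters as tokens) and the function name `type_token_ratio` are assumptions made for illustration; the assignment does not prescribe either, and other choices will give different values.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return the number of types divided by the number of tokens."""
    # Assumption: lowercase everything and treat each run of letters
    # (including umlauts and ß) as one token; digits and punctuation are dropped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0  # avoid division by zero for empty input
    return len(set(tokens)) / len(tokens)


# Toy input for illustration only, not data from opendiscourse.de.
sample = "The cat sat on the mat and the dog sat too"
print(round(type_token_ratio(sample), 3))  # 8 types / 11 tokens -> 0.727
```

The ratio generally falls as a text gets longer, so when comparing two speakers it may be worth computing it on samples of equal length.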
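
For Assignment 4, the overall scikit-learn workflow might look like the sketch below. It does not use the course data or the reading script from Ilias. As a stand-in it loads scikit-learn's bundled 8x8 digits data set, restricted to the digits 0 and 1. The helper name `train_and_evaluate`, the scaling by the maximum pixel value and the 75/25 split are assumptions for illustration. The 28x28 course images would first have to be flattened into vectors of 784 values.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(X, y, seed=0):
    """Fit a logistic regression model and return its accuracy on held-out data."""
    # Assumption: a 75/25 split; the course provides separate train/test files.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Stand-in data: 8x8 images of the digits 0 and 1, already flattened to 64 values.
# For 28x28 arrays, something like images.reshape(len(images), -1) would flatten them.
X, y = load_digits(n_class=2, return_X_y=True)
X = X / X.max()  # scale grayscale values to the range [0, 1]

print(f"Accuracy on held-out data: {train_and_evaluate(X, y):.3f}")
```

Accuracy is only one possible measure. The session on the evaluation of machine learning systems covers how to assess such a model.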

Course materials

Ilias | Literature (Zotero)