This lecture covers data-driven text analysis. It discusses supervised and unsupervised machine learning methods as well as the evaluation and assessment of quantitative results. The lecture takes a methodological look at various problems in language processing and at how they can be, and are, addressed. Most approaches can be understood on several levels, all of which will be covered: What is the idea or intuition? How can it be formalized, often with the help of mathematical models? How can the formal model be implemented (efficiently)? Where necessary, the lecture also introduces the basics of the formal models and programming concepts involved.

Please note that the class is taught in German, while the material will (mostly) be in English. Questions in English during class are of course welcome.

Lecture and Exercise

Lecture (Thursdays) and tutorial (Tuesdays) are closely related in terms of content. Formally, they are two separate courses, namely "Computerlinguistik Übung" and "Sprachverarbeitung". If you do not want to or cannot attend both courses, you are strongly advised to consult the instructors. Please bring your own computer to the tutorial sessions.

Modules in Computational Linguistics

Starting with the winter semester 2022/2023, we have developed a new concept for the computational linguistics curriculum in the BA program Informationsverarbeitung.

  • Module Grundlagen der Computerlinguistik (old study regulations: "Computerlinguistische Grundlagen")
    • Seminar Computerlinguistische Grundlagen (always in the winter semester, instructor: Hermes, content: linguistic foundations, annotation)
    • Lecture Sprachverarbeitung (always in the summer semester, instructor: Reiter, content: quantitative properties of language, machine learning)
    • Exercise Sprachverarbeitung (always in the summer semester, instructor: Pagel, accompanies the lecture, formerly Seminar II)
    • Module exam: written exam (always in the summer semester, 90 minutes; partial assessment possible in the winter semester, 30 minutes)
  • Module Anwendungen der Computerlinguistik (old study regulations: "Angewandte Linguistische Datenverarbeitung")
    • Exercise Deep Learning (always in the winter semester, instructor: Nester, content: deep learning methods)
    • Advanced seminar Experimentelles Arbeiten in der Sprachverarbeitung (always in the winter semester, instructor: Reiter, content: experiments in computational linguistics; where do progress and insight come from?)
    • Module exam: term paper with a computational linguistics experiment

Study Achievements and Examination

We will start each exercise together in the Tuesday session; all exercises should then be finished at home. Five times during the semester, you need to upload your results to Ilias (as a zip file). There will be a written exam in the final week of the course.
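
For the zip upload, a minimal Python sketch might look like the following. It assumes your results for one exercise live in a single folder; the function name `zip_exercise` and all folder and file names are purely illustrative and not prescribed by the course.

```python
import shutil
import tempfile
from pathlib import Path


def zip_exercise(folder: Path, archive_stem: Path) -> Path:
    """Pack the contents of `folder` into `<archive_stem>.zip` and return its path.

    Hypothetical helper for illustration; not part of any course tooling.
    """
    # shutil.make_archive appends the ".zip" suffix itself and
    # returns the name of the archive it created.
    archive = shutil.make_archive(str(archive_stem), "zip", root_dir=folder)
    return Path(archive)


if __name__ == "__main__":
    # Demo with throwaway files; replace the paths with your own exercise folder.
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "exercise_01"
        work.mkdir()
        (work / "solution.py").write_text("print('hello corpus')\n", encoding="utf-8")
        archive = zip_exercise(work, Path(tmp) / "exercise_01_submission")
        print(archive.name)
```

Keep the archive outside the folder being packed, otherwise the zip file may end up containing itself.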

Material and Resources

The following literature is recommended background reading:

  • Dan Jurafsky/James H. Martin (2025). Speech and Language Processing. 3rd ed. draft of January 12, 2025. Prentice Hall. Available online here: https://web.stanford.edu/~jurafsky/slp3/

  • Christopher D. Manning/Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts and London, England: MIT Press. Selected chapters will be uploaded to Ilias.

  • Ian H. Witten/Eibe Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. Elsevier. Selected chapters will be uploaded to Ilias.

  • Melanie Andresen (2024). Computerlinguistische Methoden für die Digital Humanities. Narr Studienbücher. Selected chapters will be uploaded to Ilias.

In addition to this page (which is the central hub), we will make use of the following platforms:

  • Ilias, to provide you with non-public materials and to upload your solutions for the exercises
  • A Jupyter Server for running Python code on http://compute.spinfo.uni-koeln.de/ (only accessible from the university network or via VPN)
  • Klips, to register for the module exam

Topics and Schedule

Week 1

Week 2

Week 3

  • Tuesday, April 22: Exercise: Corpora
  • Thursday, April 24: Introduction to machine learning

Week 4

  • Tuesday, April 29: Regular Expressions
  • Thursday, May 1: Public holiday

Week 5

  • Tuesday, May 6: Canceled (DHCon)
  • Thursday, May 8: Evaluation in machine learning

Week 6

  • Tuesday, May 13: Exercise: Evaluation in machine learning
  • Thursday, May 15: Decision Trees

Week 7

  • Tuesday, May 20: Exercise: Decision Trees
  • Thursday, May 22: Naive Bayes

Week 8

  • Tuesday, May 27: Exercise: Naive Bayes
  • Thursday, May 29: Public holiday

Week 9

  • Tuesday, June 3: Guest lecture (tbd)
  • Thursday, June 5: Logistic Regression

Pentecost week (no classes)

Week 10

  • Tuesday, June 17: Exercise: Logistic Regression
  • Thursday, June 19: Public holiday

Week 11

  • Tuesday, June 24: Lecture: Neural Networks Part I
  • Thursday, June 26: Lecture: Neural Networks Part II

Week 12

  • Tuesday, July 1: Exercise: Neural Networks
  • Thursday, July 3: BERT

Week 13

  • Tuesday, July 8: Exercise: BERT
  • Thursday, July 10: Q&A session for the written exam

Week 14

  • Tuesday, July 15: Canceled
  • Thursday, July 17: Written exam