The lecture covers data-driven text analysis. It discusses supervised and unsupervised machine learning methods as well as the evaluation and assessment of quantitative results. The lecture takes a methodological look at various problems in language processing and discusses how they can be, and are, addressed. Most approaches can be understood on several levels, all of which will be covered: What is the idea or intuition? How can it be formalized, often with the help of mathematical models? How can the formal model be implemented (efficiently)? Where needed, the lecture also introduces the basics of the formal models and programming concepts involved.
Please note that the class is taught in German, while the material will (mostly) be in English. Questions in English during class are of course welcome.
Lecture and Exercise
The lecture (Thursdays) and the tutorial (Tuesdays) are closely related in content. Formally, they are two separate courses, namely "Computerlinguistik Übung" and "Sprachverarbeitung". If you do not want to or cannot attend both courses, you are strongly advised to consult the instructors. Please bring your own computer to the tutorial sessions.
Computational Linguistics Modules
Since the winter semester 2022/2023, we have had a new concept for the computational linguistics curriculum in the BA Informationsverarbeitung degree program.
- Module Grundlagen der Computerlinguistik (old study regulations: "Computerlinguistische Grundlagen")
  - Seminar Computerlinguistische Grundlagen (every winter semester, instructor Hermes, content: linguistic foundations, annotation)
  - Lecture Sprachverarbeitung (every summer semester, instructor Reiter, content: quantitative properties of language, machine learning)
  - Tutorial Sprachverarbeitung (every summer semester, instructor Pagel, accompanies the lecture, formerly Seminar II)
  - Module exam: written exam (every summer semester, 90 minutes; partial assessment possible in the winter semester, 30 minutes)
- Module Anwendungen der Computerlinguistik (old study regulations: "Angewandte Linguistische Datenverarbeitung")
  - Tutorial Deep Learning (every winter semester, instructor Nester, content: deep learning methods)
  - Advanced seminar Experimentelles Arbeiten in der Sprachverarbeitung (every winter semester, instructor Reiter, content: experiments in computational linguistics; where do progress and insight come from?)
  - Module exam: term paper with a computational linguistics experiment
Coursework (Studienleistung) and Module Examination (Modulprüfung)
We will start each exercise together in the Tuesday session. All exercises should then be finished at home. Five times during the semester, you need to upload your results via Ilias (as a zip file). There will be a written exam in the final week of the course.
Material and Resources
The following literature is recommended background reading:
- Dan Jurafsky/James H. Martin (2025). Speech and Language Processing. 3rd ed. draft of January 12, 2025. Prentice Hall. Available online: https://web.stanford.edu/~jurafsky/slp3/
- Christopher D. Manning/Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts and London, England: MIT Press. Selected chapters will be uploaded to Ilias.
- Ian H. Witten/Eibe Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. Elsevier. Selected chapters will be uploaded to Ilias.
- Melanie Andresen (2024). Computerlinguistische Methoden für die Digital Humanities. Narr Studienbücher. Selected chapters will be uploaded to Ilias.
In addition to this page (which is the central hub), we will make use of the following platforms:
- Ilias, to provide you with non-public materials and to upload your solutions for the exercises
- A Jupyter server for running Python code at http://compute.spinfo.uni-koeln.de/ (only accessible from the university network or via VPN)
- Klips, to register for the module exam
Topics and Schedule
Week 1
- Tuesday, April 8: Introduction, course overview, Python crash-course part I [Slides] [Exercise 01] [Solution01] [Solution01 Notebook]
- Thursday, April 10: Introduction to computational linguistics [Slides]
Week 2
- Tuesday, April 15: Python crash-course part II [Slides] [Exercise 02] [romeo.txt] [chars.tsv]
- Thursday, April 17: Corpora, corpus statistics, encoding
Week 3
- Tuesday, April 22: Exercise: Corpora
- Thursday, April 24: Introduction to machine learning
Week 4
- Tuesday, April 29: Regular Expressions
- Thursday, May 1: Public holiday
Week 5
- Tuesday, May 6: Canceled (DHCon)
- Thursday, May 8: Evaluation in machine learning
Week 6
- Tuesday, May 13: Exercise: Evaluation in machine learning
- Thursday, May 15: Decision Trees
Week 7
- Tuesday, May 20: Exercise: Decision Trees
- Thursday, May 22: Naive Bayes
Week 8
- Tuesday, May 27: Exercise: Naive Bayes
- Thursday, May 29: Public holiday
Week 9
- Tuesday, June 3: Guest lecture (tbd)
- Thursday, June 5: Logistic Regression
Pentecost week (no classes)
Week 10
- Tuesday, June 17: Exercise: Logistic Regression
- Thursday, June 19: Public holiday
Week 11
- Tuesday, June 24: Lecture: Neural Networks Part I
- Thursday, June 26: Lecture: Neural Networks Part II
Week 12
- Tuesday, July 1: Exercise: Neural Networks
- Thursday, July 3: BERT
Week 13
- Tuesday, July 8: Exercise: BERT
- Thursday, July 10: Round of questions for written exam
Week 14
- Tuesday, July 15: Canceled
- Thursday, July 17: Written exam