Sprachverarbeitung (Lecture and Tutorial)

The lecture covers topics in data-driven text analysis. It discusses supervised and unsupervised machine learning methods as well as the evaluation and assessment of quantitative results. The lecture takes a methodological look at various problems in language processing and discusses how they can be, and in practice are, addressed. Most approaches can be understood on several levels, all of which will be covered: What is the idea or intuition? How can it be formalized, for example with the help of mathematical models? How can the formal model finally be implemented (efficiently)? Where necessary, the basics of formal models or programming concepts are also covered in the lecture.

Lecture and Tutorial

The lecture (Thursdays) and the tutorial (Tuesdays) are closely related in content. Formally, they are two separate courses, namely "Computerlinguistik Übung" and "Sprachverarbeitung". If you cannot or do not want to attend both courses, you are strongly advised to consult the instructor. Please bring a computer to the tutorial.

Modules in Computational Linguistics

As of the winter semester 2022/2023, we have developed a new concept for the computational linguistics curriculum in the BA program Informationsverarbeitung.

Module Grundlagen der Computerlinguistik (old study regulations: "Computerlinguistische Grundlagen")
- Seminar Computerlinguistische Grundlagen (every winter semester, instructor Hermes, content: linguistic foundations, annotation)
- Lecture Sprachverarbeitung (every summer semester, instructor Reiter, content: quantitative properties of language, machine learning)
- Tutorial Sprachverarbeitung (every summer semester, instructor Reiter, accompanies the lecture, formerly Seminar II)
- Module exam: written exam (every summer semester, 90 minutes; a partial exam of 30 minutes is possible in the winter semester)
Module Anwendungen der Computerlinguistik (old study regulations: "Angewandte Linguistische Datenverarbeitung")
- Tutorial Deep Learning (every winter semester, instructor Nester, content: deep learning methods)
- Advanced seminar (Hauptseminar) Experimentelles Arbeiten in der Sprachverarbeitung (every winter semester, instructor Reiter, content: experiments in computational linguistics; where do progress and insight come from?)
- Module exam: term paper with a computational linguistics experiment

Coursework and Module Examination

There is an exercise every week. We start each exercise together in the Tuesday session, and it should then be finished at home. Three times during the semester, you need to upload your results to Ilias as a zip file.
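
Any archive tool can produce the zip file for upload. As one option, here is a minimal Python sketch that uses only the standard library. The function name `make_submission_zip` and the folder name in the usage comment are hypothetical, and the sketch assumes that no particular folder layout is required inside the archive.

```python
from pathlib import Path
import zipfile


def make_submission_zip(source_dir, zip_path):
    """Pack every regular file below source_dir into a zip archive at zip_path."""
    source = Path(source_dir)
    target = Path(zip_path).resolve()
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            # Skip directories, and skip the archive itself if it lies inside source_dir.
            if path.is_file() and path.resolve() != target:
                # Store paths relative to source_dir so the archive contains no absolute paths.
                archive.write(str(path), arcname=path.relative_to(source).as_posix())
    return target


# Usage (hypothetical folder name, adjust to your own files):
# make_submission_zip("exercise01", "exercise01.zip")
```

Storing relative paths keeps the archive independent of where the folder lives on your machine.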

Material and Resources

The following literature is recommended background reading:

Dan Jurafsky/James H. Martin (2023). Speech and Language Processing. 3rd ed. Draft of January 7, 2023. Prentice Hall. Available online: https://web.stanford.edu/~jurafsky/slp3/
Christopher D. Manning/Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts and London, England: MIT Press. Selected chapters will be uploaded to Ilias.
Ian H. Witten/Eibe Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. Elsevier. Selected chapters will be uploaded to Ilias.

In addition to this page (which is the central hub), we will make use of the following platforms:

Ilias, to provide you with non-public materials
GitHub, where I will occasionally provide code that you can download or clone
Klips, to register for the module exam
Spinfo servers, for which accounts will be distributed via Ilias
YouTube, where screencasts and occasional recordings will be uploaded to a playlist

Topics and Schedule

Week 1 (calendar week 14)

Tuesday, 4 April: Introductions, overview, course organization, introduction to the command line (Slides, Exercise)
Thursday, 6 April: Computational linguistics as a discipline, corpora (Slides, Handout)

Week 2 (calendar week 15)

Tuesday, 11 April: Linux command line, plain text files, command line corpus tools (Slides, Exercise, Reference Solution)
Thursday, 13 April: Zipf distribution, type-token ratio, frequent words, encoding (Slides)

Week 3 (calendar week 16)

Tuesday, 18 April: Regular expressions, concordances (Slides, Exercise, Reference Solution)
Thursday, 20 April: Basic probability theory, collocations (Slides)

Week 4 (calendar week 17)

Tuesday, 25 April: Processes, tmux, nano, and our first neural network (Slides, Exercise, Reference Solution)
Thursday, 27 April: Inferential statistics (Slides)

Week 5 (calendar week 18)

Tuesday, 2 May: cancelled
Thursday, 4 May: Language modeling (Slides)

Week 6 (calendar week 19)

Tuesday, 9 May: Data sets and file formats for machine learning (Slides, Exercise, Reference Solution)
Thursday, 11 May: Introduction to machine learning, evaluation of classification systems (Slides)

Week 7 (calendar week 20)

Tuesday, 16 May: Machine learning experiments with Weka (Slides, Exercise, Reference Solution)
Thursday, 18 May: cancelled (Ascension Day)

Week 8 (calendar week 21)

Tuesday, 23 May: Naive Bayes (Slides)
Thursday, 25 May: Decision trees (Slides)

Pentecost week (no classes, calendar week 22)

Week 9 (calendar week 23)

Tuesday, 6 June: Entity reference detection with Weka (Slides, Werther_train.arff, feature-table.pdf, Demo Exercise 7, Demo Weka)
Thursday, 8 June: cancelled (Corpus Christi)

Week 10 (calendar week 24)

Tuesday, 13 June: Entity reference detection with Weka, test phase (Werther_test.arff, Results Table, Decision tree implementation in Java)
Thursday, 15 June: Logistic regression (Slides)