Information extraction 2400-ZEWW974
The course will sequentially cover the following topics:
1. Introduction to text data processing:
tokenisation,
stemming,
lemmatisation,
stopwords,
n-grams,
TF, IDF, and TF-IDF metrics.
2. Rule-based and heuristic methods:
pattern-based rules: pattern matching, regular expressions, Finite State Machines,
dictionary-based rules: exact dictionary matching, token-based matching, Aho-Corasick algorithm, Levenshtein distance,
syntax-based rules: dependency parsing, constituency parsing, context-free grammar, part-of-speech tagging.
3. Corpus-based methods:
TF-IDF,
RAKE,
topic models: LSA, NMF, PLSA, LDA, BERTopic.
4. Graph-based methods:
PageRank,
TextRank,
LexRank,
HITS.
5. Machine learning-based methods:
Hidden Markov Models,
Conditional Random Fields.
6. Neural network-based methods:
introduction to deep learning,
word and sentence embeddings, large language models,
recurrent neural networks,
LSTM, BiLSTM,
convolutional neural networks,
graph neural networks.
7. Transformer-based methods:
BERT,
GPT,
SpanBERT, LUKE, LlamaIndex, LangChain.
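As an illustration of the TF, IDF, and TF-IDF metrics listed under topic 1, the following is a minimal sketch in plain Python. It is not taken from the course materials. The function names `tokenize` and `tf_idf` are hypothetical. The whitespace tokeniser and the unsmoothed weighting tf = count / document length, idf = ln(N / df), are assumptions chosen for brevity; libraries commonly use smoothed variants.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenisation, no stemming or stopword removal.
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * ln(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents in which each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


if __name__ == "__main__":
    for doc_scores in tf_idf(["the cat sat", "the dog barked"]):
        print(doc_scores)
```

Under this weighting, a term that occurs in every document (here "the") scores zero, while terms specific to one document receive a positive score.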
Estimated student workload:
Activity type C (contact hours) S (self-study hours)
lecture (classes) 0 0
exercises (classes) 30 30
exam 0 0
consultations 5 0
preparation for exercises 0 10
preparation for lectures 0 0
preparation for the mid-term test 0 0
preparation for the exam 0 0
… 0 0
Total 35 40 = 75
Type of course
Prerequisites (description)
Course coordinators
Learning outcomes
Students will learn how to prepare textual data for the purpose of extracting and structuring the information it contains. They will gain an understanding of the theoretical foundations behind the algorithms used for these tasks, as well as become familiar with the practical aspects of implementing them in code. By the end of the course, students will be able to automatically extract information from text, selecting appropriate methods based on the specific characteristics of the problem at hand. Additionally, they will be aware of current challenges and issues related to information extraction.
Assessment criteria
The final grade will be determined based on a take-home project (70% of the grade) and a project presentation (30% of the grade).
The assessment will be both written (project) and oral (project presentation).
Additional information
Additional information (registration calendar, class instructors, location and schedule of classes) may be available in the USOSweb system: