Prowadzony w cyklach: 2024Z, 2025Z

Kod Erasmus: 14.3

Kod ISCED: 0311

Punkty ECTS: 3

Język: angielski

Organizowany przez: Wydział Nauk Ekonomicznych

Topic modelling 2400-ZEWW878

Full description of the course
(max 65.000 characters)

1. Introductory matters (labs 1).
a. What is topic modelling?
b. What is the procedure for obtaining topics and drawing conclusions?
c. Example practical applications of topic modelling.
2. Collecting textual data for topic modelling (labs 2).
a. Review of web scraping and crawling techniques.
b. Most common technical issues.
c. Ethics and possible legal problems.
d. Review of Python libraries: Selenium and Beautiful Soup with example codes.
3. Textual data preprocessing (labs 3).
a. Tokenization.
b. Stemming.
c. Lemmatization.
d. Stopwords.
e. N-grams.
f. TermFrequence (TF).
g. Inverse Document Frequency (IDF).
h. TF-IDF.
4. Semantic topic modelling algorithms (labs 4-5).
a. Latent Semantic Analysis (LSA).
b. Non-Negative Matrix Factorization (NNMF).

5. Probabilistic topic modelling algorithms (labs 5-6).
a. Probabilistic Latent Semantic Analysis (PLSA).
b. Latent Dirichlet Allocation (LDA).

6. Measures of models’ performance (labs 7).
a. Topic coherence.
b. Perplexity.
c. Optimisation of models’ hyperparameters.

7. BERTopic algorithm (labs 8).

8. Supervised topic models (labs 9).
a. Supervised LDA (sLDA).
b. Making predictions with the BERTopic algorithm.

9. Hierarchical topic models (labs 10).
a. Hierarchical Dirichlet Process.
b. Hierarchical LDA (hLDA).
c. ‘Hierarchical’ BERTopic.

10. Time series analysis of the topic model’s output (labs 11).

11. Correlated topic models (labs 12).
a. Correlated Topic Model (CTM).
b. Pachinko Allocation Model (PAM).

12. Dynamic topic models (labs 13).

13. Students’ presentations (labs 14-15).

W cyklu 2025Z:

W ramach kursu omówione zostaną kolejno następujące zagadnienia:

1.Wprowadzenie do przetwarzania danych tekstowych:
a. tokenizacja,
b. stemming,
c. lemmatyzacja,
d. stopwords,
e. n-gramy,
f. metryki TF, IDF, TF-IDF.

2. Uwagi ogóle dot. modelowania tematów:
a. miary jakości modeli: topic coherence, topic similarity, topic diversity, perplexity, PMI, kryterium OpTop, singular BIC,
b. optymalizacja hiperparametrów modeli,
c. sposoby prezentacji wyników modeli.

3. Semantyczne modele tematów:
a. Latent Semantic Analysis,
b. Non-negative Matrix Factorisation.

4. Probabilistyczne modele tematów:
a. Probabilistic Latent Semantic Analysis,
b. Latent Dirichlet Allocation.

5. Algorytm BERTopic:
a. wprowadzenie do uczenia głębokiego,
b. osadzenia słów i zdań, duże modele językowe,
c. BERT,
d. UMAP,
e. HDBSCAN,
f. seed words
g. zero-shot model.

6. Hierarchiczne modele tematów:
a. Hierarchical Dirichlet Process,
b. Hierarchical Latent Dirichlet Allocation,
c. hierarchiczne podejście w BERTopic.

7. Skorelowane modele tematów:
a. Correlated Topic Model,
b. Pachinko Allocation Model,
c. ‘korelacje’ między tematami w BERTopic.

8. Dynamiczne modele tematów:
a. Dynamic Topic Model,
b. Dynamic BERTopic.

9. Nadzorowane modele tematów:
a. Supervised Latent Dirichlet Allocation,
b. nadzorowany BERTopic,
c. inne algorytmy.

Szacunkowy nakład pracy studenta:
Typ aktywności K (kontaktowe) S (samodzielne)
wykład (zajęcia) 0 0
ćwiczenia (zajęcia) 30 30
egzamin 0 0
konsultacje 5 0
przygotowanie do ćwiczeń 0 10
przygotowanie do wykładów 0 0
przygotowanie do kolokwium 0 0
przygotowanie do egzaminu 0 0
… 0 0
Razem 35 40 = 75

Rodzaj przedmiotu

nieobowiązkowe

Koordynatorzy przedmiotu

Maciej Świtała

Efekty kształcenia

Learning outcomes Students will learn how to collect textual data and prepare it for further analysis. Also, they will get to know the theoretical basis of various topic modelling algorithms. Students will be able to build different topic models depending on the issue they face. Furthermore, they will know how to measure a model's performance and compare it between different algorithms applied. FInally, at the end of the course students will be aware of current topic modelling challenges and problems.

Kryteria oceniania

Methods and criteria of evaluation Final grade is to be established based on points obtained for preparing a home-taken project (80%) and its presentation (20%).

Literatura

Literature Compulsory:

Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of machine learning research, 3(Jan), 993-1022.

Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.

Kherwa, P., & Bansal, P. (2020). Topic modelling: a comprehensive review. EAI Endorsed transactions on scalable information systems, 7(24).

Ramos, J. (2003). Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning (Vol. 242, No. 1, pp. 29-48).

Stevens, K., Kegelmeyer, P., Andrzejewski, D., & Buttler, D. (2012). Exploring topic coherence over many models and many topics. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 952-961).
Additional:

Aletras, N., & Stevensson, M. (2013). Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers (pp. 13-22).

Bahl, L. R., Jelinek, F., & Mercer, R. L. (1983). A maximum likelihood approach to continuous speech recognition. IEE transactions on pattern analysis and machine intelligence, (2), 179-190.

Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. The annals of applied statistics, 1(1), 17-35.
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J., & Blei, D. (2009). Reading tea leaves: How humans interpret topic models. Advances in neural information processing systems, 22.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6), 391-407.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 50-57).

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory and acquisition induction, and representation of knowledge. Psychological review, 104(2), 211.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse processes, 25(2-3), 259-284.

Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 conference on empirical methods in natural language processing (pp. 262-272).

Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics (pp. 100- 108).

Roder, M., Both, A., & Hinnenburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the eight ACM international conference on Web search and data mining (pp. 399-408).

Wang, Y. X., & Zhang, Y. J. (2012). Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering, 25(6), 1336-1353.

W cyklu 2025Z:

Podstawowa:

Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of machine learning research, 3(Jan), 993-1022. Retrieved from: dl.acm.org/doi/10.5555/944919.944937.
Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint, DOI: 10.48550/arXiv.2203.05794.
Kherwa, P., & Bansal, P. (2020). Topic modelling: a comprehensive review. EAI Endorsed transactions on scalable information systems, 7(24), DOI: 10.4108/eai.13-7-2018.159623.

Uzupełniająca:

Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning (pp. 113-120), DOI: 10.1145/1143844.1143859.
Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. The annals of applied statistics, 1(1), 17-35, DOI: 10.1214/07-AOAS114.
Bystrov, V., Naboka-Krell, V., Staszewska-Bystrova, A., & Winker, P. (2024). Choosing the number of topics in LDA Models–a Monte Carlo comparison of selection criteria. Journal of Machine Learning Research, 25(79), 1-30. Retrieved from: jmlr.org/papers/v25/23-0188.html.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint, DOI: 10.48550/arXiv.1810.04805.
Griffiths, T., Jordan, M., Tenenbaum, J., & Blei, D. (2003). Hierarchical topic models and the nested Chinese restaurant process. Advances in neural information processing systems, 16. Retrieved from: proceedings.neurips.cc/paper/2003/hash/7b41bfa5085806dfa24b8c9de0ce567f-Abstract.html.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 50-57), DOI: dl.acm.org/doi/pdf/10.1145/312624.312649.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato\’s problem: The latent semantic analysis theory and acquisition induction, and representation of knowledge. Psychological review, 104(2), 211, DOI: 10.1037/0033-295X.104.2.211.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse processes, 25(2-3), 259-284, DOI: 10.1080/01638539809545028.
McAuliffe, J., & Blei, D. (2007). Supervised topic models. Advances in neural information processing systems, 20. Retrieved from: proceedings.neurips.cc/paper/2007/file/d56b9fc4b0f1be8871f5e1c40c0067e7-Paper.pdf.
Paatero, P., & Tapper, U. (1994). Positive matrix factorization: A non‐negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2), 111-126, DOI: 10.1002/env.3170050203.
Reimers, N., & Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint, DOI: 10.48550/arXiv.1908.10084.
Tunstall, L., Von Werra, L., & Wolf, T. (2022). Natural language processing with transformers. Building Language Applications with Hugging Face. 1st Edition. O'Reilly Media.
Wang, Y. X., & Zhang, Y. J. (2012). Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering, 25(6), 1336-1353, DOI: 10.1109/TKDE.2012.51.

Więcej informacji

Dodatkowe informacje (np. o kalendarzu rejestracji, prowadzących zajęcia, lokalizacji i terminach zajęć) mogą być dostępne w serwisie USOSweb:

Strona przedmiotu 2400-ZEWW878 w USOSweb