Topic modelling 2400-ZEWW878
Full description of the course
1. Introductory matters (labs 1).
a. What is topic modelling?
b. What is the procedure for obtaining topics and drawing conclusions?
c. Example practical applications of topic modelling.
2. Collecting textual data for topic modelling (labs 2).
a. Review of web scraping and crawling techniques.
b. Most common technical issues.
c. Ethics and possible legal problems.
d. Review of Python libraries: Selenium and Beautiful Soup with example codes.
3. Textual data preprocessing (labs 3).
a. Tokenization.
b. Stemming.
c. Lemmatization.
d. Stopwords.
e. N-grams.
f. Term Frequency (TF).
g. Inverse Document Frequency (IDF).
h. TF-IDF.
4. Semantic topic modelling algorithms (labs 4-5).
a. Latent Semantic Analysis (LSA).
b. Non-Negative Matrix Factorization (NMF).
5. Probabilistic topic modelling algorithms (labs 5-6).
a. Probabilistic Latent Semantic Analysis (PLSA).
b. Latent Dirichlet Allocation (LDA).
6. Measures of models’ performance (labs 7).
a. Topic coherence.
b. Perplexity.
c. Optimisation of models’ hyperparameters.
7. BERTopic algorithm (labs 8).
8. Supervised topic models (labs 9).
a. Supervised LDA (sLDA).
b. Making predictions with the BERTopic algorithm.
9. Hierarchical topic models (labs 10).
a. Hierarchical Dirichlet Process.
b. Hierarchical LDA (hLDA).
c. ‘Hierarchical’ BERTopic.
10. Time series analysis of the topic model’s output (labs 11).
11. Correlated topic models (labs 12).
a. Correlated Topic Model (CTM).
b. Pachinko Allocation Model (PAM).
12. Dynamic topic models (labs 13).
13. Students’ presentations (labs 14-15).
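As an illustration of items 3f to 3h, the sketch below computes TF-IDF weights in plain Python. It is a minimal example under stated assumptions, not the course's reference implementation: it assumes the unsmoothed textbook variant, with TF as a term's count divided by document length and IDF as ln(N / df). Library implementations often apply smoothing or normalisation, so their numbers will differ. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, already tokenized by whitespace (no stemming or stopword removal).
corpus = [
    "topic models find topics".split(),
    "topic models need text".split(),
    "web scraping collects text".split(),
]
scores = tf_idf(corpus)
```

With this variant, a term that occurs in every document gets weight zero, while a term confined to a single document (here "find" or "web") gets the highest weight in that document.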
Type of course
Course coordinators
Term 2024Z: | Term 2023Z: |
Learning outcomes
Students will learn how to collect textual data and prepare it for further analysis. They will also get to know the theoretical basis of various topic modelling algorithms. Students will be able to build different topic models depending on the problem they face. Furthermore, they will know how to measure a model's performance and compare it across the algorithms applied. Finally, by the end of the course students will be aware of current challenges and problems in topic modelling.
Assessment criteria
The final grade is based on points obtained for a take-home project (80%) and its presentation (20%).
Bibliography
Compulsory:
Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of machine learning research, 3(Jan), 993-1022.
Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
Kherwa, P., & Bansal, P. (2020). Topic modelling: a comprehensive review. EAI Endorsed transactions on scalable information systems, 7(24).
Ramos, J. (2003). Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning (Vol. 242, No. 1, pp. 29-48).
Stevens, K., Kegelmeyer, P., Andrzejewski, D., & Buttler, D. (2012). Exploring topic coherence over many models and many topics. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 952-961).
Additional:
Aletras, N., & Stevenson, M. (2013). Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers (pp. 13-22).
Bahl, L. R., Jelinek, F., & Mercer, R. L. (1983). A maximum likelihood approach to continuous speech recognition. IEEE transactions on pattern analysis and machine intelligence, (2), 179-190.
Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. The annals of applied statistics, 1(1), 17-35.
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J., & Blei, D. (2009). Reading tea leaves: How humans interpret topic models. Advances in neural information processing systems, 22.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6), 391-407.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 50-57).
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104(2), 211.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse processes, 25(2-3), 259-284.
Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 conference on empirical methods in natural language processing (pp. 262-272).
Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics (pp. 100-108).
Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining (pp. 399-408).
Wang, Y. X., & Zhang, Y. J. (2012). Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering, 25(6), 1336-1353.
Additional information
Additional information (registration calendar, class instructors, location and schedule of classes) might be available in the USOSweb system.