Large-scale machine learning 1000-319bBML
-Distributing computation to clusters of commodity machines and distributed file systems.
-MapReduce model and basic algorithmic techniques for this model. Comparison of MapReduce algorithms with standard algorithms for typical problems (matrix multiplication, multi-way join, counting triangles in large graphs).
-Total vs elapsed communication cost. Skew and methods to deal with it.
-Spark and Resilient Distributed Dataset model.
-Spark SQL and its optimizations.
-Serialization of big data and columnar formats.
-Managed cloud data warehouse.
-Algorithms for stream processing.
-Distributing typical machine learning algorithms, e.g., linear regression, clustering, decision trees or neural networks.
-Neural networks at large scale (data parallelism, model parallelism).
-Learned index structures.
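As an illustration of the MapReduce topics listed above, the following is a minimal single-process sketch of one-pass matrix multiplication in the MapReduce style. It is not taken from the course materials: the function names (`map_matrices`, `shuffle`, `reduce_cell`, `mapreduce_matmul`) and the sparse dictionary representation are assumptions made for this example, and the shuffle step is simulated in memory rather than run on a cluster.

```python
from collections import defaultdict


def map_matrices(a, b, n_rows_a, n_cols_b):
    """Map step: send every matrix element to each output cell that needs it.

    a and b are sparse matrices stored as {(row, col): value} (an assumed format).
    """
    for (i, j), value in a.items():
        # a[i][j] contributes to every cell (i, k) of the product.
        for k in range(n_cols_b):
            yield (i, k), ("A", j, value)
    for (j, k), value in b.items():
        # b[j][k] contributes to every cell (i, k) of the product.
        for i in range(n_rows_a):
            yield (i, k), ("B", j, value)


def shuffle(pairs):
    """Simulated shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_cell(values):
    """Reduce step: match A and B entries on the shared index j and sum the products."""
    a_by_j = {j: v for tag, j, v in values if tag == "A"}
    b_by_j = {j: v for tag, j, v in values if tag == "B"}
    return sum(a_by_j[j] * b_by_j[j] for j in a_by_j.keys() & b_by_j.keys())


def mapreduce_matmul(a, b, n_rows_a, n_cols_b):
    """Run map, shuffle, and reduce in sequence and return the sparse product."""
    groups = shuffle(map_matrices(a, b, n_rows_a, n_cols_b))
    return {key: reduce_cell(values) for key, values in groups.items()}


# Small 2x2 example.
a = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
b = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
product = mapreduce_matmul(a, b, n_rows_a=2, n_cols_b=2)
print(product)
```

In this sketch every element of the first matrix is replicated once per column of the second matrix (and vice versa), which is the kind of communication cost the course topics on total vs elapsed communication cost refer to.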
Type of course
Requirements
Prerequisites (description)
Course coordinators
Term 2024Z: | Term 2023Z: |
Assessment criteria
The final mark is based on large programming assignments, points for participation in laboratories, and a written exam.
Bibliography
-Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press
-Guglielmo Iozzia, Hands-On Deep Learning with Apache Spark, Packt Publishing
-Butch Quinto, Next-Generation Machine Learning with Spark: Covers XGBoost, LightGBM, Spark NLP, Distributed Deep Learning with Keras, and More, Apress
Additional information
Information on the level of this course, the year of study and semester when the course unit is delivered, and the types and number of class hours can be found in the course structure diagrams of the appropriate study programmes. This course is related to the following study programmes:
- Bachelor's degree, first cycle programme, Computer Science
- Master's degree, second cycle programme, Computer Science
Additional information (registration calendar, class instructors, location and schedule of classes) might be available in the USOSweb system: