Practical distributed systems 1000-2M21PRS
Handling many requests per second within a tight time limit is a real challenge. In a distributed system that additionally has to collect large amounts of data, many architectural and implementation problems must be resolved.
This course describes problems and issues related to the implementation of large-scale distributed systems and is based on experience gained from actual implementations of such systems. We will discuss the practical aspects of building high-throughput systems that process petabytes of data daily in geographically dispersed data centers, as well as common problems and decisions related to the maintenance and development of such systems. We will look at techniques for effective data exchange between system components and at issues related to the storage and processing of large amounts of data.
We will also cover the practical aspects of organizing infrastructure that supports machine learning in the realities of large-scale systems.
The aim of the laboratory classes will be to create a working distributed system whose purpose will be to handle large amounts of traffic. The system will be built incrementally: during the first class, the foundations of the system will be created based on minimal requirements, and in subsequent classes further requirements will be presented that will make it necessary for students to extend the system.
1. Introduction: requirements and trade-offs in large-scale systems, example architectures.
2. Scalability of systems, division into data centers, high availability, load-balancing, network infrastructure.
3. Implementation of applications in a distributed environment (Docker, Kubernetes), management of hardware infrastructure (Puppet, Ansible).
4. Monitoring the health of large-scale systems (Graphite, Grafana, Icinga).
5. Effective communication between components of distributed systems (Kafka); a short producer sketch follows this list.
6. Data stream processing methods (Kafka Streams, Kafka Workers); a stream-filtering sketch follows this list.
7. Data storage and data synchronization (polyglot persistence; NoSQL databases: Aerospike, FoundationDB, Cassandra, Memcached).
8. Management of data structures and schemas in distributed systems (Avro, Schema Registry); a schema-evolution sketch follows this list.
9. Organization and implementation of infrastructure supporting machine learning and data analysis in a distributed systems environment (MapReduce, Spark, distributed file systems such as HDFS); a word-count sketch follows this list.
10. Cloud solutions - Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Serverless approaches, advantages and disadvantages, hybrid model, cost analysis, sample use cases (Google Compute Engine, BigQuery, Cloud Storage).
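To give a flavour of topic 5, below is a minimal sketch of a Kafka producer in Java. It is only an illustration under assumed names: the broker address (localhost:9092), the topic name (user-events), the key and the message contents are placeholders invented for this example and are not part of the course materials.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventProducer {
    public static void main(String[] args) {
        // Minimal producer configuration; the broker address is a placeholder.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // wait for all in-sync replicas for durability

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key land in the same partition,
            // so per-key ordering is preserved between components.
            producer.send(new ProducerRecord<>("user-events", "user-42", "{\"action\":\"click\"}"));
            producer.flush();
        }
    }
}
```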
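Topic 6 concerns stream processing; the sketch below shows a minimal Kafka Streams topology that filters one topic into another. The application id and topic names are hypothetical; the point is how such an application scales by simply starting more instances with the same application id.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ClickFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-filter"); // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Topology: read "user-events", keep only click events, write them to "clicks".
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("user-events");
        events.filter((key, value) -> value != null && value.contains("\"action\":\"click\""))
              .to("clicks");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // Scaling out means starting more instances with the same application id;
        // input partitions are rebalanced across them automatically.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```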
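For topic 8, the sketch below illustrates one common way of keeping Avro schemas backward compatible: a new field is added with a default value, so readers using the new schema can still decode records written with the old one. The UserEvent schema and its fields are invented for this illustration.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroCompatibilityDemo {
    // Hypothetical version 2 of a schema: "durationMs" was added with a default,
    // so records written without it can still be read with this schema.
    private static final String SCHEMA_V2 = "{"
        + "\"type\":\"record\",\"name\":\"UserEvent\",\"fields\":["
        + "{\"name\":\"userId\",\"type\":\"string\"},"
        + "{\"name\":\"action\",\"type\":\"string\"},"
        + "{\"name\":\"durationMs\",\"type\":\"long\",\"default\":0}"
        + "]}";

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(SCHEMA_V2);
        GenericRecord event = new GenericData.Record(schema);
        event.put("userId", "user-42");
        event.put("action", "click");
        event.put("durationMs", 120L);
        System.out.println(event); // prints the record in a JSON-like form
    }
}
```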
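Topic 9 touches on MapReduce-style computation; as a rough sketch, the Spark job below counts words in a text file using a map phase (split into (word, 1) pairs) and a reduce phase (sum per key). The HDFS input and output paths are placeholders, and the cluster master is assumed to be supplied by spark-submit.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // Master URL and resources are provided externally (e.g. by spark-submit).
        SparkConf conf = new SparkConf().setAppName("word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input");
            JavaPairRDD<String, Long> counts = lines
                // "Map" phase: split each line into (word, 1) pairs.
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1L))
                // "Reduce" phase: sum the counts per word after the shuffle.
                .reduceByKey(Long::sum);
            counts.saveAsTextFile("hdfs:///data/output");
        }
    }
}
```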
Type of course
Mode
Course coordinators
Learning outcomes
Knowledge:
1. Knows the issues related to reliability engineering in the realities of large systems.
2. Knows the methods and tools supporting the implementation of applications in a distributed environment.
3. Knows the issues related to the containerization of applications.
4. Knows the methods and tools for monitoring the health of infrastructure and applications.
5. Knows use cases and the architecture of pub/sub systems (Kafka).
6. Knows libraries for processing data streams from the Kafka ecosystem.
7. Knows the types of popular non-relational databases and their applications in large-scale distributed systems.
8. Knows issues related to the versioning and compatibility of data structures used in distributed systems.
9. Knows the basic aspects of storing large data sets and optimizing their processing.
10. Has knowledge of the complexity analysis of distributed algorithms and big data processing algorithms.
11. Knows the principles of designing algorithms consistent with the MapReduce paradigm.
12. Understands the advantages and disadvantages of using cloud services in the architecture of a distributed system.
Skills:
1. Can design the architecture of a large-scale distributed system.
2. Can make sound decisions about the necessary trade-offs when designing distributed systems.
3. Can configure the basic elements of the system responsible for maintaining high reliability.
4. Can create containerized applications.
5. Can configure a central event log in a distributed environment.
6. Can configure the collection of basic resource usage metrics on servers.
7. Can configure a Kafka cluster and implement communication using it.
8. Can implement a data stream processing application that can be conveniently and dynamically scaled.
9. Can select an appropriate non-relational data store for a given class of problems in a large-scale distributed system.
10. Can express problems in distributed computing models such as MapReduce.
11. Can diagnose bottlenecks in distributed algorithms for data processing.
Assessment criteria
The final grade is awarded on the basis of the project developed during the semester.
Bibliography
1. Site Reliability Engineering - Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
2. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems - Martin Kleppmann
3. Fundamentals of Software Architecture: An Engineering Approach - Mark Richards, Neal Ford
4. Making Sense of Stream Processing - Martin Kleppmann
5. Kafka: The Definitive Guide - Neha Narkhede, Gwen Shapira, Todd Palino
6. Microservices: Up and Running: A Step-By-Step Guide to Building a Microservices Architecture - Ronnie Mitra, Irakli Nadareishvili
Additional information
Information on the level of this course, the year of study and semester in which the course unit is delivered, and the types and number of class hours can be found in the course structure diagrams of the appropriate study programmes. This course is related to the following study programmes:
- Bachelor's degree, first cycle programme, Computer Science
- Master's degree, second cycle programme, Computer Science
Additional information (registration calendar, class instructors, location and schedule of classes) may be available in the USOSweb system: