Data engineering 1000-2M23DE

The course starts from the basics of data engineering and shows how a data engineering system differs from something like a personal blog or an e-commerce platform, briefly defining the areas where data engineering approaches make sense. It then gives an overview of file formats and why they matter, decomposes the general idea of a database, and describes ways to store data in the system. It demonstrates how to implement processing tasks in the context of a data pipeline and introduces tooling for orchestrating independent tasks into a single pipeline. In addition, it describes tools such as queues, which connect the elements of a data engineering system with each other and with elements outside it.

1. Introduction, MAD, MDS, Data Engineering life cycle, sources of information and self-education
2. Evolution of Data Engineering, Lambda architecture, Kappa architecture, cloud native, storage and compute separation
3. Source systems
4. Data modelling, transformation, DAG, Spark
5. Data warehouse, data lake, lakehouse
6. Data governance, Data Hub
7. Streams vs queues, Spark, Pulsar
8. Decomposition, orchestration, Prefect
9. Consumers, Superset
10. Quality, security, observability
11. Data Engineering architecture and the people we work with
12. Project demo
13. Summary

Course coordinators

Yura Braiko

Type of course

monographic

Mode of delivery

remote

Learning outcomes

An understanding of the basic principles behind most data processing tasks and of the mechanics of modern tools.

Assessment criteria

- Lab projects.
- If an LMS is used: topic assessments, including peer-reviewed assessments.
- Final project.

Literature

1. Designing Data-Intensive Applications. A must-read book (and worth rereading).
2. Data Engineering with Apache Spark, Delta Lake, and Lakehouse: Create scalable pipelines that ingest, curate, and aggregate complex data in a timely and secure way. A book for gaining structured knowledge of the Databricks family of tools.
3. Kafka in Action. For those who don't want to read the docs. It will be out of date in 1-2 years, but for now it is good for building intuition.
4. The Log: What every software engineer should know about real-time data's unifying abstraction. A must-read article (yes, it is fine that it is from 2013), published on a blog that is worth reading in general: https://engineering.linkedin.com/blog/topic/distributed-systems.
5. How to beat the CAP theorem.
6. Questioning the Lambda Architecture.
7. How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh.
8. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics.
9. Towards Data Science. A source of news, good for beginners.
10. https://medium.com/the-prefect-blog. Many articles that are good reading for beginners (e.g. https://medium.com/the-prefect-blog/are-you-an-accidental-data-engineer-6b60e0f51286); you can skip everything related directly to Prefect.

Additional information

Additional information about the level of this course, the year of study (and/or semester) in which it takes place, and the types and number of class hours can be found in the study plans of the relevant programmes. This course is related to the following programmes: