Big data processing and cluster computing 1000-218bPDD
1. Hadoop Distributed File System (HDFS)
2. MapReduce model
3. Basic algorithmic techniques for the MapReduce model and methods for analysing such algorithms, illustrated with typical examples (matrix-vector multiplication; multiway joins; sorting, ranking and perfect splitting; triangle counting in large graphs); a short sketch of the first example appears after this topic list
- computation vs communication cost
- total vs elapsed communication cost
- methods for limiting reducer memory
- methods for combating skew
4. Methods for efficient and portable data serialization (e.g. Avro)
5. Cloud platforms: Amazon, Google, Microsoft, IBM
6. Distributed processing of large graphs (BSP and Pregel models)
7. Examples of the most important problems in large-graph processing, e.g. PageRank and community detection (a short PageRank sketch is given after this topic list)
8. Spark and Resilient Distributed Dataset
9. Columnar data formats (e.g. Parquet)
10. Spark SQL and the Catalyst optimizer (topics 8-10 are illustrated by the Parquet/Spark SQL sketch after this list)
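
The following is a minimal illustrative sketch of topic 3, matrix-vector multiplication in the MapReduce style, written here with PySpark's RDD API (a map phase followed by reduceByKey). The toy matrix, toy vector and application name are assumptions made only for this example, not part of the course materials; the vector is assumed small enough to broadcast to every worker.

    # Minimal sketch: matrix-vector multiplication in the MapReduce style.
    # The sparse matrix is a set of (row, col, value) triples; the vector is
    # assumed small enough to broadcast. All data below is a toy example.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("matvec-sketch").getOrCreate()
    sc = spark.sparkContext

    matrix = sc.parallelize([(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0), (2, 0, 4.0)])
    v = sc.broadcast([1.0, 2.0, 3.0])

    # Map phase: emit (row, m_ij * v_j); reduce phase: sum partial products per row.
    result = (matrix
              .map(lambda t: (t[0], t[2] * v.value[t[1]]))
              .reduceByKey(lambda a, b: a + b))

    print(sorted(result.collect()))  # [(0, 5.0), (1, 6.0), (2, 4.0)]
    spark.stop()

Broadcasting the whole vector touches the reducer-memory considerations from the lecture: when the vector does not fit in memory, the standard remedy is to split it into stripes and join each stripe with the matching block of matrix columns instead.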
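
Below is a minimal sketch of topic 7, PageRank expressed with iterative RDD transformations. The three-node toy graph, the damping factor 0.85 and the fixed number of iterations are illustrative assumptions, not prescribed by the course.

    # Minimal sketch: PageRank on a toy three-node graph using RDD joins.
    # Damping factor 0.85 and 10 iterations are illustrative choices.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pagerank-sketch").getOrCreate()
    sc = spark.sparkContext

    # Adjacency lists: node -> list of outgoing neighbours.
    links = sc.parallelize([("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]).cache()
    ranks = links.mapValues(lambda _: 1.0)

    for _ in range(10):
        # Each node sends rank / out-degree to every neighbour ...
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
        # ... and the new rank is a damped sum of the received contributions.
        ranks = (contribs.reduceByKey(lambda a, b: a + b)
                         .mapValues(lambda s: 0.15 + 0.85 * s))

    print(sorted(ranks.collect()))
    spark.stop()

In the BSP/Pregel view of the same computation, each iteration of the loop corresponds to one superstep in which vertices exchange messages along edges.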
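
Finally, a minimal sketch of topics 8-10: building a DataFrame, writing it in the columnar Parquet format and querying it back through Spark SQL, where the Catalyst optimizer plans the query. The path /tmp/people.parquet, the table contents and the column names are illustrative assumptions.

    # Minimal sketch: DataFrame -> Parquet -> Spark SQL query.
    # The output path and table contents are toy examples.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-sql-sketch").getOrCreate()

    df = spark.createDataFrame(
        [(1, "alice", 34), (2, "bob", 51), (3, "carol", 29)],
        ["id", "name", "age"],
    )

    # Columnar storage: a query touching only some columns reads only those columns.
    df.write.mode("overwrite").parquet("/tmp/people.parquet")

    people = spark.read.parquet("/tmp/people.parquet")
    people.createOrReplaceTempView("people")

    query = spark.sql("SELECT name FROM people WHERE age > 30")
    query.show()
    query.explain()  # prints the physical plan produced by Catalyst
    spark.stop()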
Type of course
Course coordinators
Learning outcomes
Knowledge:
1. Understands the MapReduce model and knows how to use it to solve basic problems such as relational algebra operations or matrix-vector multiplication (K_W01)
2. Has knowledge about the complexity of distributed algorithms and of algorithms for big data processing (K_W01, K_W02)
3. Has knowledge about basic algorithmic techniques for big data processing, such as minimal algorithms (K_W01)
4. Has knowledge about the main available cloud infrastructures (K_W06)
5. Has knowledge about techniques for data serialization (K_W01)
Skills:
1. Can analyse the complexity of big data algorithms and compare such algorithms, and can choose the right algorithm for a given use case (KU_01)
2. Can express solutions to problems in the most important models of big data computation, such as MapReduce (KU_02, KU_04)
3. Can diagnose bottlenecks in big data algorithms (KU_07)
4. Can use frameworks like Hadoop and Spark (KU_08)
5. Can serialize/deserialize data in row-oriented and columnar frameworks like Avro and Parquet (KU_08)
6. Can configure a cluster with Hadoop and Spark (KU_08)
7. Can run processing tasks on cloud infrastructure (KU_08)
8. Can follow tutorials on big data processing topics (KU_15)
Competences:
1. Knows the most important libraries with big data algorithms like Spark GraphX, Spark MLlib and Apache Mahout (K_K01)
2. Can diagnose problems and find their solutions on Internet community portals such as Stack Overflow (K_K02)
Assessment criteria
The lab is graded based on large programming assignments and points for in-class work. To be admitted to the first-term exam, one needs to obtain at least half of the possible lab points. Programming assignments submitted after the deadline receive a penalty, or are not graded at all if the delay is too long. The first-term grade is based on the combined lab and exam points. The second-term grade is based on exam points only.
For PhD students there is an additional requirement to read and present one of the current research papers on topics related to the lecture (the choice of paper needs to be accepted by the lecturer).
Bibliography
- Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, Morgan & Claypool Publishers, 2010
- Mining of Massive Datasets, Anand Rajaraman and Jeffrey David Ullman, Cambridge University Press, 2011
- Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale, 4th Edition, Tom White, O'Reilly Media, 2015
Additional information
Information on the level of this course, the year of study and semester in which the course unit is delivered, and the types and number of class hours can be found in the course structure diagrams of the appropriate study programmes. This course is related to the following study programmes:
- Bachelor's degree, first cycle programme, Computer Science
- Master's degree, second cycle programme, Computer Science
Additional information (registration calendar, class instructors, location and schedules of classes) may be available in the USOSweb system: