Big data mining and processing 1000-2M13DZD

The course topics can be divided into the following sections:

1. Overview of selected methods used in machine learning and data analysis (e.g. rule induction, feature selection, and cluster analysis on the one hand, and XGBoost, SVM, and various neural network architectures on the other; we assume that participants are already partially familiar with such methods) in terms of problems concerning partially distributed and big data.

2. Discussion of selected methods in machine learning and data analysis in terms of understanding their results as solutions to optimization problems formulated on the input data. In particular, the complexity of these problems in the case of big data, where we tend to favour heuristic or randomized algorithms, or solutions provided by artificial intelligence (evolutionary algorithms, simulated annealing, etc.). A more general discussion of the connections between machine learning (ML) and artificial intelligence (AI), noting that AI and ML are not identical, but that each domain is crucial to the other.

3. Discussion of typical IT system scenarios, distinguished by the type of data flow, and the related challenges for machine learning and data analysis, e.g. variants of performing local operations during data processing before the data are registered in a fully scalable infrastructure.

4. Integration of machine learning and data analysis methods with other systems, such as databases (both SQL and NoSQL) or Business Intelligence systems. Two categories: a) ML functions called at the level of interfaces (e.g. SQL functions); b) ML algorithms that use such interfaces (e.g. scripts that compute ML models based on automatically generated analytical SQL queries).

5. Discussion of various approaches to data and model compactification aimed at improving machine learning and data analysis performed on big data (including streaming data and high-dimensional data). Compactification may include: a) data compression and quantization; b) particular implementations of machine learning and knowledge discovery models based on approximate computing (e.g. by sampling data); c) dimensionality reduction, feature selection and extraction, and simplifying models by manually reducing parameters. Moreover, hybrid scenarios, e.g. tuning hyperparameters under stricter compactification (which is faster) and then learning the final model more carefully.

6. Discussion of problems related to cleaning large, multi-modal, heterogeneous, multidimensional data for the purposes of machine learning and data analysis. Scenarios in which data, although big, are not suitable for learning models (e.g. they lack labels for the concepts, situations, or objects we are interested in) and where we need to launch appropriate processes to make learning possible, e.g. interactive search for representative examples in data repositories (such processes may differ depending on how much expert knowledge the labelling requires). Also scenarios where the training data contain errors (junk data), e.g. in measurements or labels, which may make the learned models less accurate.

7. Challenges related to maintaining the effectiveness of models obtained using machine learning and data analysis, seen as components of a larger IT system. Taking care of the processes of fine-tuning and training models on new data, which can be designed in different ways depending on the size of the data, the dynamics of its growth, and the speed at which models must be used and adjusted in various business application scenarios. Diagnostics of models with respect to the errors they make, using, among others, explanation and visualization techniques.

8. Practical application scenarios for data discovery processes, including setting analytical goals, data preprocessing, and applying machine learning and data analysis methods. Examples of such implementations related to big data competitions organized online (e.g. on Knowledge Pit), including cooperation with competition sponsors, providing feasible data for the competition (data anonymization, maintaining data quality, connection with the problem solved during the competition), and implementing as prototypes those solutions that turned out to be successful in the competition but may still require work.

Main fields of studies for MISMaP

computer science
mathematics

Course coordinators

Dominik Ślęzak

Type of course

elective monographs
optional courses

Mode

Classroom
Remote learning
Blended learning

Prerequisites

Data mining
Big data processing and cluster computing

Prerequisites (description)

Both theoretical and practical foundations of machine learning, data mining, statistical data analysis, as well as data processing and databases can significantly help in effectively acquiring knowledge during this course. The course also extends the scope of basic courses in artificial intelligence and big data processing; nevertheless, students can successfully improve their knowledge of these fields during the course itself.

Learning outcomes

Knowledge and skills:

-- In line with the 8 main course topics listed above.

Social competences:

-- Can prepare and present a report on an analysis of practical big data, where the analysis is carried out using the data mining and machine learning methods discussed in class.

-- Can point out, in non-specialist language aimed at potential users of analytical systems who are not necessarily experts in machine learning, data mining, or so-called data science, which big data problems (e.g. size, dimensionality, multimodality, quality, and variability of data) may arise during the processing and mining of a specific practical dataset.

Assessment criteria

As for the exercises, during the semester participants will implement a project related to the subject of the course. The project may take the form of participation in a competition related to the analysis of big datasets (e.g. on Knowledge Pit). Projects can be carried out individually or in pairs. Each project should end with a presentation. Presentations will be given in the last week of the semester (in the case of PhD candidates, an earlier presentation is possible). Presentations will be the basis for passing the exercises.

As for the lecture, the basis for passing it will be the preparation of a presentation based on an article of the student's choice published in the IEEE Big Data conference series, or on an article published elsewhere, provided that it is related to the student's interests and the topics of the lecture and is accepted by the lecturer. Presentations will be delivered in the last month of the course.

In order to receive the final grade in the first term, both the exercises and the lecture must be passed. The final grade in the second term (September) will be determined by an oral exam comprising a presentation of an article (see the criteria for passing the lecture) and a presentation of a finalized project (see the criteria for passing the exercises).

Bibliography

The course will partially be based on the course "Mining of Massive Datasets" (mmds.org). You can find useful examples, presentations and videos on the course website. The following literature is related:

1. Anand Rajaraman and Jeff Ullman: "Mining of Massive Datasets"

2. Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques"

3. Gregory Piatetsky-Shapiro: "KDnuggets"

4. IEEE Big Data Conferences

The latest materials from these conferences will be provided. They include, but are not limited to, articles describing machine learning competitions (e.g. organized on the Knowledge Pit platform, which can also be used as an independent source of information and data), and may also be useful for the preparation of projects by PhD candidates attending classes.

Additional information

Information on the level of this course, the year of study and semester when the course unit is delivered, and the types and number of class hours can be found in the course structure diagrams of the appropriate study programmes. This course is related to the following study programmes: