Data analysis methods 2100-CB-M-D1MADA
1. Organizational Session
Substantive Introduction: The (limited) possibilities and (significant) barriers of quantitative data analysis. Quantitative data analysis – the origins of statistics as a scientific discipline. Subdisciplines of statistics (descriptive statistics, inferential statistics). Key concepts: population, population characteristics, sample.
Technical and Organizational Introduction: Discussion of the syllabus, course requirements, and course rules. Installation and configuration of the open-source data analysis software that participants will need during the course.
2a. Have data on phenomena been falsified or do they exhibit alarming anomalies?
Categorization of mass phenomena according to types of distributions: normal distribution (e.g. height, IQ, income), Poisson distribution (e.g. number of emails/phone calls received per day), exponential distribution (time between failures/attacks in a computer system), uniform distribution (coin toss, dice roll, random number generation by humans). Binomial test, Kolmogorov-Smirnov test, and Shapiro-Wilk test. Where statistics fail: Taleb’s grey and black swans.
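As an illustration of the binomial test named above, here is a minimal sketch in Python (standard library only). The function name `binomial_test_two_sided` and the coin-toss record are hypothetical, chosen for this example; they are not taken from the course materials.

```python
from math import comb

def binomial_test_two_sided(successes, trials, p=0.5):
    """Exact two-sided binomial test: add up the probabilities of all
    outcomes that are no more likely than the observed one."""
    pmf = [comb(trials, k) * p**k * (1 - p)**(trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # A small relative tolerance guards against floating-point ties.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-7))
    return min(1.0, total)

# Hypothetical record: 60 heads in 100 tosses of a coin claimed to be fair.
p_value = binomial_test_two_sided(60, 100)
print(f"p-value = {p_value:.4f}")
```

A small p-value suggests the record is hard to reconcile with a fair coin; it does not by itself show that the data were falsified.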
2b. How to detect outliers (false or anomalous values) in a dataset (without much effort)
Introduction to “numerical methods”. “Rule of thumb”, Graph’s test, Grubbs’ test, Dixon’s test, and Chauvenet’s criterion. Which digits do forgers choose: Frank Benford’s Law of Anomalous Numbers. Other (potentially) useful regularities: power law, Pareto principle, Zipf’s law. Addendum: myths and (frightening) facts about the “bad luck streak” phenomenon.
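Benford's law lends itself to a short sketch. Under the law, the expected share of leading digit d is log10(1 + 1/d). The helper names below are hypothetical, the code handles only non-zero integers, and the powers of 2 serve as illustrative data because that sequence is known to follow the law closely.

```python
import math
from collections import Counter

def benford_expected(d):
    """Expected share of leading digit d (1-9) under Benford's law."""
    return math.log10(1 + 1 / d)

def leading_digit(n):
    """First digit of a non-zero integer (sign is ignored)."""
    return int(str(abs(n))[0])

def leading_digit_shares(values):
    """Observed share of each leading digit 1-9 among non-zero integers."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# Illustrative data (an assumption of this sketch): 2^1 ... 2^200.
values = [2 ** k for k in range(1, 201)]
observed = leading_digit_shares(values)
for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_expected(d), 3))
```

In practice the observed shares would be compared with the expected ones, for example with a chi-square goodness-of-fit test; a large deviation is a reason to look closer, not proof of forgery.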
2c. How reliable can your inferences from a dataset be, and what does it depend on?
On what criteria can you base statistical inference: dataset size and variability. A quantitative, objective measure of inference certainty: the maximum standard error of estimate. Methods of calculation, boundary conditions, and interpretation. Effect size as a universal evaluation metric instead of p-value (significance level) for multiple measurements.
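The dependence of inference certainty on dataset size and variability can be sketched as follows. This is a minimal example under stated assumptions: the standard error of the mean is computed as s / sqrt(n), and the worst-case margin of error for a proportion assumes p = 0.5 and z = 1.96 (roughly 95% confidence). The function names and the sample are hypothetical.

```python
import math
import statistics

def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

def max_margin_of_error(n, z=1.96):
    """Worst-case margin of error for a proportion (p = 0.5)."""
    return z * math.sqrt(0.25 / n)

# Hypothetical sample of eight measurements.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(f"standard error of the mean: {standard_error(sample):.3f}")

# The margin of error shrinks with the square root of the sample size.
for n in (100, 1000, 10000):
    print(n, f"{max_margin_of_error(n):.3f}")
```

Because the error falls with the square root of n, quadrupling the sample only halves the margin of error.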
3. Segment your data – the art of profiling groups and identifying dangerous phenomena in datasets
Data classification into groups (segments) using cluster analysis.
4. Is there a relationship between phenomena, and how strong is it? Discover connections between phenomena and agents
Introduction to covariance analysis. Selected measures of association: Pearson’s correlation coefficient (r), eta coefficient (η), chi-square (χ²), and Cramér’s V. Interpretation and misinterpretation of correlations (Anscombe’s quartet and spurious correlations).
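The warning carried by Anscombe's quartet can be shown with a short sketch. It computes Pearson's correlation coefficient by hand for the first two of Anscombe's four datasets: one is roughly linear, the other clearly curved, yet the coefficient is nearly identical. The function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient computed from deviations."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Anscombe's quartet, datasets I and II (shared x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

r1 = pearson_r(x, y1)
r2 = pearson_r(x, y2)
print(f"dataset I:  r = {r1:.3f}")
print(f"dataset II: r = {r2:.3f}")
```

Both coefficients come out at about 0.82, which is why a scatter plot should accompany any reported correlation.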
5. The same or not? Examining differences between groups (i.e. a method of inference beyond "eyeballing")
Student’s t-test for dependent samples.
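A minimal sketch of the t statistic for dependent (paired) samples follows. It assumes the usual formula t = mean(d) / (sd(d) / sqrt(n)) with n - 1 degrees of freedom, where d are the pairwise differences. The function name and the before/after measurements are hypothetical, and the p-value lookup (from a t table or a statistics package) is left out.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for two paired samples."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1

# Hypothetical measurements on the same five units before and after a change.
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 15]

t, df = paired_t_statistic(before, after)
print(f"t = {t:.2f}, df = {df}")
```

The test works on the differences within each pair, so the two lists must be in the same unit order.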
6. The art of forecasting phenomena – regression analysis
Linear regression as a basic forecasting method. Historical overview of regression analysis. Theoretical foundations of regression analysis. Calculation and analysis of linear regression. Multivariate (multiple) regression. The importance of instrumental variables. Introduction to building models of phenomena – possibilities and limitations.
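The calculation of a simple linear regression can be sketched with ordinary least squares: the slope is the ratio of the co-deviation of x and y to the squared deviation of x, and the intercept follows from the means. The function name and the data are hypothetical, and the sketch covers one predictor only, not the multiple-regression case.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical observations, e.g. incidents recorded in five consecutive periods.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
forecast = intercept + slope * 6  # extrapolation to the next period
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, forecast = {forecast:.2f}")
```

A forecast beyond the observed range assumes the linear trend continues, which is one of the limitations the session title refers to.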
7. Prediction probability and expert opinion agreement
How to assess the probability level of your hypotheses – odds ratio. Quantifying expert agreement on a given topic – Cohen’s kappa coefficient (κ).
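Both measures named above can be sketched in a few lines. The odds ratio is computed from a 2x2 table as (a * d) / (b * c), and Cohen's kappa as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function names, the table counts, and the expert ratings are hypothetical.

```python
from collections import Counter

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table:
    a = exposed with event, b = exposed without event,
    c = unexposed with event, d = unexposed without event."""
    return (a * d) / (b * c)

def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater1)
    p_o = sum(x == y for x, y in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[label] / n) * (c2[label] / n) for label in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 2x2 table.
print(f"odds ratio = {odds_ratio(20, 80, 5, 95):.2f}")

# Hypothetical ratings by two experts on 50 items.
expert_a = ["yes"] * 20 + ["yes"] * 5 + ["no"] * 10 + ["no"] * 15
expert_b = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15
kappa = cohen_kappa(expert_a, expert_b)
print(f"kappa = {kappa:.2f}")
```

Here the experts agree on 70% of items, but half of that agreement would be expected by chance, so kappa is considerably lower than the raw agreement rate.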
8. Analysis of qualitative data – texts
Sentiment analysis – i.e. the amount of positive vs. negative emotion in a textual statement. Detection of hate speech. Plagiarism detection and authorship analysis. Automatic identification of place names in large texts and mapping them. Topic modeling of texts. Tools: [https://ws.clarin-pl.eu/](https://ws.clarin-pl.eu/). Automatic profiling of sociodemographic characteristics based on text: [https://applymagicsauce.com/demo](https://applymagicsauce.com/demo).
9. Statistics and aesthetics – principles of data presentation
Templates for analytical reports. Standards for data evaluation. Datavis/dataviz (data visualization) versus infographic. Infographics in Canva ([https://www.canva.com/pl_pl/](https://www.canva.com/pl_pl/)). Data visualization with RawGraphs ([https://rawgraphs.io/](https://rawgraphs.io/)): alluvial diagram, Gantt chart, dendrogram, Voronoi tessellations, and Sankey diagram. Proper selection of color schemes for presentations. D.M. Kessler’s color scheme system (Color Wheel). Tools: [http://paletton.com](http://paletton.com); [https://coolors.co/](https://coolors.co/); [https://color.adobe.com/pl/create/color-wheel/](https://color.adobe.com/pl/create/color-wheel/).
Learning outcomes
Knowledge
The student will acquire knowledge of:
- Types and categories of open-source and proprietary software used for data analysis
- Classical tests for relationships between variables and differences between groups, which may be employed to identify falsified or erroneous datasets, as well as to detect anomalous units of analysis within those datasets (K_W05)
- Methods for assessing the credibility of datasets, particularly in situations where data may have been falsified or manipulated (K_W05)
- Capabilities and limitations of statistical analyses in investigative analytics (K_W05)
Skills
The student will acquire the following skills:
- Basic proficiency in effective installation, configuration, and operation of selected software for statistical analysis
- Detection of outliers, i.e., anomalous units of analysis within datasets (K_U02)
- Evaluation of large datasets in terms of their credibility (K_U02)
- Identification of dependencies and relationships between variables in datasets
- Classification and categorization of datasets
- Fundamentals of prediction based on collected data
- Application of acquired knowledge to cyber risk management (K_U02)
- Adaptation of statistical measures to the needs and problems of cybersecurity (e.g., analysis of logs/billings/telemetry data, attack prediction through detection of anomalies in network traffic, identification of potentially dangerous groups in social networks, probabilistic risk assessment of activities) (K_U02)
Competences
An attempt will be made to develop the following competences:
- The ability to assess phenomena of the surrounding reality in probabilistic terms
- Promotion of the need to perceive the world from a quantitative perspective (K_K01)
Assessment criteria
Assessment is based on a project: preparing and carrying out an analytical task using appropriately selected tools from those listed above. In certain cases, the final project may also include an introduction to the analytical issues it addresses.
Bibliography
Required Reading
- D. Mider, A. Marcinkowska, *Quantitative Data Analysis for Political Scientists: A Practical Introduction Using GNU PSPP*, ACAD, Warsaw 2013.
- Recommended YouTube videos from the DataCat series (CyberTeam channel) [official course instructor's account and videos].
Supplementary Reading
- S. Bedyńska, M. Cypryańska, *The Statistical Signpost. Part One: A Practical Introduction to Statistical Inference*, SWPS University, Warsaw 2013.
- P. Francuz, R. Mackiewicz, *Numbers Don’t Know Where They Come From*, John Paul II Catholic University of Lublin, Lublin 2007.
- J. Górniak, J. Wachnicki, *First Steps in Data Analysis: SPSS PL for Windows*, SPSS Polska, Kraków 2000.
- D. Larose, *Discovering Knowledge from Data: An Introduction to Data Mining*, Wydawnictwo Naukowe PWN, Warsaw 2006.
- M. Nawojczyk, *A Guide to Statistics for Sociologists*, SPSS Polska, Kraków 2002.
- N. N. Taleb, *The Black Swan: The Impact of the Highly Improbable*, Random House, New York 2007.
- N. N. Taleb, *Antifragile: Things That Gain from Disorder*, Random House, New York 2012.
Additional information
Additional information (registration calendar, class instructors, location and schedule of classes) may be available in the USOSweb system.