Data analysis and visualization 1000-719DAV
Students will learn how to process and visualize data in the most common formats (e.g., csv, json, xml) using a scripting language (Python). This includes using built-in libraries and writing custom parsers.
The course will have two parts:
Part 1 – Introduction to Python programming (jupyter)
Part 2 – Data analysis and visualization (numpy, pandas, scipy, matplotlib, seaborn, plotly, ImageMagick)
• static plots
• interactive and animated plots
Students will get hands-on experience with the most popular methods of data analysis and visualization (including working with multivariable data).
The general knowledge presented during the lectures will be applied in the computer exercises. All exercises and projects will be done in the Python programming language.
The lectures:
1) Introduction to Python.
2) Jupyter.
3) Data sets. The most common data sets (e.g., Anscombe's quartet, Iris, MNIST) and formats (csv, json, xml, fastaq).
4) Data sets. Pre-processing using built-in libraries and writing custom parsers (numpy, pandas).
5) Statistical analysis. Mean, variance, correlation, linear regression (scipy).
6) Statistical classification. Decision trees. Random forests. Support vector machines. (Deep) neural networks.
7) Data visualization. Using Python plotting libraries (matplotlib, seaborn, plotly, ImageMagick).
8) Data visualization. Graphics (colors, lines, etc.) and their use in data presentation. Transformation of variables for better visibility. Time scales. Different types of plots (scatter, pie, bar, histogram, heatmap, boxplot).
9) Data visualization. The most common errors during plotting. The importance of colors on the plot. The perception of the data depending on the complexity and the type of the plot.
10) Plot customization. Legend. Colors. Axes (scales of measurement: nominal, ordinal, interval, logarithmic and ratio).
11) Static vs. interactive and animated plotting.
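To illustrate the kind of task covered in lectures 3 to 5 (reading a common format with built-in libraries, then computing the mean, variance, correlation and a linear regression), here is a minimal sketch. It is not course material: it uses only Python's standard library instead of the numpy, pandas and scipy packages named above, the function names load_xy and summarize are hypothetical, and the data is dataset I of Anscombe's quartet embedded as CSV text.

```python
import csv
import io
import math
import statistics

# Dataset I of Anscombe's quartet, embedded as CSV text so the sketch
# runs without any external file (an assumption made for illustration).
ANSCOMBE_I_CSV = """x,y
10,8.04
8,6.95
13,7.58
9,8.81
11,8.33
14,9.96
6,7.24
4,4.26
12,10.84
7,4.82
5,5.68
"""


def load_xy(csv_text):
    """Parse CSV text with 'x' and 'y' columns into two lists of floats."""
    reader = csv.DictReader(io.StringIO(csv_text))
    xs, ys = [], []
    for row in reader:
        xs.append(float(row["x"]))
        ys.append(float(row["y"]))
    return xs, ys


def summarize(xs, ys):
    """Return basic statistics and a least-squares line y = intercept + slope * x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    var_x, var_y = statistics.variance(xs), statistics.variance(ys)  # sample variance
    n = len(xs)
    # Sample covariance, using the same (n - 1) denominator as the variances.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    r = cov / math.sqrt(var_x * var_y)  # Pearson correlation coefficient
    slope = cov / var_x                 # ordinary least-squares slope
    intercept = mean_y - slope * mean_x
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": var_x,
        "var_y": var_y,
        "r": r,
        "slope": slope,
        "intercept": intercept,
    }


xs, ys = load_xy(ANSCOMBE_I_CSV)
summary = summarize(xs, ys)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

All four datasets of the quartet give nearly the same summary values, which is why the quartet is a common argument for plotting data before interpreting its statistics.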
Learning outcomes
Knowledge
1. Has general knowledge of programming.
2. Has knowledge of the programming constructs and syntax of the Python programming language (assignment, control instructions, subroutine calls and parameter passing).
3. Has knowledge of data structures and operations on them.
4. Has knowledge of information management, in particular in database systems, data modelling, data storage and information retrieval.
Skills
1. Is able to apply mathematical knowledge to the formulation, analysis and solution of computing problems of medium difficulty.
2. Is able to obtain information using literature, knowledge bases, Internet and other credible sources, integrate and interpret it as well as draw conclusions and formulate opinions.
3. Is able to write, run and test programs in a chosen programming environment.
4. Is able to program algorithms; to this end uses basic algorithmic techniques and data structures.
5. Is able to evaluate, at a basic level, the usefulness of routine programming techniques and tools, as well as to choose and apply appropriate ones.
6. Knows at least one foreign language on an intermediate level as well as English on the level that makes it possible to read and understand software documentation, handbooks and articles in the field of computer science.
Competences
1. Is aware of the necessity to systematically work on programming projects.
2. Understands and appreciates the significance of intellectual honesty in their own activities and the activities of others; is ethical.
3. Is able to work individually, in particular manages their own time and meets deadlines.
Assessment criteria
The final score is based on both the project and the syllabus, each contributing 50% to the final grade.
To pass, you need at least 60% in both the syllabus and the project.
The Project: Students will collect and interpret data, then present their findings using appropriate visualizations, including static, interactive, and animated plots. Both an interactive format (HTML, no size limit) and a static format (PDF, A0 poster) are required.
The Syllabus: After each laboratory session, there will be exercises to complete, some of which may need to be finished at home. You will have one week (until Saturday at midnight) to submit them as homework. For reference, last year, students could earn up to 1,200 points, with each week's workload worth approximately 100 points.
Attendance at laboratory exercises is mandatory (you may miss only 2 exercise sessions without justification, but this does not exempt you from submitting homework on time).
Bibliography
1. Dive Into Python 3 (http://histo.ucsf.edu/BMS270/diveintopython3-r802.pdf)
2. Python Data Analysis, Ivan Idris, 2014
3. Python for Data Analysis, Wes McKinney, 2013
4. [In Polish] Zbiór esejów o sztuce pokazywania danych, P. Biecek, 2014 (http://www.biecek.pl/Eseje/).
Homepage: https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/
Additional information
Information on the level of this course, the year of study and semester in which the course unit is delivered, and the types and number of class hours can be found in the course structure diagrams of the appropriate study programmes. This course is related to the following study programmes:
- Master's degree, second cycle programme, Bioinformatics and Systems Biology
- Master's degree, second cycle programme, Mathematics
Additional information (registration calendar, class conductors, location and schedules of classes) might be available in the USOSweb system.