Data analysis and visualization 1000-719DAV
The students will learn how to process and visualize data (in the most common formats, e.g., csv, json, xml) using a scripting language (Python). This includes using built-in libraries and writing custom parsers.
The course will have two parts:
Part 1 – Introduction to Python programming (jupyter)
Part 2 – Data analysis and visualization (numpy, pandas, scipy, matplotlib, seaborn, plotly, ImageMagick)
• static plots
• interactive and animated plots
The students will gain hands-on experience with the most popular methods of data analysis and visualization (including working with multivariate data).
The general knowledge presented during the lectures will be applied in computer-based exercises. All exercises and projects will be done in the Python programming language.
The lectures:
1) Introduction to Python.
2) Jupyter.
3) Data sets. The most common data sets (e.g., Anscombe's quartet, Iris, MNIST) and formats (csv, json, xml, fastaq).
4) Data sets. Pre-processing using built-in libraries and writing custom parsers (numpy, pandas).
5) Statistical analysis. Mean, variance, correlation, linear regression (scipy).
6) Statistical classification. Decision trees. Random forests. Support vector machines. (Deep) neural networks.
7) Data visualization. Using Python plotting libraries (matplotlib, seaborn, plotly, ImageMagick).
8) Data visualization. Graphics (colors, lines, etc.) and their use in data presentation. Transformation of variables for better visibility. Time scales. Different types of plots (scatter, pie, bar, histogram, heatmap, boxplot).
9) Data visualization. The most common errors during plotting. The importance of colors on the plot. The perception of the data depending on the complexity and the type of the plot.
10) Plot customization. Legend. Colors. Axes (scales of measurement: nominal, ordinal, interval, logarithmic, and ratio).
11) Static vs. interactive and animated plotting.
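As an illustration of the statistics named in lecture 5 (mean, variance, correlation, linear regression), here is a minimal sketch applied to the first data set of Anscombe's quartet from lecture 3. The course itself lists scipy for this topic. The sketch instead uses only the Python standard library so that it runs without extra packages. That choice, and all variable names, are assumptions made for this example and not part of the course material.

```python
import math
import statistics

# First data set of Anscombe's quartet, hard-coded for this illustration.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# Mean and sample variance (n - 1 in the denominator).
mean_x = statistics.mean(x)
mean_y = statistics.mean(y)
var_x = statistics.variance(x)
var_y = statistics.variance(y)

# Sums of squares and cross-products around the means.
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Pearson correlation coefficient and least-squares line y = intercept + slope * x.
r = s_xy / math.sqrt(s_xx * s_yy)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(f"mean x = {mean_x}, variance x = {var_x}")
print(f"mean y = {mean_y:.2f}, variance y = {var_y:.2f}")
print(f"correlation r = {r:.3f}")
print(f"regression: y = {intercept:.2f} + {slope:.2f} * x")
```

All four data sets of the quartet share nearly the same values for these statistics while looking very different when plotted, which is one reason the course pairs statistical analysis with visualization.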
Learning outcomes
Knowledge
1. Has general knowledge of programming.
2. Has knowledge of the programming constructs and syntax of the Python programming language (assignment, control instructions, subroutine calls and parameter passing).
3. Has knowledge of data structures and operations on them.
4. Has knowledge of information management, in particular in database systems, data modelling, data storage and information retrieval.
Skills
1. Is able to apply mathematical knowledge to the formulation, analysis and solving of computing problems of medium difficulty.
2. Is able to obtain information using literature, knowledge bases, Internet and other credible sources, integrate and interpret it as well as draw conclusions and formulate opinions.
3. Is able to write, run and test programs in a chosen programming environment.
4. Is able to program algorithms; to this end uses basic algorithmic techniques and data structures.
5. Is able to evaluate, at a basic level, the usefulness of routine programming techniques and tools, as well as to choose and apply appropriate ones.
6. Knows at least one foreign language at an intermediate level, as well as English at a level that makes it possible to read and understand software documentation, handbooks and articles in the field of computer science.
Competences
1. Is aware of the necessity to systematically work on programming projects.
2. Understands and appreciates the significance of intellectual honesty in their own activities and in the activities of others; is ethical.
3. Is able to work individually, in particular manages own time and meets deadlines.
Assessment criteria
The final score depends on the project and syllabus.
"Project" - 50%, "Syllabus" - 50% of the grade.
To pass, at least 60% is required in both the syllabus and the project.
The syllabus: attendance at lectures (20%) and laboratories (80%). Thus, if there are 10 lectures and 10 laboratories, each lecture is worth 2% of the syllabus, i.e., 1% of the final grade. Moreover, each laboratory and homework (if any) is assessed and counts for a maximum of 8% of the syllabus, i.e., 4% of the final grade.
The project: the student(s) will need to collect and interpret the data and finally present it using appropriate plots (static, interactive, and animated). Both interactive (html, no size limit) and static (pdf, A0 format poster) formats are required.
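The weighting described in the assessment criteria can be checked with a short calculation. The sketch below assumes the example of 10 lectures and 10 laboratories. The function names, and the convention of expressing each laboratory result as a fraction between 0 and 1, are hypothetical and only serve this illustration.

```python
def syllabus_score(lectures_attended, lab_scores, n_lectures=10, n_labs=10):
    """Return the syllabus score in percent (0-100).

    lectures_attended: number of lectures attended.
    lab_scores: one fraction in [0, 1] per assessed laboratory
                (assumed convention for this sketch).
    """
    lecture_part = 20.0 * lectures_attended / n_lectures  # lectures: 20% of the syllabus
    lab_part = 80.0 * sum(lab_scores) / n_labs            # laboratories: 80% of the syllabus
    return lecture_part + lab_part


def final_grade(syllabus_pct, project_pct):
    """Project and syllabus each contribute 50% of the final grade."""
    return 0.5 * syllabus_pct + 0.5 * project_pct


def passes(syllabus_pct, project_pct, threshold=60.0):
    """Both components must reach the 60% threshold."""
    return syllabus_pct >= threshold and project_pct >= threshold


# Example: 8 of 10 lectures attended, full marks in 7 of 10 laboratories.
syllabus = syllabus_score(8, [1.0] * 7)
print(syllabus, final_grade(syllabus, 80.0), passes(syllabus, 80.0))
```

A single lecture contributes `syllabus_score(1, [])`, i.e., 2% of the syllabus and therefore 1% of the final grade, which matches the figures stated above.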
Bibliography
1. Dive Into Python 3 (http://histo.ucsf.edu/BMS270/diveintopython3-r802.pdf)
2. Python Data Analysis, Ivan Idris, 2014
3. Python for Data Analysis, Wes McKinney, 2013
4. [In Polish] Zbiór esejów o sztuce pokazywania danych, P. Biecek, 2014 (http://www.biecek.pl/Eseje/).
Homepage: https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/
Additional information
Information on the level of this course, the year of study and semester when the course unit is delivered, and the types and number of class hours can be found in the course structure diagrams of the appropriate study programmes. This course is related to the following study programmes:
- Master's degree, second cycle programme, Bioinformatics and Systems Biology
- Master's degree, second cycle programme, Mathematics
Additional information (registration calendar, class conductors, location and schedules of classes) might be available in the USOSweb system: