Corpus Linguistics 3200-M1-2LK
The course programme covers theoretical knowledge concerning the structure of language corpora and the practical creation of text corpora, their analysis and possible practical applications:
1. The concept of a language corpus. Language corpus versus text collection. Theoretical and material research: the role of linguistic data in linguistics.
2. Typology of corpora: monolingual and multilingual, parallel and comparable corpora. The concept of representativeness and adequacy of corpora.
3. Basic information on the indexing of language corpora; skills in interpreting data and using the morphological and syntactic information obtained
4. Basic Polish language corpora. The NKJP and WKJP corpora with accompanying tools.
5. Basic corpora of the languages taught in the individual language sections.
6. Available tools for text corpus analysis (AntConc, Jasnopis, Korpusomat, etc.).
7. Tools for parametric text analysis, practical applications of text modality measurement (Pantext).
8. Possible applications of corpora in linguistic practice:
a) research into specialised languages
b) corpora as a tool to assist translators
c) foreign language teaching
d) dictionaries and dictionary models on various data carriers
9. Text corpora and parallel texts in work with computer-assisted translation (CAT) programmes (possible integration of results with the term bases of CAT tools).
The aim of the course is to familiarise students with basic IT tools supporting the process of collecting, verifying and applying lexical, stylistic and syntagmatic structures appropriate for general or specialised language. After completing the course, students should be able to use basic digital linguistics software and search for the necessary linguistic data using the tools they have learned.
The course is conducted with the use of presentations and visualisations of the operation of individual programmes, as far as the available equipment allows.
In addition, the course emphasises searching for software independently and drawing individual conclusions from linguistic analyses and their visualisations.
Students (individually or in groups) will use the knowledge acquired to create a text corpus project, formally defined in class. The project is the basis for passing the course and develops the students' competence in creating independent lexicographical and corpus works (including conducting analyses for the purposes of a thesis).
Student workload (3 ECTS):
30 hours of classroom attendance (1 ECTS)
30 hours of preparation of text corpora and analytical data (1 ECTS)
15 hours of independent work with software (0.5 ECTS)
15 hours of reading and project preparation (0.5 ECTS)
Term 2025L:
The course programme covers theoretical knowledge concerning the structure of language corpora and the practical creation of text corpora, their analysis and possible practical applications. The final project involves the use of data generated by AI chatbots for comparative analysis. The aim of the course is to familiarise students with the basic IT tools that support the process of collecting, verifying and applying lexical, stylistic and syntagmatic structures appropriate for general or specialised language. After completing the course, students should be able to use basic digital linguistics programmes and search for the necessary linguistic data using the tools they have learned. The course is conducted with the use of presentations and visualisations of the operation of individual programmes, as far as the available equipment allows. Students (individually or in groups) will use the knowledge acquired to create a text corpus project, formally defined during the course. The project is the basis for passing the course and develops the students' competence in creating independent lexicographical and corpus works (including conducting analyses for the purposes of a thesis).
Learning outcomes
The student improves his or her qualifications with regard to the following criteria:
Knowledge:
The student knows the terminology used in linguistics and related fields at an advanced level; is familiar with the most important directions and methods of linguistic research; understands grammatical terminology; has knowledge of selected pragmatic conditions of given language systems;
has in-depth knowledge of the methodology and conduct of linguistic or literary research; knows scientific style and lexis; has knowledge of databases for linguistics; has basic knowledge of the interpretation of data obtained from analysis;
knows popular computer programs supporting the translator's work (CAT tools) as well as selected programs for frequency and stylometric analysis; knows the possibilities of using machine translation;
Skills:
uses computer programs useful in the translator's work, can properly format text in Polish and at least one foreign language; is able to efficiently use spreadsheets and charts; is able to use generally available scientific databases (including terminological and corpus databases); is able to efficiently search for information, uses expert knowledge, encyclopaedic, linguistic, general scientific, general technical, interdisciplinary and industry dictionaries, language corpora, databases, parallel texts;
can identify gaps in scientific research and directions of its continuation; formulates research problems, selects adequate methods, constructs research tools, develops, presents and interprets research results, draws conclusions;
Social competences:
is aware of the need to constantly search for new dictionary and text sources, as well as to follow modern scientific theories; responds quickly to changing realities;
draws conclusions from feedback, knows how to manage time; maintains contact with the translator community, works in a multicultural environment; knows the translator's working environment;
can work in a group, collaborate with others and take on appropriate roles (functions); can manage a small team (groups of 3-4 people);
Assessment criteria
Methods of assessing student work
- assessment of activity and ongoing preparation for classes;
- a minimum of 2 corpus projects during the course (thematic text corpus and comparative analysis of 2 corpora);
- final written assessment (thematic project - stylometric analysis of the corpus).
Assessment criteria (components of the final assessment):
- continuous assessment during classes: 10%
- projects during the course (thematic text corpus and comparative analysis of two corpora): 40%
- final assessment: 50%
Final assessment (or exam):
The final project is assessed according to a score proportional to the estimated workload (100%), broken down into individual parts of the project:
- preparation of representative text corpora: 20%
- stylometric characteristics of the collected material for analysis: 20%
- analysis of the research theses and their justification: 40%
- final conclusions, assessment of work with the software, technical comments: 20%
The final project accounts for 50% of the final grade.
Credits/projects during the course: 40% of the final grade
Activity, ongoing preparation for classes: 10% of the final grade
Scoring rules for calculated grades:
55%-69% = 3
70%-74% = 3+
75%-84% = 4
85%-89% = 4+
90%-100% = 5
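The weighting and the grade scale above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not an official calculator: the names `WEIGHTS`, `GRADE_BANDS`, `final_percentage` and `to_grade` are hypothetical, the sketch assumes that each component is scored from 0 to 100, and it treats each band's lower bound as a threshold (so fractional results such as 69.5% fall into the lower band). The scale lists no grade below 55%, so the sketch returns `None` there.

```python
# Component weights in percent, as listed in the assessment criteria.
WEIGHTS = {"activity": 10, "projects": 40, "final": 50}

# Lower bounds of the grade bands, highest first (assumption: bounds act as thresholds).
GRADE_BANDS = [(90, "5"), (85, "4+"), (75, "4"), (70, "3+"), (55, "3")]


def final_percentage(scores):
    """Weighted result in percent; each component score is assumed to be 0-100."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def to_grade(percent):
    """Map a percentage to a grade; None means the scale lists no grade (below 55%)."""
    for threshold, grade in GRADE_BANDS:
        if percent >= threshold:
            return grade
    return None


# Hypothetical student: 80% for activity, 70% for projects, 90% for the final project.
example = {"activity": 80, "projects": 70, "final": 90}
result = final_percentage(example)  # (10*80 + 40*70 + 50*90) / 100 = 81.0
print(result, to_grade(result))     # 81.0 4
```

With these example scores the weighted result is 81%, which falls in the 75%-84% band and gives a grade of 4.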
Rules for cooperation between the lecturer and students:
1. Absences – 3 unexcused absences per semester are allowed (this is in accordance with the regulations).
2. The final assessment can be taken after completing the projects carried out during the course and receiving a positive grade for class work/activity.
3. Students have the right to retake each assessed task twice. Failure to take the test on the first date or failure to complete the task on the first date without justification will result in the loss of that date.
Practical placement
---
Bibliography
Basic literature:
- Gruszczyńska E., Leńko-Szymańska A. (red.), Polskojęzyczne korpusy równoległe / Polish-language Parallel Corpora, WLS UW, Warszawa, 2016.
- Karpiński Ł., Systemy leksykalno-komunikacyjne, Campidoglio, Warszawa, 2017.
- Karpiński Ł., Maszynowa charakterystyka tekstów specjalistycznych na potrzeby terminologicznych baz danych, [w:] "Komunikacja Specjalistyczna", t. 14/2017, s. 139-163.
- Karpiński Ł., Analiza danych stylometrycznych i modalnościowych oraz pomiar elokwencji na przykładzie korpusów tekstowych dotyczących publicystyki na temat kluczowych momentów operacji specjalnej w Ukrainie, [w:] "Language and Literary Studies of Warsaw", t. 12-13/2022-2023, s. 123-152.
- Hebal-Jezierska M., Grabowski Ł., O różnych korpusowych metodach badawczych - próba krytycznej refleksji, [w:] "Komunikacja Specjalistyczna", t. 10/2016, s. 65-84.
- "Prace filologiczne", tom. LXIII, WP UW, Warszawa 2012 (tom zawierający zbiór prac dot. lingwistyki korpusowej)
Supplementary literature:
- Biber D., Conrad S., Reppen R., Corpus Linguistics. Investigating language structure and use, Cambridge University Press 1998.
- Celiński P., 2013a, Postmedia. Cyfrowy kod i bazy danych, Wydawnictwo UMCS, Lublin.
- Hebal-Jezierska M., Podstawowe zasady korzystania z korpusów przy badaniu języka, [w:] "Prace Etnograficzne", 2018, Tom 46, Numer 1, s. 30-49
- Kamińska-Szmaj I., 1989, Słownictwo tekstów popularnonaukowych w ujęciu statystycznym, [w:] „Rozprawy Komisji Językowej”, t. XVI, Wrocławskie Towarzystwo Naukowe, Wyd. PAN, Wrocław, s. 69-87.
- Karpiński Ł., Zarys leksykografii terminologicznej, KJS UW, Warszawa, 2008
- Karpiński Ł., 2009a, Wybrane założenia komputerowej analizy tekstów i gromadzenia danych, [w:] „Języki Specjalistyczne 9 – Kulturowy i leksykograficzny obraz języków specjalistycznych”, (red. eidem), KJS UW, Warszawa
- Karpiński Ł., 2012a, Analiza parametryczna tekstu a translacja maszynowa – wybrane zagadnienia, [w:] „The Linguistic Journal of Applied Linguistics”, (red.), Lingwistyczna Szkoła Wyższa w Warszawie, Warszawa.
- Karpiński Ł., Michałowski P., 2012, Wybrane metody analizy terminologii specjalistycznej (na przykładzie technolektu geografii), [w:] „Edukacja dla Przyszłości”, t. IX, 2012, Wydawnictwo Wyższej Szkoły Finansów i Zarządzania w Białymstoku, Białystok, s. 19-46.
- Lewandowska-Tomaszczyk B., Podstawy językoznawstwa korpusowego, Wyd. Uniwersytetu Łódzkiego, Łódź 2005.
- Ludskanow A., 1973, Tłumaczy człowiek i maszyna cyfrowa, WNT, Warszawa
- McEnery T., Wilson A., Corpus Linguistics: an Introduction, Edinburgh University Press, Edinburgh 2001
- Pawłowski A., 2001, Metody kwantytatywne w sekwencyjnej analizie tekstu, Uniwersytet Warszawski Katedra Lingwistyki Formalnej, Warszawa.
- Przepiórkowski A., Bańko M., Górski R., Lewandowska-Tomaszczyk B., Narodowy Korpus Języka Polskiego, PWN, Warszawa 2012
- Sambor J., 1969, Badania statystyczne nad słownictwem. Na materiale „Pana Tadeusza”, Wrocław-Warszawa.
- Świdziński M., 2006, Lingwistyka korpusowa w Polsce – źródła, stan, perspektywy, [w:] „LingVaria”, nr 1, Wydział Polonistyki UJ, Kraków.
- Tognini-Bonelli E., Corpus Linguistics at Work, John Benjamins, Amsterdam/Philadelphia 2001
Lecturer's original materials, collections of analyses, visualisations of stylometric data.
Term 2025L:
as in the main description
Notes
Term 2025L:
---
Additional information
Additional information (registration calendar, class instructors, location and schedule of classes) may be available in the USOSweb system.