Претрага
83 items
-
Bilingual lexical extraction based on word alignment for improving corpus search
Jelena Andonovski, Branislava Šandrih, Olivera Kitanović. "Bilingual lexical extraction based on word alignment for improving corpus search" in The Electronic Library, Emerald (2019). https://doi.org/10.1108/EL-03-2019-0056
-
Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian
The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of a gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on TreeTagger and spaCy taggers, and the annotation schema alignment ...... MULTEXT-East and Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of developed taggers were compared and the impact of training set size was investigated, which resulted ...
... automate the process of an- notation schema harmonization and preparation of training datasets as much as possible. The corpus of training set texts needed checking and cor- rection. Corpus correction is a time consuming process, which requires a lot of manual intervention and help of lin- guistic specialists ...
... the multilin- gual corpus prepared in the scope of the project “Integrated European language data Repository Area” (Gavrilidou et al., 2006). It contains texts from law, health and edu- cation domains. Švejk, Floods, History are three short 1Unitex/GramLab — Cross Plaform Corpus Processing Suite, ...Ranka Stanković, Branislava Šandrih, Cvetana Krstev, Miloš Utvić, Mihailo Škorić. "Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian" in Proceedings of the 12th Language Resources and Evaluation Conference, May Year: 2020, Marseille, France, European Language Resources Association (2020)
-
Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection
Ranka Stanković, Cvetana Krstev, Branislava Šandrih Todorović, Duško Vitas, Mihailo Škorić, Milica Ikonić Nešić (2022)In this paper we present the Serbian part of the ELTeC multilingual corpus of novels written in the time period 1840-1920. The corpus is being built in order to test various distant reading methods and tools with the aim of re-thinking the European literary history. We present the various steps that led to the production of the Serbian sub-collection: the novel selection and retrieval, text preparation, structural annotation, POS-tagging, lemmatization and named entity recognition. The Serbian sub-collection was published ...Ranka Stanković, Cvetana Krstev, Branislava Šandrih Todorović, Duško Vitas, Mihailo Škorić, Milica Ikonić Nešić. "Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection" in Proceedings of the Language Resources and Evaluation Conference, June 2022, Marseille, France, European Language Resources Association (2022)
-
SASA Dictionary as the Gold Standard for Good Dictionary Examples for Serbian
Ranka Stanković, Branislava Šandrih, Rada Stijović, Cvetana Krstev, Duško Vitas, Aleksandra Marković (2019)У овом раду представљамо модел за избор добрих примера за речник српског језика и развој иницијалних компоненти модела. Метода која се користи заснива се на детаљној анализи различитих лексичких и синтактичких карактеристика у корпусу састављених од примера из пет дигитализованих свезака речника САНУ. Почетни скуп функција био је инспирисан сличним приступом и за друге језике. Дистрибуција карактеристика примера из овог корпуса упоређује се са карактеристиком дистрибуције узорака реченица ексцерпираних из корпуса који садрже различите текстове. Анализа је показала да ...Српски, добри примери из речника, аутоматизација израде речника, издвајање својстава, Машинско учење... supported by the corpus data. It is common for lexicographers to look for examples in the corpus of contemporary Serbian (SrpKor, developed by D. Vitas and a group of collaborators from University of Belgrade, http://www.korpus.matf.bg.ac.rs/korpus/), which is being used as a control corpus, but they rarely ...
... to the corpus made of examples, we prepared a control dataset derived from various texts, which was used as a sample corpus for dictionary example extraction. The control dataset of example candidates was obtained from the digital library Biblisha8 (Stanković et al., 2017), SrpKor – the corpus of c ...
... syntactic features in a corpus compiled of examples from the five digitized volumes of the Serbian Academy of Sciences and Arts (SASA) dictionary. The initial set of features was inspired by a similar approach for other languages. The feature distribution of examples from this corpus is compared with the ...Ranka Stanković, Branislava Šandrih, Rada Stijović, Cvetana Krstev, Duško Vitas, Aleksandra Marković. "SASA Dictionary as the Gold Standard for Good Dictionary Examples for Serbian" in Electronic lexicography in the 21st century. Proceedings of the eLex 2019 conference , Lexical Computing CZ, s.r.o. (2019)
-
Extraction of Bilingual Terminology Using Graphs, Dictionaries and GIZA++
Branislava Šandrih, Ranka Stanković (2020)U nauci, industriji i mnogim istraživačkim oblastima, terminologija se brzo razvija. Najčešće, jezik koji je „lingua franca“ za većinu ovih oblasti je engleski. Kao posledica toga, za mnoga polja termini domena su koncipirani na engleskom, a kasnije se prevode na druge jezike. U ovom radu predstavljamo pristup za automatsko izdvajanje dvojezične terminologije za englesko-srpski jezički par koji se oslanja na usaglašeni dvojezični korpus domena, ekstraktor terminologije za ciljni jezik i alat za usklađivanje delova. Ispitujemo performanse metode na domenu ...... obtaining 8 different experimental settings: 1. The input domain aligned corpus (Input i) consists of: (a) the aligned corpus LIS-corpus; (b) the aligned corpus LIS-corpus extended with the bilingual aligned pairs bi-list (LIS-corpus+); 2. The list of domain terms for the source language (Input ii) ...
... sentence-aligned domain-specific corpus involving a source and a target language, denoted as S(text.align) ↔ T (text.align). In this paper we refer to this tool as LIS-corpus. As a textual resource, twelve issues with a total of 84 papers were aligned at the sentence level resulting in 14,710 aligned segments ...
... applied to the source language part of the aligned input corpus; 3. The extraction of the set of MWTs in the target language by Serb-TE (Input iii) was done: (a) on the target language part of the aligned chunks (chunk); (b) on the target language part of the aligned input sentences (text). Infotheca Vol ...Branislava Šandrih, Ranka Stanković. "Extraction of Bilingual Terminology Using Graphs, Dictionaries and GIZA++" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.6
-
From ELTeC Text Collection Metadata and Named Entities to Linked-data (and Back)
In this paper we present the wikification of the ELTeC (European Literary Text Collection), developed within the COST Action ``Distant Reading for European Literary History'' (CA16204). ELTeC is a multilingual corpus of novels written in the time period 1840—1920, built to apply distant reading methods and tools to explore the European literary history. We present the pipeline that led to the production of the linked dataset, the novels’ metadata retrieval and named entity recognition, transformation, mapping and Wikidata population, ...Milica Ikonić Nešić, Ranka Stanković, Christof Schöch and Mihailo Škorić. "From ELTeC Text Collection Metadata and Named Entities to Linked-data (and Back)" in Proceedings of The 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference, June 2022, Marseille, France, European Language Resources Association (2022)
-
Towards ELTeC-LLOD: European Literary Text Collection Linguistic Linked Open Data
Овај рад описује студију случаја о генерисању повезаних података креираних на основу обечежених текстуалних корпуса коришћењем формата размене података у обради природних језика (NIF). Као основа за ово истраживање послужио је подскуп корпуса ELTeC, који се састоји од 900 романа из периода 1840-1920 за 9 европских језика. Верзија романа са коментарима, у такозваном TEI level-2 формату, трансформисана је у NIF, формат заснован на RDF/OWL који има за циљ постизање интероперабилности између алата за обраду природних језика, језичких ресурса и ...Ranka Stanković, Christian Chiarcos, Miloš Utvić, Olivera Kitanović. "Towards ELTeC-LLOD: European Literary Text Collection Linguistic Linked Open Data" in LDK 2023 – 4th Conference on Language, Data and Knowledge, 12-15 September in Vienna, Austria, Lisabon : NOVA FCSH - CLUNL (2023). https://doi.org/10.34619/srmk-injj
-
OntoLex Publication Made Easy: A Dataset of Verbal Aspectual Pairs for Bosnian, Croatian and Serbian
Ovaj rad predstavlja novi jezički resurs za pretraživanje i istraživanje verbalnih aspektnih parova u BCS (bosanskom, hrvatskom i srpskom), kreiran korišćenjem principa Lingvističkih Povezanih Otvorenih Podataka (LLOD). Pošto ne postoji resurs koji bi pomogao učenicima bosanskog, hrvatskog i srpskog kao stranih jezika da prepoznaju aspekt glagola ili njegove parove, kreirali smo novi resurs koji će korisnicima pružiti informacije o aspektu, kao i link ka aspektnim parovima glagola. Ovaj resurs takođe sadrži spoljne linkove ka monolingvalnim rečnicima, Wordnetu i BabelNetu. ...Ranka Stanković, Maxim Ionov, Medina Bajtarević, Lorena Ninčević. "OntoLex Publication Made Easy: A Dataset of Verbal Aspectual Pairs for Bosnian, Croatian and Serbian" in Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024, Turin, 20-25 May 2024, ELRA and ICCL (2024)
-
Serbian NER&Beyond: The Archaic and the Modern Intertwinned
U ovom radu predstavljamo srpski književni korpus koji se razvija pod okriljem COST Akcije „Distant Reading for European Literary History” CA16204. Koristeći ovaj korpus romana napisanih pre više od jednog veka, razvili smo i učinili javno dostupnim Sistem za prepoznavanje imenovanih entiteta (NER) obučen da prepozna 7 različitih tipova imenovanih entiteta, sa konvolucionom neuronskom mrežom (CNN), koja ima F1 rezultat od ≈91% na test skupu podataka. Ovaj model je dalje ocenjen na posebnom skupu podataka za evaluaciju. Završavamo poređenje ...... ikonic.nesic@fil.bg.ac.rs Abstract In this work, we present a Serbian litera- ry corpus that is being developed under the umbrella of the “Distant Reading for European Literary History” COST Acti- on CA16204. Using this corpus of novels written more than a century ago, we ha- ve developed and made publicly ...
... statistical and data-driven approach. The NE layer of ELTeC corpus has presently been produced for three languages: Hungarian, Portuguese, Slovene. Santos et al. (2020) repor- ted on the NER annotation of the Portuguese sub-collection of the ELTeC corpus. Authors used the PALAVRAS-NER parser, a Constra- int ...
... tated sub-collections of 100 novels per language written more than 100 years ago. In this paper, we present a collection of Ser- bian texts in this corpus, named SrpELTeC. Alongside, we describe our efforts in developing its Named Entity (NE) layer, defined previou- sly as one of the main action’s de ...Branislava Šandrih Todorović, Cvetana Krstev, Ranka Stanković, Milica Ikonić Nešić. "Serbian NER&Beyond: The Archaic and the Modern Intertwinned" in Proceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Methods and Applications, INCOMA Ltd. Shoumen, BULGARIA (2021). https://doi.org/10.26615/978-954-452-072-4_141
-
A Lexical Approach to Acronyms and their Definitions
In this paper we present a comprehensive approach to acronyms for Natural-Language Processing (NLP) of Serbian texts. The proposed procedure includes extraction of acronyms and their definitions that are usual Multi-Word Units (MWUs), shallow parsing of MWUs that enables MWU lemmatization and production of entries in morphological electronic dictionaries, both for MWU and acronyms, that are provided with grammatical, syntactic, semantic and domain information. This approach enables representation that reflects complex relations between acronyms and their definitions.... +DOM=Mil +ACR=KFOR:ms3:ms7 3. Used Resources and Tools Corpus: As a corpus we have used an excerpt from the Corpus of Contemporary Serbian3 that has more than 22 million simple word forms (more than one million sen- tences). This corpus contains 70% of newspaper texts (57% daily, 8% weekly and ...
... 21 multiple entries 19 e-dict_MW_forms 9,956 6 corpus 23MW acr. inflects 333 acronyms 987 acr. in masc. 323 acr. in fem. 64 acr. both m&f 35 acr. in plural 2 7 acronyms 987 e-dict_acr_forms 6,946 Table 1: Results of the procedure on our corpus The majority of (acronym, definition) pairs represent ...
... logical dictionary of multi-word units. In IceTAL, vol- ume 6233 of LNCS. Springer. Krstev, C. and D. Vitas, 2005. Corpus and Lexicon – Mu- tual Incompletness. In Proc. of the Corpus Linguistics Conference, Birmingham. Liberman, Mark Y and Kenneth W Church, 1992. Text analysis and word pronunciation ...Cvetana Krstev, Duško Vitas, Ranka Stanković. "A Lexical Approach to Acronyms and their Definitions" in Proceedings of the 7th Language & Technology Conference, November 27-29, 2015, Poznań, Poland, Springer (2015)
-
Using Query Expansion for Cross-Lingual Mathematical Terminology Extraction
Velislava Stoykova, Ranka Stanković (2018)Velislava Stoykova, Ranka Stanković. "Using Query Expansion for Cross-Lingual Mathematical Terminology Extraction" in Advances in Intelligent Systems and Computing, Springer International Publishing (2018). https://doi.org/10.1007/978-3-319-91189-2_16
-
Using Lexical Resources for Irony and Sarcasm Classification
The paper presents a language dependent model for classification of statements into ironic and non-ironic. The model uses various language resources: morphological dictionaries, sentiment lexicon, lexicon of markers and a WordNet based ontology. This approach uses various features: antonymous pairs obtained using the reasoning rules over the Serbian WordNet ontology (R), antonymous pairs in which one member has positive sentiment polarity (PPR), polarity of positive sentiment words (PSP), ordered sequence of sentiment tags (OSA), Part-of-Speech tags of words (POS) ...... linguistic patterns were suggested for recognition of ironic statements in a corpus of tweets in Chinese, while authors in [32] used the pattern “about as * as *” as a query sent to the Google Search engine, in order to obtain a corpus of examples of the rhetorical figure of comparison (simile) which was later ...
... “figuratively” [14]. Authors in [29] used a microblogging platform called Plurk, similar to Twitter, to create a corpus of ironic statements. Other on- line resources can also be used for ironic corpus creation – Google Books search was used in the work by [21], where the phrase “said sarcastically” was used ...
... [4],[12],[28], where irony and sarcasm were treated in the same way in classification tasks. Similarly as in the case of building a corpus of ironic tweets, a corpus of sarcastic tweets has been generated based on online search with geolocation and time constraints, using the hashtag #sarkazam(sarcasm) ...Miljana Mladenović, Cvetana Krstev, Jelena Mitrović, Ranka Stanković. "Using Lexical Resources for Irony and Sarcasm Classification" in Proceedings of the 8th Balkan Conference in Informatics (BCI '17), New York, NY, USA, : ACM (2017). https://doi.org/
-
SrpELTeC on Platforms: Udaljeno čitanje, Aurora, NoSketch
Serbian ELTeC collection (100 novels and extended) developed within COST action CA16204 Distant Reading for European Literary History comprises at this moment 111 novels published in the period 1840-1920. Such a valuable resource is and will be used for various lexical and linguistic research, by using different tools and methodologies. In this paper, three platforms on which these novels are published will be presented: “Udaljeno ˇcitanje”, Aurora and Sketch Engine.Ranka Stanković, Mihailo Škorić, Petar Popović. "SrpELTeC on Platforms: Udaljeno čitanje, Aurora, NoSketch" in Infotheca, Faculty of Philology, University of Belgrade (2022). https://doi.org/10.18485/infotheca.2021.21.2.7
-
Нове технологије за оживљавање старих текстова
удаљено читање, књижевни корпус, обрада српског језика, анотација врстом речи, лематизација, именовани ентитетиЦветана Крстев, Ранка Станковић, Бранислава Шандрих Тодоровић, Милица Иконић Нешић. "Нове технологије за оживљавање старих текстова" in Зборник радова Међународне научне конференције Дигитална хуманистика и словенско културно наслеђе II, Београд, 28-29 јуни 2021., Београд : Савез славистичких друштава Србије (2023)
-
Претрага корпуса заснована на употреби екстерних лексичких ресурса путем веб-сервиса
У раду се разматра хибридни приступ претрази корпуса, илустрован на примеру алатки OCWB и NoSketch Engine, примењених на специјални корпус из области рударства (РудКор) и Корпус савременог српског језика (СрпКор). Разматрани приступ комбинује постојеће могућности алатки OCWB и NoSketch Engine, које своју претрагу заснивају на лингвистичкој анотацији корпуса, са новим могућностима претраге у виду консултовања екстерних језичких ресурса (морфолошки електронски речници српског језика и лексичка база података Српски ворднет). Хибридни приступ је реализован надоградњом вебсучеља која поменуте алатке користе ...... Lazić THE CORPUS SEARCH BASED ON USAGE OF EXTERNAL LEXICAL RESOURCES THROUGH WEB SERVICES Summary Тhis paper explores a hybrid approach to corpus search illustrated by application of corpus tools OCWB and NoSketch Engine to special corpus for mining domain, RudKor, as well as to Corpus of Contemporary ...
... Architecture for the New Millennium”, In: Proceedings of the Corpus Linguistics 2011 Conference, Birmingham, University of Birmingham. Еверт 2019: Stefan Evert and The OCWB Development Team, CQP Query Lan- guage Tutorial, The IMS Open Corpus Workbench (CWB 3.4.16), May 2019, http://cwb.sourceforge. ...
... i.org/10.18485/msc.2018.47.3.ch7. Харди 2012: Andrew Hardie, “CQPweb – Combining Power, Flexibility and Us- ability in a Corpus Analysis Tool”, International Journal of Corpus Linguis- tics. 17/3, 380–409. Шмид 1997: Helmut Schmid, „Probabilistic Part-of-Speech Tagging Using Deci- sion Trees”, In ...Милош Утвић, Ранка Станковић, Александра Томашевић, Михаило Шкорић, Биљана Лазић. "Претрага корпуса заснована на употреби екстерних лексичких ресурса путем веб-сервиса" in Научни састанак слависта у Вукове дане - Vol. 48/3 Српски језик и његови ресурси, Међународни славистички центар, Филолошки факултет, Универзитет у Београду (2019). https://doi.org/10.18485/msc.2019.48.3.ch12
-
Digital Library From A Domain Of Criminalistics As A Foundation For A Forensic Text Analysis
U ovom radu predstavljen je model koji omogućava prikupljanje, pripremu, opis metapodataka, upravljanje i eksploataciju, uključujući pretragu punog teksta dokumenata iz domena kriminalistike napisanih na srpskom jeziku. Predloženi pristup primenjuje se na veb portalu koji sakuplja različite tekstove nastale iz časopisa Akademije za kriminalistiku i policijske studije, Krivičnog zakona Srbije, konferencija „Tara“ i „Reiss“, kao i iz nekih doktorskih disertacija vezanih za ovu oblast istraživanje. Nakon obrade teksta, korpus koji sadrži preko 5500 stranica običnog teksta, kreiran je i ...... “reality” begin? BAAL Conference on The Ethics of Online Research Methods. Cardiff 9 Cvetana Krstev, Duško Vitas, “Corpus and Lexicon - Mutual Incompletness ”, in Proceedings of the Corpus Linguistics Conference, 14-17 July 2005, Birmingham, eds. Pernilla Danielsson and Martijn Wagenmakers, ISSN 1747-9398 ...
... Automata, Text and Electronic Dictionaries, Faculty of philology, Belgrade, 2008. 2. Cvetana Krstev, Duško Vitas, “Corpus and Lexicon - Mutual Incompletness ”, in Proceedings of the Corpus Linguistics Conference, 14-17 July 2005, Birmingham, eds. Pernilla Danielsson and Martijn Wagenmakers, ISSN 1747-9398 ...
... code of Serbia, the “Tara” and “Reiss” conferences, and from some of PhD dissertations related to this field of research. After text processing, a corpus containing over 5500 pages of plain text is created and prepared for publication as an online resource for full text search using Omeka, an open ...Dalibor Vorkapić, Aleksandra Tomašević, Miljana Mladenović, Ranka Stanković, Nikola Vulović. "Digital Library From A Domain Of Criminalistics As A Foundation For A Forensic Text Analysis" in International Scientific Conference “Archibald Reiss Days” Thematic Conference Proceedings Of International Significance, Belgrade, 7-9 November 2017, Academy Of Criminalistic And Police Studies Belgrade (2017)
-
Serbian ELTeC Sub-Collection in Wikidata
This paper presents an example of integration of Wikidata with digital libraries and external systems, as well as some best practices for speeding up the process of data preparation and import to Wikidata, on the use case of SrpELTeC, Serbian subcollection of the ELTeC multilingual collection (European Literary Text Collection). After preliminary work on the manual Wikidata population with SrpELTeC novels, the goal was to automate the process of preparing and importing information, so different solutions were analysed and ...Milica Ikonić Nešić, Ranka Stanković, Biljana Rujević. "Serbian ELTeC Sub-Collection in Wikidata" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.2.4
-
Annotation of the Serbian ELTeC Collection
Ovaj rad predstavlja takozvano izdanje nivoa 2 kolekcije tekstova SrpELTeC razvijene u okviru aktivnosti Radne grupe 2 – Metode i alati COST akcije CA 16204 (Distant Reading for European Literary History) i njene specifikacije šeme. Izdanje nivoa 2 je nastavak izdanja nivoa 1, koje se koristi kao ulaz za morfosintaksičke i NER anotacije romana. Srpska obrada nivoa-2 je navedena kroz potrebne korake, uključujući metode i alate koji se koriste u tom procesu. Neki statistički podaci iz srpske kolekcije nivoa ...udaljeno čitanje, literarni korpus, tagiranje, prepoznavanje imenovanih entiteta, lematizacija, ELTeCRanka Stanković, Cvetana Krstev, Branislava Šandrih Todorović, Mihailo Škorić. "Annotation of the Serbian ELTeC Collection" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.2.3
-
Development of Open Educational Resources (OER) for Natural Language Processing
In this paper we present the development of an online course at the edX BAEKTEL platform named “Lexical Recognition in the Natural Language Processing (NLP)”. It is based on the course of the same name for PhD studies at the University of Belgrade, Faculty of Philology. There are not many courses in Computational Linguistics (CL) on OER platforms, and there is none in Serbian either for CL or NLP. We have developed this course in order to improve this ...... improve this situation as it can prove useful both for linguists working in corpus linguistics and computer scientists developing NLP applications. The participant will become familiar with the use of Unitex, the corpus processing system for which many valuable resources for Serbian were already ...
... oriented processing of texts in human languages. As an illustration, a string oriented web interface to the Corpus of Contemporary Serbian 13 is presented. [13][14] 2. Unitex corpus processing system is presented from the practical point of view: how to install it and start working with it ...
... regular expressions and graphs; text transformations; processing of monolingual and bilingual texts (bi- texts in which basic segments are aligned). Unitex is freely distributed under the terms of the Lesser General Public License (LGPL). This means that everyone can redistribute Unitex ...Cvetana Krstev, Biljana Lazić, Ranka Stanković, Giovanni Schiuma, Miladin Kotorčević. "Development of Open Educational Resources (OER) for Natural Language Processing" in The Sixth International Conference on e-Learning (eLearning-2015), September 2015, Belgrade, Serbia, Belgrade : Belgrade Metropolitan Univesity (2015)
-
Old or New, We Repair, Adjust and Alter (Texts)
Cvetana Krstev, Ranka Stanković (2020)U ovom radu predstavljamo kako se e-rečnici i kaskade transduktora konačnih stanja implementirani u alatu Unitex mogu koristiti za rešavanje tri problema transformacije teksta: ispravljanje tekstova nakon OCR-a, vraćanje dijakritičkih znakova i prebacivanje između različitih jezičkih varijanti.ispravka teksta, OCR greške, restauracija dijakritika , jezičke varijante, elektronski rečnik, transduktori konačnih stanja... present the solution for detecting and correcting OCR errors developed for the compilation of the corpus of Serbian novels written and published in the period 1840–1920.2 The novels selected for this corpus were mainly printed in Cyrillic script (only a few of them were in Latin script). When scanning ...
... The text obtained by OCR is processed using Serbian morphological electronic dictionaries (SMD) (Krstev, 2008); 2 This corpus is a part of the European Literary Text Collection corpus (ElTEC) developed in the scope of the COST action 16204 Distant Reading for European Literary History (d-reading). 64 ...
... performed by finite-state transducers (FST) implemented in Unitex.3 A separate FST is written for each replacement 3 UnitexGramLab, a lexicon-based corpus processing suite. Infotheca Vol. 19, No. 2, December 2019 65 Krstev C., Stanković R., “Old or new, we repair . . . ”, pp. 61–80 a non-valid ...Cvetana Krstev, Ranka Stanković. "Old or New, We Repair, Adjust and Alter (Texts)" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.3