An Italian-Serbian Sentence Aligned Parallel Literary Corpus
This article presents the construction and relevance of an Italian-Serbian sentence-aligned parallel corpus, delving into the aligned sentences in order to facilitate effective translation between the two languages. The parallel corpus serves as a valuable resource for language experts, researchers, and language enthusiasts, fostering a deeper understanding of linguistic nuances and cultural expressions. By bridging the gap between Serbian and Italian, this corpus opens new avenues for cross-cultural communication and collaboration, and ultimately contributes to the improvement of language-related ...
Saša Moderc, Ranka Stanković, Aleksandra Tomašević, Mihailo Škorić. "An Italian-Serbian Sentence Aligned Parallel Literary Corpus" in Review of the National Center for Digitization, Belgrade : Faculty of Mathematics, University of Belgrade (2023). https://doi.org/10.5281/zenodo.11203388
E-Connecting Balkan Languages
In this paper we present a versatile language processing tool that can be successfully used for many Balkan languages. This tool relies for its work on several sophisticated textual and lexical resources that were developed for most Balkan languages. These resources are based on several de facto standards in natural language processing.
... американски щати are connected automatically. 3. Using WS4LR with Aligned Texts The WS4LR module that works with aligned texts expects them to be in Translation Memory eXchange (TMX) format. It can also transform texts previously aligned by XAlign into that format, as well as into several other formats: ...
... visualization of aligned texts by applying appropriate XSLT transformations. Thus visualized texts user can freely browse. One such visualization is represented in Figure 1. Browsing, however, is not a particularly successful form of text exploration. WS4LR module for aligned texts offers users ...
... methodological framework was used for their development, and how they were integrated for their successful usage. 2.1 Textual Resources – Aligned Texts The aligned texts as a special form of multilingual corpora were in focus of many projects in past couple of decades. A systematic approach to the ...
Cvetana Krstev, Ranka Stanković, Duško Vitas, Svetla Koeva. "E-Connecting Balkan Languages" in Proceedings of the Workshop on Multilingual resources, technologies and evaluation for Central and Eastern European Languages, 17 September 2009, eds. C. Vertan, S. Piperidis, E. Paskaleva and Milena Slavcheva, Borovets, Bulgaria : Association for Computational Linguistics Stroudsburg, PA, USA (2009)
Using English Baits to Catch Serbian Multi-Word Terminology
In this paper we present the first results in bilingual terminology extraction. The hypothesis of our approach is that if for a source language domain terminology exists as well as a domain aligned corpus for a source and a target language, then it is possible to extract the terminology for a target language. Our approach relies on several resources and tools: aligned domain texts, domain terminology for a source language, a terminology extractor for a target language, and a ...
Keywords: aligned texts, word alignment, terminology extraction, electronic dictionaries, morphological inflection
... language domain terminology exists as well as a domain aligned corpus for a source and a target language, then it is possible to extract the terminology for a target language. Our approach relies on several resources and tools: aligned domain texts, domain terminology for a source language, a terminology ...
... extracted 846 different Serbian domain phrases, containing 515 Serbian phrases that were not present in the existing domain terminology. Keywords: aligned texts, word alignment, terminology extraction, electronic dictionaries, morphological inflection 1. Motivation Terminology is rapidly developing in many ...
... Serbia, with the aim of presenting the librarianship terminology on different media (Kovačević et al., 2004). This resource was first used on aligned texts in query expansion (Stanković et al., 2012); the Excel format of the dictionary was at that time transformed into a relational database. The ...
Cvetana Krstev, Branislava Šandrih, Ranka Stanković. "Using English Baits to Catch Serbian Multi-Word Terminology" in Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA) (2018)
Softverski alati za korišćenje resursa za srpski jezik
Ivan Obradović, Ranka Stanković (2008)
... parallel texts. In the majority of cases, parallel texts are aligned, which turns a parallel text into an aligned text. Sometimes, it is even considered that parallel texts are the same as aligned texts, but this does not always have to be the case, since non-aligned parallel texts are also ...
... corpora composed of parallel texts or bi-texts, usually comprising two texts of which one is original, and the other its translation. The majority of these parallel texts are aligned, which means that relations are established between corresponding elements of both texts (paragraph, sentence, word) ...
... for a synset and its hypernyms 3.4 Aligned texts WS4LR contains a module for processing of parallel texts which have previously been aligned using the text alignment tool XAlign (Bonhomme et al., 2001). The module enables the transformation of texts aligned by XAlign into different formats: textual ...
Ivan Obradović, Ranka Stanković. "Softverski alati za korišćenje resursa za srpski jezik" in INFOteka: časopis za informatiku i bibliotekarstvo, Belgrade, Serbia : Zajednica biblioteka univerziteta u Srbiji (2008)
Wordnet Development Using a Multifunctional Tool
Ivan Obradović, Ranka Stanković (2007)
In this paper we present a multifunctional tool for manipulating heterogeneous language resources. The tool handles electronic dictionaries, wordnets and aligned texts, and provides for their synchronous use in various tasks. We focus here on the description of the possibilities this tool offers in the development of wordnets. Besides the wordnet module which enables parallel handling of two wordnets, other modules, such as the module for morphological dictionaries and the module for aligned texts, as well as available finite ...
... 8. Aligned texts with highlighted words Another, more complex option is to use aligned texts. If PWN is used for the source synset, then the language of one of the parallel texts must be English. Namely, WS4LR allows the user to search aligned texts using words from both parallel texts. All ...
... module for management of aligned parallel texts uses texts which have previously been aligned using Xalign as an alignment tool [3]. The module converts these texts to the Translation Memory eXchange (TMX) format, which is becoming the standard format for aligned texts. Figure 4 depicts the form ...
... of aligned parallel texts Parallel texts, which usually originate from a text in one language and its translation in another, are often aligned at a certain level (paragraph, sentence, etc) by matching the corresponding segments of the original and its translation. Aligned parallel texts are ...
Ivan Obradović, Ranka Stanković. "Wordnet Development Using a Multifunctional Tool" in Proceedings of the International Workshop Computer Aided Language Processing (CALP) '2007, Borovets, Bulgaria, September 2007, - (2007)
The Many Faces of SrpKor
The acronym SrpKor denotes a family of electronic corpora of contemporary Serbian whose construction began in the late 1970s and which became more widely visible to the interested research community when its first version was published on the web in 2002. During this long period, especially before useful textual resources appeared on the web, corpus development consisted of collecting and processing material as well as developing corpus processing methods. Namely, an electronic corpus is not merely a collection of texts in digital form (as is stated, for example, ...
Duško Vitas, Ranka Stanković, Cvetana Krstev. "The Many Faces of SrpKor" in South Slavic Languages in the Digital Environment JuDig Book of Abstracts, University of Belgrade - Faculty of Philology, Serbia, November 21-23, 2024, University of Belgrade - Faculty of Philology (2024)
Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian
The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on TreeTagger and spaCy taggers, and the annotation schema alignment ...
... performed for the production of Serbian MULTEXT-East resources (Krstev et al., 2004)). 2.2. Pre-annotated texts Various pre-annotated texts were used in this research for training and testing. These texts were tagged mainly using SMD (and its tagset) and the Unitex system, with manually performed d ...
... All texts had to be mapped to tagsets used by the existing tagger model TT11 and the two new tagger models TT19 and SerSpaCy (see Subsection 3.3.). Although most of the texts were tagged with SMD before mapping to some other tagset, the initial SMD version was not available for all texts (e.g. ...
... et al., 2006). It contains texts from law, health and education domains. Švejk, Floods, History are three short texts selected, respectively, from ... [Footnote 1: Unitex/GramLab, Cross-Platform Corpus Processing Suite, https://unitexgramlab.org/] [Footnote 2: The category of gender is relevant only for some verbal forms.]
Ranka Stanković, Branislava Šandrih, Cvetana Krstev, Miloš Utvić, Mihailo Škorić. "Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian" in Proceedings of the 12th Language Resources and Evaluation Conference, May 2020, Marseille, France, European Language Resources Association (2020)
Serbian NER&Beyond: The Archaic and the Modern Intertwinned
In this paper we present the Serbian literary corpus being developed under the umbrella of COST Action "Distant Reading for European Literary History" CA16204. Using this corpus of novels written more than a century ago, we developed and made publicly available a Named Entity Recognition (NER) system trained to recognize 7 different types of named entities, using a convolutional neural network (CNN), which achieves an F1 score of ≈91% on the test dataset. This model was further evaluated on a separate evaluation dataset. We conclude with a comparison ...
... NEs in 1253 newspapers and similar texts. It was manually evaluated on a sample of unseen newspaper texts. The overall F1 score of the model was ≈ 96%. To the best of our knowledge, so far there were no attempts to produce a NER system for Serbian literary texts. The enhanced version of SrpNER was la- ...
... forms satisfactorily on similar texts, which can be seen from the model’s performance on the test set displayed in Table 3. Since this collection of novels contains very diverse texts, both lexically and syntactically, SrpCNNER did not generalize that well on unseen texts. 6 Conclusions and Future Work ...
... dubbed SrpNER, that we will describe in Section 2 together with some approaches to NE recognition in literary texts. This SrpNER model was applied to the raw version of the selected texts from SrpELTeC collection, presented in Section 3. Based on the specifically tailored guidelines, different evaluators ...
Branislava Šandrih Todorović, Cvetana Krstev, Ranka Stanković, Milica Ikonić Nešić. "Serbian NER&Beyond: The Archaic and the Modern Intertwinned" in Proceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Methods and Applications, INCOMA Ltd. Shoumen, BULGARIA (2021). https://doi.org/10.26615/978-954-452-072-4_141
Keyword Extraction from Parallel Abstracts of Scientific Publications
... author(s), publication date, title, keywords, abstract etc.) and are aligned at the sentence level [15,16]. For the research presented in this paper, we used a collection of 50 bilingual documents with approximately 4,800 aligned sentences. Since papers were published bilingually, they were already ...
... English, where most of the papers were originally written in Serbian and then translated into English by professional translators. Texts have various lengths, in Serbian the texts contain from 34 to 259 words (on average 100) and in English from 44 to 286 words (on average 110). The statistics of the used ...
... of annotated keywords ranges from 3 to 18 in the Serbian and from 3 to 15 in the English texts (the average in both is 7). Scientists usually define keywords in their lemmatized form, while in the Serbian texts (and rarely in English) they appear in many inflected forms, which are different from lemma ...
Slobodan Beliga, Olivera Kitanović, Ranka Stanković, Sanda Martinčić-Ipšić. "Keyword Extraction from Parallel Abstracts of Scientific Publications" in Semantic Keyword-Based Search on Structured Data Sources - Third International KEYSTONE Conference, IKC 2017 Gdańsk, Poland, September 11–12, 2017 Revised Selected Papers and COST Action IC1302 Reports, Springer (2017)
Development and Evaluation of Three Named Entity Recognition Systems for Serbian - The Case of Personal Names
In this paper we present a rule- and lexicon-based system for the recognition of Named Entities (NE) in Serbian newspaper texts that was used to prepare a gold standard annotated with personal names. It was further used to prepare training sets for four different levels of annotation, which were further used to train two Named Entity Recognition (NER) systems: Stanford and spaCy. All obtained models, together with a rule- and lexicon-based system were evaluated on ...
... system. The important next step is the enhancement of our newspaper corpus with other types of text (Wikipedia articles, domain texts, literary texts). The literary texts would be particularly important for improving the recognition of first names. Finally, another intended step is Entity Linking ...
... by the considerably smaller number of these tags in training texts compared to other tags (see Table 4). As for SRPNER one can presume that developers devoted less effort to this entity type occurring only occasionally in newspaper texts. Similarly, in all experiment settings, the recognition of ...
... used for the first time for the recognition of personal names in Serbian texts. Ljubešić et al. (2013) used STANFORD NER to build models for Croatian and Slovene. When they used distributional similarity to improve results, on texts coming from different sources they obtained the following results: for ...
Branislava Šandrih, Cvetana Krstev, Ranka Stanković. "Development and Evaluation of Three Named Entity Recognition Systems for Serbian - The Case of Personal Names" in Proceedings - Natural Language Processing in a Deep Learning World, Incoma Ltd., Shoumen, Bulgaria (2019). https://doi.org/10.26615/978-954-452-056-4_122
Bilingual lexical extraction based on word alignment for improving corpus search
Jelena Andonovski, Branislava Šandrih, Olivera Kitanović. "Bilingual lexical extraction based on word alignment for improving corpus search" in The Electronic Library, Emerald (2019). https://doi.org/10.1108/EL-03-2019-0056
A Tool for Enhanced Search of Multilingual Digital Libraries of E-journals
This paper outlines the main features of Bibliša, a tool that offers various possibilities of enhancing queries submitted to large collections of TMX documents generated from aligned parallel articles residing in multilingual digital libraries of e-journals. The queries initiated by a simple or multiword keyword, in Serbian or English, can be expanded by Bibliša, both semantically and morphologically, using different supporting monolingual and multilingual resources, such as wordnets and electronic dictionaries. The tool operates within a complex system composed ...
... proper name databases, which enables, among other things, versatile handling of both monolingual and aligned or comparable texts. LeXimir provides for enhanced querying of aligned texts by using available lexical resources to perform semantic and morphological expansion of queries. The tool ...
... for search of document collections consisting of aligned parallel texts converted in TMX (Translation Memory eXchange) format. TMX is an open XML-based standard intended for easier exchange of translation memory data, that is, aligned parallel texts, between tools and translation vendors [TMX ...
... development environment for generating aligned parallel texts. It is basically a front-end for two alignment tools developed by LORIA (Laboratoire lorrain de recherche en informatique et ses applications), one for automatic sentence alignment of texts (Xalign, http://led.loria.fr/outils/A ...
Ranka Stanković, Cvetana Krstev, Ivan Obradović, Aleksandra Trtovac, Miloš Utvić. "A Tool for Enhanced Search of Multilingual Digital Libraries of E-journals" in Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, May 2012, Istanbul, Turkey, Istanbul, Turkey : European Language Resources Association (2012)
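The snippets above describe TMX as an open XML-based format for aligned parallel texts. As a rough illustration only, the sketch below reads sentence pairs from a TMX-like document with Python's standard library and runs a plain keyword search over them. The sample document, the function names `read_aligned_pairs` and `search_pairs`, and the language codes are hypothetical and invented for this example; this is not Bibliša's implementation and it performs no semantic or morphological query expansion.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit sample, invented for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Digital libraries store journals.</seg></tuv>
      <tuv xml:lang="sr"><seg>Digitalne biblioteke čuvaju časopise.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>The texts are aligned.</seg></tuv>
      <tuv xml:lang="sr"><seg>Tekstovi su poravnati.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_aligned_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for every translation unit
    that contains both requested languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


def search_pairs(pairs, keyword):
    """Keep pairs in which either side contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [p for p in pairs if needle in p[0].lower() or needle in p[1].lower()]


pairs = read_aligned_pairs(SAMPLE_TMX, "en", "sr")
hits = search_pairs(pairs, "aligned")
```

Searching either side of each pair mirrors the idea of querying aligned texts with a keyword from one of the two languages; a real tool would expand the keyword first, which this sketch leaves out.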
Old or New, We Repair, Adjust and Alter (Texts)
Cvetana Krstev, Ranka Stanković (2020)
In this paper we present how e-dictionaries and cascades of finite-state transducers, as implemented in Unitex, can be used to solve three text transformation problems: correction of texts after OCR, restoration of diacritics, and switching between different language variants.
Keywords: text correction, OCR errors, diacritic restoration, language variants, electronic dictionary, finite-state transducers
... containing only texts written in Ekavian pronunciation and the other containing only texts written in Ijekavian pronunciation. In the case of multiple corrections, they are merged in one entry, as in the SRP_DR dictionary. Specific problems may arise with multiple corrections when transforming texts in either ...
... adjust and alter (texts) UDC 811.163.41’322.2: 004.9 DOI 10.18485/infotheca.2019.19.2.3 ABSTRACT: In this paper we present how e-dictionaries and cascades of finite-state transducers, as implemented in Unitex, can be used to solve three text transformation problems: correction of texts after OCR, restora- ...
Cvetana Krstev, Ranka Stanković. "Old or New, We Repair, Adjust and Alter (Texts)" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.3
A Method for Extracting Translational Equivalents from Aligned Texts
Obradović Ivan (2013)
Obradović Ivan. "A Method for Extracting Translational Equivalents from Aligned Texts" in Methods and Applications of Quantitative Linguistics, Selected papers of the 8th International Conference on Quantitative Linguistics (QUALICO) in Belgrade, Serbia, April 26-29, 2012, Ivan Obradović, Emmerich Kelih, Reinhard Köhler (eds.), University of Belgrade & Academic Mind (2013): 119-129
An Integrated Environment for Management and Exploitation of Linguistic Resources
Ranka Stanković, Ivan Obradović (2009)
... possibility of adding hypernym literals. D. Aligned texts WS4LR contains a module for processing of parallel texts which have previously been aligned using the text alignment tool XAlign. The module enables the transformation of texts aligned by XAlign into different formats: textual ...
... is publicly available [3]. C. Parallel and aligned texts Although monolingual parallel texts exist, parallel texts are as a rule bilingual, composed of one original text and its translation into another language. Thus, they represent two texts having the same content, but in two different ...
... different languages. The majority of parallel texts collected within the HLT Group are aligned, with Serbian most often being one of the languages. The procedure of transforming parallel texts into aligned texts followed two basic steps with the goal of connecting equivalent segments ...
Ranka Stanković, Ivan Obradović. "An Integrated Environment for Management and Exploitation of Linguistic Resources" in Proceedings of the International Multiconference on Computer Science and Information Technology, Computational Linguistics – Applications Workshop (CLA09), Mrągowo, Poland, October 2009, Piscataway : IEEE (2009)
The Nooj System as Module within an Integrated Language Processing Environment
... alignment of multilingual texts. WS4LR handles aligned texts as well. A pair of semantically equivalent texts in different languages, such as an original text and its translation, that are aligned on a structural level (paragraph, sentence, phrase, etc.) is known as an aligned text or bitext. One ...
... WS4LR module for management of aligned parallel texts uses texts which have previously been aligned using Xalign as an alignment tool (Bonhomme 2001). Parallel texts which usually originate from a text in one language and its translation in another, are often aligned at a certain level (paragraph ...
... translation. The module converts these texts to the Translation Memory eXchange (TMX) format, which is becoming the standard format for aligned texts. Figure 7 depicts the form with different possibilities for TMX document management. Aligned texts can be visualized in various ways by choosing ...
Ranka Stanković, Duško Vitas, Cvetana Krstev. "The Nooj System as Module within an Integrated Language Processing Environment" in Proceedings of the 2007 International Nooj Conference, Cambridge Scholars Publishing (2008)
WS4LR - a Workstation for Lexical Resources
... in Appendix B. 2.3 Aligned Texts A pair of semantically equivalent texts in different languages, such as an original text and its translation, that are aligned on a structural level (paragraph, sentence, phrase, etc.) is known as an aligned text or bitext. Aligned texts are usually constructed ...
... chosen synset in a text, with or without synset hypernyms. 3.4 Working with Aligned Texts The module uses texts which have previously been aligned using Xalign as an alignment tool and converts them to TMX format, or texts that are already in that format. By choosing the appropriate XSLT stylesheet ...
... step, the texts to be aligned are segmented into equivalent units, and in the second step the correspondence between these units is established. The equivalent units are usually sentences, but the units can be larger, as well as smaller. The standard method for representing aligned texts is the ...
Cvetana Krstev, Ranka Stanković, Duško Vitas, Ivan Obradović. "WS4LR - a Workstation for Lexical Resources" in Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, Italy, May 2006, ELRA - European Language Resources Association (2006)
Social-Emo.Sr: Emotional Multi-Label Categorization of Conversational Messages from Social Networks X and Reddit
In the digital environment of South Slavic languages, emotion analysis of texts on social networks is becoming increasingly important for understanding public opinion, creating personalized content, and analyzing interactions among users. In this paper we present a detailed methodology and the results of annotating a Serbian-language corpus according to Plutchik's categorization model, which recognizes eight basic emotional categories: joy, sadness, anger, fear, trust, disgust, anticipation, and surprise. The aim of the research is to analyze the emotional content of texts taken from the social networks X (formerly Twitter) ...
Milena Šošić, Ranka Stanković, Jelena Graovac. "Social-Emo.Sr: Emotional Multi-Label Categorization of Conversational Messages from Social Networks X and Reddit" in South Slavic Languages in the Digital Environment JuDig Book of Abstracts, University of Belgrade - Faculty of Philology, Serbia, November 21-23, 2024, University of Belgrade - Faculty of Philology (2024)
Digital Library From A Domain Of Criminalistics As A Foundation For A Forensic Text Analysis
This paper presents a model that enables the collection, preparation, metadata description, management, and exploitation, including full-text search, of documents from the domain of criminalistics written in Serbian. The proposed approach is applied to a web portal that gathers various texts originating from the journal of the Academy of Criminalistic and Police Studies, the Criminal Code of Serbia, the "Tara" and "Reiss" conferences, as well as from some doctoral dissertations related to this field of research. After text processing, a corpus containing more than 5,500 pages of plain text was created and ...
... FORENSIC LINGUISTICS The linguistic study of forensic texts is a part of the field of Natural Language Processing, which includes text types classification and syntax and semantic analysis of texts written in a natural language. Various texts are subject of the study: Acts of Parliament (or other ...
... l dictionaries cover large lexica, but each special domain has characteristic words that occur only occasionally in ordinary texts, but frequently in domain-specific texts. That is the case with the presented collection. Among unrecognized tokens were terms: [Footnote 18: I. Obradović, R. Stanković, “Wordnet ...]
... and „upad“ (intrusion) have negative sentiment polarity scores (0.75 and 0.125, respectively), which makes it possible to classify texts containing these terms as „forensic texts“. [Footnote 19: Ranka Stanković, Cvetana Krstev, Ivan Obradović, Biljana Lazić, and Aleksandra Trtovac, “Rule-based Automatic ...]
Dalibor Vorkapić, Aleksandra Tomašević, Miljana Mladenović, Ranka Stanković, Nikola Vulović. "Digital Library From A Domain Of Criminalistics As A Foundation For A Forensic Text Analysis" in International Scientific Conference “Archibald Reiss Days” Thematic Conference Proceedings Of International Significance, Belgrade, 7-9 November 2017, Academy Of Criminalistic And Police Studies Belgrade (2017)
Contrastive Analysis of Syntax Patterns in Comparable Football Corpora in Spanish and Serbian Languages
Jelena Lazarević, Olivera Kitanović (2024)
The aim of the paper is to investigate collocability as the way lexical units combine with words from different categories to form larger units. The study of the semantic and syntactic principles of these combinations in the Spanish and Serbian language of football was carried out on the comparable football corpora SrFudKo and EsFudko, developed as part of Jelena Lazarević's doctoral dissertation entitled Jezičke odlike diskursa novih medija o fudbalu: kontrastivna analiza na korpusu srpskog i španskog jezika (Linguistic features of new media discourse on football: a contrastive analysis on a corpus of Serbian and Spanish). The football corpus SrFudKo, created from texts about football from five Serbian web portals: ...
Jelena Lazarević, Olivera Kitanović. "Contrastive Analysis of Syntax Patterns in Comparable Football Corpora in Spanish and Serbian Languages" in South Slavic Languages in the Digital Environment JuDig Book of Abstracts, University of Belgrade - Faculty of Philology, Serbia, November 21-23, 2024, University of Belgrade - Faculty of Philology (2024)