345 items
Transformer-Based Composite Language Models for Text Evaluation and Classification
Parallel natural language processing systems were previously successfully tested on the tasks of part-of-speech tagging and authorship attribution through mini-language modeling, for which they achieved significantly better results than independent methods in the cases of seven European languages. The aim of this paper is to present the advantages of using composite language models in the processing and evaluation of texts written in arbitrary highly inflective and morphology-rich natural language, particularly Serbian. A perplexity-based dataset, the main asset for the ...Mihailo Škorić, Miloš Utvić, Ranka Stanković. "Transformer-Based Composite Language Models for Text Evaluation and Classification" in Mathematics, MDPI AG (2023). https://doi.org/10.3390/math11224660
A WordNet Ontology in Improving Searches of Digital Dialect Dictionary
In this paper, we present a method for automatic generation of a digital resource, which connects all indirect synonyms of a dialect term to all indirect synonyms of a corresponding term in the standard language, aiming to improve the search of a digital dialect dictionary. The method uses SWRL rules defined in the Serbian WordNet ontology to identify sets of synonymous words. It also uses e-dictionaries to produce correct lemmas in standard language that users usually employ in searches. ...... definition. For lemmatization task we used Serbian morphological electronic dictionaries and grammars developed within the University of Belgrade Human Language Technology Group [14]. Morphological electronic dictionaries of Serbian for NLP are being developed for many years now. In the dictionary of lemmas ...
... user not familiar with a dialect. This problem often encountered by students of a foreign language can be solved by explaining terms not known in a foreign language by expressing the same concepts in a language they are familiar with. Two other ways of search (search by creating a logical query over ...
... [Figure residue, flowchart labels: "standard language morphological transformations for lemma generation"; "Extract definitions of verbs in a dialect dictionary, given in standard language"; "Index inverting"; "Table: dictionary verb entry related with equivalent standard language lemma of a verb"] ...Miljana Mladenović, Ranka Stanković, Cvetana Krstev. "A WordNet Ontology in Improving Searches of Digital Dialect Dictionary" in New Trends in Databases and Information Systems: ADBIS 2017 Short Papers and Workshops - SW4CH (Semantic Web for Cultural Heritage) 767, Springer International Publishing (2017). https://doi.org/10.1007/978-3-319-67162-8_37
Multi-word Expressions for Abusive Speech Detection in Serbian
This paper presents research on refining and improving the Serbian version of Hurtlex, a multilingual lexicon of offensive words. We pay special attention to adding multi-word expressions that can be considered offensive, since such lexical entries are very important for achieving good results in a variety of abusive language detection tasks. Serbian morphological dictionaries are used as the basis for data cleaning and dictionary creation. The connection with other lexical and semantic resources for Serbian is highlighted, and the building of a system for ...... it is clear that hate speech is a complex social and linguistic phenomenon. Abusive language and its detection have been gaining more attention recently. Caselli et al. (2020) define abusive language as ‘hurtful language that a speaker uses to insult or offend another individual or a group of individuals ...
... statements, or actions. This might include hate speech, derogatory language, profanity, toxic comments, racist and sexist statements.’ Computational processing of such language requires usage of finely-tuned, task specific language tools and resources, especially for morphologically rich and low-resource ...
... that will facilitate abusive language detection already exist. Serbian Morphological Dictionaries are certainly a staple in processing texts in Serbian (Krstev, 2008). In order to process implicitly abusive language, we need to take into account the usage of non-literal language, the rhetorical devices that ...Ranka Stanković, Jelena Mitrović, Danka Jokić, Cvetana Krstev. "Multi-word Expressions for Abusive Speech Detection in Serbian" in Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, Association for Computational Linguistics (2020)
Two approaches to compilation of bilingual multi-word terminology lists from lexical resources
In this paper, we present two approaches and the implemented system for bilingual terminology extraction that rely on an aligned bilingual domain corpus, a terminology extractor for a target language, and a tool for chunk alignment. The two approaches differ in the way terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second one uses a term extraction tool. For both approaches, four experiments were performed with two parameters being ...Branislava Šandrih, Cvetana Krstev, Ranka Stanković. "Two approaches to compilation of bilingual multi-word terminology lists from lexical resources" in Natural Language Engineering, Cambridge University Press (CUP) (2020). https://doi.org/10.1017/S1351324919000615
Using English Baits to Catch Serbian Multi-Word Terminology
In this paper we present the first results in bilingual terminology extraction. The hypothesis of our approach is that if for a source language domain terminology exists as well as a domain aligned corpus for a source and a target language, then it is possible to extract the terminology for a target language. Our approach relies on several resources and tools: aligned domain texts, domain terminology for a source language, a terminology extractor for a target language, and a ...aligned texts, word alignment, terminology extraction, electronic dictionaries, morphological inflection... source language domain terminology exists as well as a domain aligned corpus for a source and a target language, then it is possible to extract the terminology for a target language. Our approach relies on several resources and tools: aligned domain texts, domain terminology for a source language, a t ...
... MWEs are identified (in a source or a target language) in various ways: some authors use morphosyntactic patterns on lemmatized and POS-tagged texts [footnote 2: In this paper we will call ‘source’ language a well-resourced language (English), and ‘target’ language a less-resourced language (Serbian).] ...
... terminology extractor for a target language, and a tool for word and chunk alignment. In this first experiment a source language is English, a target language is Serbian, a domain is Library and Information Science for which a bilingual terminological dictionary exists. Our term extractor is based on ...Cvetana Krstev, Branislava Šandrih, Ranka Stanković. "Using English Baits to Catch Serbian Multi-Word Terminology" in Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA) (2018)
Веб-алат за управљање грађом Речника САНУ и анотација листића [A Web Tool for Managing the SANU Dictionary Material and Annotating Slips]
The material on which the SANU Dictionary of the Serbo-Croatian Literary and Vernacular Language is based, which comprises material from over 4,500 written sources and 300 manuscript word collections from the vernacular of the Shtokavian dialect area, is recorded on about 5,000,000 slips. This rich lexical material, which covers the literary and vernacular language of the past two centuries and from which at least 15 more volumes of the Dictionary are yet to be written, also offers opportunities for a variety of linguistic and extralinguistic research. For this reason, work began on ...Рада Стијовић, Ранка Станковић, Михаило Шкорић. "Веб-алат за управљање грађом Речника САНУ и анотација листића" in Rasprave Instituta za hrvatski jezik i jezikoslovlje, Institute of Croatian Language and Linguistics (2020). https://doi.org/10.31724/rihjj.46.2.32
Keyword Extraction from Parallel Abstracts of Scientific Publications
... Serbian language we use: (1) Stop-word list - prepared at the Human Language Technology Group at the University of Belgrade [30], and (2) a Serbian lemmatizer. For lemmatization, we use Serbian morphological electronic dictionaries and grammars developed within the University of Belgrade Human Language ...
... for the English language (see Table 2). This is in line with our previous findings for the Croatian language [8]. Both, Serbian and Croatian language are morphologically rich, and closely related languages from South Slavic language family. Unlike English, which is inflectional language and has a strict ...
... English and 46.73% for the Serbian language, if we disregard keywords that are not present in the abstracts. In case that we evaluate against the whole keyword set, the F1 scores are 40.08% and 45.71% respectively. This work shows that SBKE can be easily ported to a new language, domain and type of text in ...Slobodan Beliga, Olivera Kitanović, Ranka Stanković, Sanda Martinčić-Ipšić. "Keyword Extraction from Parallel Abstracts of Scientific Publications" in Semantic Keyword-Based Search on Structured Data Sources - Third International KEYSTONE Conference, IKC 2017 Gdańsk, Poland, September 11–12, 2017 Revised Selected Papers and COST Action IC1302 Reports, Springer (2017)
The Nooj System as Module within an Integrated Language Processing Environment
... that contains NooJ as one of its main modules. This environment named WS4LR (WorkStation for Lexical Resources) has been developed within the Human Language Technology Group (HLT) at the Faculty of Mathematics, University of Belgrade, and is aimed at manipulating heterogeneous lexical resources ...
... inflectional graphs: ‘document, dokumenta, dokumentu, dokumentom,..’ 2. Integrated environment for linguistic research 2.1. Motivation The Human Language Technology group has been developing a variety of lexical resources over a long period, reaching a considerable volume to date. These resources ...
Ranka Stanković, Duško Vitas, Cvetana Krstev. "The Nooj System as Module within an Integrated Language Processing Environment" in Proceedings of the 2007 International Nooj Conference, Cambridge Scholars Publishing (2008)
Frequency and Length of Syllables in Serbian
Marija Radojičić, Biljana Lazić, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Ján Mačutek, Lívia Leššová (2019) Basic analyses of several properties of syllables (the rank-frequency distribution, the distribution of length, and the relation between length and frequency) in Serbian are presented. The syllabification algorithm used combines the maximum onset principle and the sonority hierarchy. Results indicate that syllables behave similarly to words as far as mathematical models are concerned, but values of parameters in models for syllables are quite different from those for words. ... 3. Language material Serbian is a South Slavic language. It has the official status in Serbia (exclusively) and in Bosnia and Herzegovina (as one of three languages, together with Bosnian and Croatian), and the status of a minority language in several other countries. Given ...
... above, with a general syllable definition lacking, a scientist can apply language-specific rules for syllabification (e.g. using morpheme borders as one of the criteria for syllable borders). While the application of language-specific rules is not bad per se, if one wants to compare models, parameter ...
... approach to all languages under investigation is indispensable. If a language allows only open syllables (such as Old Slavonic, cf. Rottmann, 1999), the syllabification is straightforward (provided that diphthongs – if the language under investigation contains any – can be reliably distinguished from ...Marija Radojičić, Biljana Lazić, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Ján Mačutek, Lívia Leššová. "Frequency and Length of Syllables in Serbian" in Glottometrics (2019)
Wordnet Development Using a Multifunctional Tool
Ivan Obradović, Ranka Stanković (2007) In this paper we present a multifunctional tool for manipulating heterogeneous language resources. The tool handles electronic dictionaries, wordnets and aligned texts, and provides for their synchronous use in various tasks. We focus here on the description of the possibilities this tool offers in the development of wordnets. Besides the wordnet module which enables parallel handling of two wordnets, other modules, such as the module for morphological dictionaries and the module for aligned texts, as well as available finite ... ... [footnote 3: http://www.illc.uva.nl/EuroWordNet/sample.html] [footnote 4: http://nlp.fi.muni.cz/projekty/visdic/] 3. A Multifunctional Language Resource Tool 3.1 Motivation The Human Language Technology group at the University of Belgrade has been developing various lexical resources over quite a long period ...
... match in the target language, regardless of the fact whether these target language synsets have previously been retrieved from the wordnet by the user or not, and which PWN synsets do not have a match. The latter are obviously candidates for new synsets in the target language. Figure 9. ...
... dictionary and hierarchical thesaurus for a particular language, opens two critical issues. The first pertains to the organization of the conceptual network. Simply put, the issue is how to define the concepts for a particular language and how to establish links among them? In other words, ...Ivan Obradović, Ranka Stanković. "Wordnet Development Using a Multifunctional Tool" in Proceedings of the International Workshop Computer Aided Language Processing (CALP) '2007, Borovets, Bulgaria, September 2007 (2007)
Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution
This paper explores the effectiveness of parallel stylometric document embeddings in solving the authorship attribution task by testing a novel approach on literary texts in 7 different languages, totaling 7051 unique 10,000-token chunks from 700 PoS and lemma annotated documents. We used these documents to produce four document embedding models using Stylo R package (word-based, lemma-based, PoS-trigrams-based, and PoS-mask-based) and one document embedding model using mBERT for each of the seven languages. We created further derivations of these ...Mihailo Škorić, Ranka Stanković, Milica Ikonić Nešić, Joanna Byszuk, Maciej Eder. "Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution" in Mathematics, MDPI AG (2022). https://doi.org/10.3390/math10050838
WS4LR - a Workstation for Lexical Resources
... Abstract: In this paper we describe WS4LR, the workstation for lexical resources, a software tool developed within the Human Language Technology Group at the Faculty of Mathematics, University of Belgrade. The tool is aimed at manipulating heterogeneous lexical resources, and ...
... and runs on a personal computer under Windows 2000/XP/2003 operating system with at least 256MB of internal memory. 1 Introduction The Human Language Technology group at the Faculty of Mathematics has been developing various lexical resources over quite a long period, reaching a considerable ...
... criteria in the source language are highlighted (Figure 5). Figure 4. The form for expansion of the search criteria The user can also use the translation equivalence option which is aimed at locating equivalences in target language for occurrences found in the source language. This is done on ...Cvetana Krstev, Ranka Stanković, Duško Vitas, Ivan Obradović. "WS4LR - a Workstation for Lexical Resources" in Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, Italy, May 2006, ELRA - European Language Resources Association (2006)
From DELA Based Dictionary to Leximirka Lexical Database
Biljana Lazić, Mihailo Škorić (2020) In this paper, we will present an approach in transforming Serbian language Morphological dictionaries from a DELA text format to a lexical database dubbed Leximirka. Considering the benefits of storing data within a database when compared to storing them in textual documents, we will outline some of the functionality that the database has made possible. We will also show how hand-made rules that use category labels lexical entries are marked with can be used to link lexical entries. ... ... Framework - LMF). LMF is designed for lexicons specially designed for Natural Language Processing and Machine-Readable Dictionaries. LMF specification is represented as a subset of UML (Unified Modeling Language) language that provides linguistic description. The LMF consists of mandatory Core package ...
... used for natural language processing - NLP. ... The LMF prescribes a standardized framework for recording linguistic information in computer lexicons and is based on the Standard ISO 24613: 2008 (Language Resource Management ...
... in the Serbian Corpus of the Serbian Language SrbCorp (version of 122 million words by Duško Vitas and Miloš Utvić). Information about the Corpus is stored in the KorpusMeta table. The LexicalRelation table stores information ...Biljana Lazić, Mihailo Škorić. "From DELA Based Dictionary to Leximirka Lexical Database" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.4
Digital Library From A Domain Of Criminalistics As A Foundation For A Forensic Text Analysis
This paper presents a model that enables the collection, preparation, metadata description, management, and exploitation, including full-text search, of documents from the domain of criminalistics written in Serbian. The proposed approach is applied on a web portal that gathers various texts originating from the journal of the Academy of Criminalistic and Police Studies, the Criminal Code of Serbia, the “Tara” and “Reiss” conferences, as well as from some doctoral dissertations related to this field of research. After text processing, a corpus containing over 5,500 pages of plain text was created and ...... they are not (e.g. simply provocative). 8 SOFTWARE SOLUTIONS MODEL The human language processing group (HLT group) at the University of Belgrade is engaged for many years now in a task of producing various language resources, both corpora and lexicons. Given the fact that these resources have ...
... LINGUISTICS The linguistic study of forensic texts is a part of the field of Natural Language Processing, which includes text types classification and syntax and semantic analysis of texts written in a natural language. Various texts are subject of the study: Acts of Parliament (or other law-making ...
... To keep development and use of the applications and resources at the same time, without frequent conversions, the strategy for the development was to support original formats used in another software tools for language resources processing (Unitex, WorNet ...Dalibor Vorkapić, Aleksandra Tomašević, Miljana Mladenović, Ranka Stanković, Nikola Vulović. "Digital Library From A Domain Of Criminalistics As A Foundation For A Forensic Text Analysis" in International Scientific Conference “Archibald Reiss Days” Thematic Conference Proceedings Of International Significance, Belgrade, 7-9 November 2017, Academy Of Criminalistic And Police Studies Belgrade (2017)
An Approach to Development of Bilingual Lexical Resources
... keywords. The paper also outlines linguistic criteria used for building language resources for French, Italian, and German, and the use of multi-term descriptors as a means to better identify the content. The Human Language Technology group at the University of Belgrade developed Bibliša (http://hlt ...
... Multilingual textual repositories, such as digital libraries of e-journals represent a specific type of language resources. Efficient search of these resources usually relies on specific language tools, which often use other available resources, such as e-dictionaries, wordnets and the like. An ...
... University of Novi Sad. ... language resources such as grammars in the form of finite automata and transducers, as well as various lexical resources. Bibliša is able to expand search queries both morphologically and semantically, as well as to another language. One type of lexical resources ...Stanković Ranka, Obradović Ivan, Trtovac Aleksandra. "An Approach to Development of Bilingual Lexical Resources" in Proceedings of the Fifth Balkan Conference in Informatics BCI 2012, Workshop on Computational Linguistics and Natural Language Processing of Balkan Languages – CLoBL 2012, September 2012, Novi Sad : BCI (2012)
Using Query Expansion for Cross-Lingual Mathematical Terminology Extraction
Velislava Stoykova, Ranka Stanković. "Using Query Expansion for Cross-Lingual Mathematical Terminology Extraction" in Advances in Intelligent Systems and Computing, Springer International Publishing (2018). https://doi.org/10.1007/978-3-319-91189-2_16
Using technology for knowledge transfer between academia and enterprises
Ivan Obradović, Ranka Stanković (2014) ... TEL platform consists of tools and resources: learning, language and implementation resources. Among the tools some are available open source and commercial tools, some are in-house tools developed by the University of Belgrade Human Language Technology Group. Learning resources are both academic: ...
... 4 The language support system The need for multilinguality of OER is a combined effect of globalization and European integration, favoring a holistic approach that takes into account all the languages a learner may use, as opposed to the more traditional approach looking at one language at a time ...
... The language support system, whose structure is outlined in Figure 3, is based on electronic language resources, namely, lexical resources, textual resources and grammars. Bilingual dictionaries in electronic ...Ivan Obradović, Ranka Stanković. "Using technology for knowledge transfer between academia and enterprises" in Knowledge and Management Models for Sustainable Growth, Proc. of IFKAD 2014, 9th International Forum on Knowledge Asset Dynamics, 11-13 June 2013, Matera, Italy, Bari : IFKAD (2014)
GIS Application Improvement with Multilingual Lexical and Terminological Resources
... Abstract: This paper introduces the results of integration of lexical and terminological resources, most of them developed within the Human Language Technology (HLT) Group at the University of Belgrade, with the Geological information system of Serbia (GeolISS) developed at the Faculty of Mining ...
... The research described in this paper is based on an integration of lexical and terminological resources, most of them developed within the Human Language Technology (HLT) Group at the University of Belgrade, and the Geological information system of Serbia (GeolISS), developed at the Faculty ...
... tool, a workstation for language resources, named WS4LR, which greatly enhances the potential of manipulating each particular resource as well as several resources simultaneously (Krstev et al., 2008). This tool has already been successfully used for various language processing related tasks ...Ranka Stanković, Ivan Obradović, Olivera Kitanović. "GIS Application Improvement with Multilingual Lexical and Terminological Resources" in Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC 2010, Valetta, Malta, May 2010, Valetta, Malta : European Language Resources Association (2010)
Development of integrated fuzzy model for mine management optimization
Miodrag Čelebić, Sanja Bajić, Dragoljub Bajić, Dejan Stevanović, Duško Torbica, Vladimir Malbašić (2023) ... inaccuracies. As a result, subjective evaluation by engineers and expert experience have become increasingly important. Given that the natural language used by miners and geologists is most suited for relaying knowledge and expressing opinions, the paper tests a fuzzy optimization methodology ...
... physical and mechanical rock parameters, or environmental concerns. Likewise, the proposed methodology can be applied to consider other mining technologies when selecting the optimal alternative. Acknowledgements. The authors express their gratitude to the Ministry of Science, Technological ...
... 97, 89-117. CHEN H. (2006) Applications of Fuzzy Logic in Data Mining Process. In: Bai Y., Zhuang H., Wang D. (eds), Advanced Fuzzy Logic Technologies in Industrial Applications, Advances in Industrial Control, London, Springer, DOI: 10.1007/978-1-84628-469-4_17. BAJIĆ S., D. BAJIĆ, B. GLUŠČEVIĆ ...Miodrag Čelebić, Sanja Bajić, Dragoljub Bajić, Dejan Stevanović, Duško Torbica, Vladimir Malbašić. "Development of integrated fuzzy model for mine management optimization" in Comptes rendus de l'Académie Bulgare des Sciences (2023)
Softverski alati za korišćenje resursa za srpski jezik [Software Tools for Using Resources for the Serbian Language]
Ivan Obradović, Ranka Stanković (2008) ... words of a particular language systematized and organized in a specific manner, are developed in various formats. Thus, for example, several different types of e-dictionaries, along with other lexical and textual resources, are being developed within the Human Language Technology Group, which ...
... BalkaNet languages are spoken, but also from France and Netherlands. A national development team was formed for each language, and in the case of Serbian this team was the Human Language Technology Group at the University of Belgrade. Upon the termination of this project, the development of SWN contin- ...
... with the acronym WS4LR (Workstation for Lexical Abstract: In this paper we describe how lexical resources for Serbian, developed within the Human Language Technology Group, such as various types of electronic dictionaries and aligned texts, can be further refined and used for different purposes ...Ivan Obradović, Ranka Stanković. "Softverski alati za korišćenje resursa za srpski jezik" in INFOteka: časopis za informatiku i bibliotekarstvo, Belgrade, Serbia : Zajednica biblioteka univerziteta u Srbiji (2008)