Rule-based Automatic Multi-word Term Extraction and Lemmatization

Object

Type: Paper in conference proceedings
Version of work: published version
Language: English
Creator: Ranka Stanković, Cvetana Krstev, Ivan Obradović, Biljana Lazić, Aleksandra Trtovac
Source: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, 23–28 May 2016
Editor: Nicoletta Calzolari et al.
Publisher: European Language Resources Association
Date of publication: 2016
Abstract: In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from the mining domain containing more than 600,000 simple word forms. Extracted and lemmatized multi-word terms are filtered in order to reject falsely offered lemmas and then ranked by introducing measures that combine linguistic and statistical information (C-Value, T-Score, LLR, and Keyness). Mean average precision for retrieval of MWU forms ranges from 0.789 to 0.804, while mean average precision of lemma production ranges from 0.956 to 0.960. The evaluation showed that 94% of distinct multi-word forms were evaluated as proper multi-word units, and among them 97% were associated with correct lemmas.
Start page: 507
End page: 514
ISBN: 978-2-9517408-9-1
Subject: term extraction, terminology, multi-word units, lemmatization, finite-state transducers
uri: http://www.lrec-conf.org/proceedings/lrec2016/pdf/1033_Paper.pdf
Broader category of work: M30
Narrower category of work: M33
Rights: Open access
License: Creative Commons – Attribution-NonCommercial-No Derivative Works 4.0 International
Format: .pdf
ORCID: https://orcid.org/0000-0001-5123-6273; https://orcid.org/0000-0001-5123-6273; https://orcid.org/0000-0002-9103-3902

Object collections: Ранка Станковић; Иван Обрадовић; Биљана Рујевић; Radovi istraživača

Media: pdf

Ranka Stanković, Cvetana Krstev, Ivan Obradović, Biljana Lazić, Aleksandra Trtovac. "Rule-based Automatic Multi-word Term Extraction and Lemmatization" in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, 23–28 May 2016, European Language Resources Association (2016)