Highly Language-Independent Word Lemmatization Using a Machine-Learning Classifier

Akhmetov, Iskander; Pak, Alexandr; Ualiyeva, Irina; Gelbukh, Alexander


Document title:	Highly Language-Independent Word Lemmatization Using a Machine-Learning Classifier
Journal:	Computación y sistemas
Database:
System number:	000560536
ISSN:	1405-5546
Authors:	Akhmetov, Iskander¹; Pak, Alexandr¹; Ualiyeva, Irina¹; Gelbukh, Alexander³
Institutions:	¹Institute of Information and Computational Technologies, Almaty, Kazakhstan; ²Kazakh-British Technical University, Almaty, Kazakhstan; ³Instituto Politécnico Nacional, Mexico City, Mexico
Year:	2020
Period:	Jul-Sep
Volume:	24
Issue:	3
Pages:	1353-1364
Country:	Mexico
Language:	English
Abstract (English):	Lemmatization is a process of finding the base morphological form (lemma) of a word. It is an important step in many natural language processing, information retrieval, and information extraction tasks, among others. We present an open-source language-independent lemmatizer based on the Random Forest classification model. This model is a supervised machine-learning algorithm with decision trees that are constructed corresponding to the grammatical features of the language. This lemmatizer does not require any manual work for hard-coding of the rules, and at the same time it is simple and interpretable. We compare the performance of our lemmatizer with that of the UDPipe lemmatizer on twenty-two out of twenty-five languages we work on for which UDPipe has models. Our lemmatization method shows good performance on different languages from various language groups, and it is easily extensible to other languages. The source code of our lemmatizer is publicly available.
Disciplines:	Computer science
Keywords (Spanish):	Inteligencia artificial (Artificial intelligence)
Keyword:	Lemmatization, Natural language processing, Text preprocessing, Random Forest classifier, Decision Tree classifier, Artificial intelligence
