POS Tagging without a Tagger: Using Aligned Corpora for Transferring Knowledge to Under-Resourced Languages



Document title: POS Tagging without a Tagger: Using Aligned Corpora for Transferring Knowledge to Under-Resourced Languages
Journal: Computación y sistemas
Database: PERIÓDICA
System number: 000410217
ISSN: 1405-5546
Authors: 1
1
1
Institutions: 1 University of Sfax, MIRACL Laboratory, Sfax, Tunisia
Year:
Period: Oct-Dec
Volume: 20
Issue: 4
Pages: 667-679
Country: Mexico
Language: English
Document type: Article
Approach: Experimental, applied
English abstract: Almost all languages lack sufficient resources and tools for developing Human Language Technologies (HLT). These technologies are mostly developed for languages for which large resources and tools are available. In this paper, we deal with under-resourced languages, which can benefit from the available resources and tools to develop their own HLT. We consider as an example the POS tagging task, which is among the most fundamental Natural Language Processing tasks. The task is important because it assigns to words tags that highlight their morphological features by considering the corresponding contexts. The solution that we propose in this research work is based on the use of an aligned parallel corpus as a bridge between a rich-resourced language and an under-resourced language. This kind of corpus is usually available. The rich-resourced language side of this corpus is annotated first. These POS annotations are then exploited to predict the annotation on the under-resourced language side by using alignment training. After this training step, we obtain a matching table between the two languages, which is exploited to annotate an input text. The experimentation of the proposed approach is performed for a pair of languages: English as a rich-resourced language and Arabic as an under-resourced language. We used the IWSLT10 training corpus and English TreeTagger 15. The approach was evaluated on the test corpus extracted from IWSLT08 and obtained an F-score of 89%. It can be extrapolated to other NLP tasks.
Disciplines: Computer science,
Literature and linguistics
Keywords (Spanish): Lingüística aplicada,
Lingüística computacional,
Alineación,
Marcaje,
Cuerpos paralelos
Keywords: Computer science,
Literature and linguistics,
Applied linguistics,
Computational linguistics,
Alignment,
Tagging,
Parallel corpus
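
The abstract describes projecting POS tags from the annotated rich-resourced side of an aligned corpus onto the under-resourced side, then using the resulting matching table to annotate new text. The following is only a minimal sketch of that general idea, not the authors' implementation. It assumes the word alignment links are already available as (source index, target index) pairs, keeps the most frequent projected tag for each target word, and uses invented transliterated tokens, an illustrative tag set, and hypothetical function names (`build_matching_table`, `tag`).

```python
from collections import Counter, defaultdict


def build_matching_table(aligned_corpus):
    """Project source-side POS tags onto target words through alignment links.

    aligned_corpus: iterable of (src_tags, tgt_tokens, links), where links is a
    list of (src_index, tgt_index) pairs. This input format is an assumption
    made for the sketch.
    """
    counts = defaultdict(Counter)
    for src_tags, tgt_tokens, links in aligned_corpus:
        for src_i, tgt_j in links:
            counts[tgt_tokens[tgt_j]][src_tags[src_i]] += 1
    # Keep the most frequently projected tag for each target word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, table, unknown="UNK"):
    """Annotate tokens with the table; unseen words get a placeholder tag."""
    return [(tok, table.get(tok, unknown)) for tok in tokens]


# Toy data (invented for illustration): source-side tags, target tokens, links.
corpus = [
    (["DT", "NN", "VBZ"], ["yanam", "alwalad"], [(1, 1), (2, 0)]),
    (["DT", "NN", "VBZ"], ["yanam", "alkalb"], [(1, 1), (2, 0)]),
]

table = build_matching_table(corpus)
tagged = tag(["yanam", "alkalb", "huna"], table)
print(tagged)
```

A lookup table of this kind ignores context, so an ambiguous target word always receives its single most frequent projected tag. The abstract notes that tags depend on context, so a fuller system would need to go beyond this sketch.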