A Comparison of Methods for Identifying the Translation of Words in a Comparable Corpus: Recipes and Limits



Document title: A Comparison of Methods for Identifying the Translation of Words in a Comparable Corpus: Recipes and Limits
Journal: Computación y sistemas
Database: PERIÓDICA
System number: 000411063
ISSN: 1405-5546
Authors: 1
1
Institutions: 1 Universite de Montreal, Departement d'Informatique, Montreal, Quebec, Canada
Year:
Period: Jul-Sep
Volume: 20
Issue: 3
Pages: 449-458
Country: Mexico
Language: English
Document type: Article
Approach: Experimental, applied
English abstract: Identifying translations in comparable corpora is a challenge that has attracted many researchers for a long time. It has applications in several areas, including Machine Translation and Cross-lingual Information Retrieval. In this study we compare three state-of-the-art approaches to this task: the so-called context-based projection method, the projection of monolingual word embeddings, and a method dedicated to identifying translations of rare words. We carefully explore the hyper-parameters of each method and measure their impact on the task of identifying the French translation of English words in Wikipedia. Contrary to standard practice, we designed a test case in which we do not resort to heuristics to pre-select the target vocabulary among which to find translations, thereby pushing each method to its limit. We show that all the approaches we tested have a clear bias toward frequent words. In fact, the best approach we tested could identify the translation of a third of a set of frequent test words, while it could only translate around 10% of rare words.
Disciplines: Computer science,
Literature and linguistics
Keywords (Spanish): Procesamiento de datos,
Lingüística aplicada,
Lingüística computacional,
Traducción automática,
Inducción de léxico
Keywords (English): Computer science,
Literature and linguistics,
Data processing,
Applied linguistics,
Computational linguistics,
Automatic translation,
Lexicon induction