Improving Corpus Annotation Quality Using Word Embedding Models



Document title: Improving Corpus Annotation Quality Using Word Embedding Models
Journal: Polibits
Database: PERIÓDICA
System number: 000402943
ISSN: 1870-9044
Authors: 1
Institutions: 1 Pazmany Peter Catholic University, Faculty of Information Technology and Bionics, Budapest, Hungary
Year:
Period: Jan-Jun
Issue: 53
Pages: 49-53
Country: Mexico
Language: English
Document type: Article
Approach: Analytical
Abstract (English): Web-crawled corpora contain a significant amount of noise. Automatic corpus annotation tools introduce even more noise by performing erroneous language identification or encoding detection, introducing tokenization and lemmatization errors, and adding erroneous tags or analyses to the original words. Our goal with the methods presented in this article was to use word embedding models to reveal such errors and to provide correction procedures. The evaluation focuses on analyzing and validating noun compounds, identifying bogus compound analyses, recognizing and concatenating fragmented words, detecting erroneously encoded text, restoring accents, and handling the combination of these errors in a Hungarian web-crawled corpus.
Disciplines: Computer science,
Literature and linguistics
Keywords: Artificial intelligence,
Data processing,
Applied linguistics,
Natural language processing,
Computational linguistics,
Vector space model