Pre-Processing of English-Hindi Corpus for Statistical Machine Translation



Document title: Pre-Processing of English-Hindi Corpus for Statistical Machine Translation
Journal: Computación y Sistemas
Database: PERIÓDICA
System number: 000423295
ISSN: 1405-5546
Authors: 1
2
Institutions: 1 Centre for Development of Advanced Computing, Noida, Uttar Pradesh, India
2 KIIT Group of Institutions, Bhondsi, Gurugram, India
Year:
Period: Oct-Dec
Volume: 21
Issue: 4
Country: Mexico
Language: English
Document type: Article
Approach: Applied, descriptive
English abstract: A corpus may be considered the fuel for data-driven approaches to machine translation. Building a parallel corpus is a labour-intensive task, which makes it a costly and scarce resource. The full potential of the available data needs to be exploited, and this can be ensured by removing the different types of inconsistencies encountered throughout the NLP domain. This paper describes experiments carried out on corpus text pre-processing for building a baseline Statistical Machine Translation (SMT) system. The text pre-processing is divided into two stages: (i) the first relates to handling the orthographic representation of content, and (ii) the second relates to handling non-lexical words. The first stage covers punctuation symbols, casing, and word spellings and their normalization, while the second stage covers the handling of numbers and named entities (NEs), applied on the best settings observed in the first stage. The motivation behind these experiments was to derive a relationship and gauge the extent of pre-processing the corpus, thereby building a considerably optimized baseline SMT system. This baseline system would provide a platform for further experiments with different syntactic and semantic factors in the future. The findings presented here are for the English-Hindi language pair; however, the concept of pre-processing is language-neutral and can be extended to any other language pair. The best performance for English-to-Hindi translation is reported when punctuation symbols are retained, the English corpus is lower-cased, and the Hindi corpus is spell-normalized. In addition, the second stage of experiments describes the handling of numbers and named entities, in which these are mapped to unique class labels. The impact of these experiments is explained along with their appropriateness for the language pair concerned.
Disciplines: Computer science,
Literature and linguistics
Keywords: Applied linguistics,
Data processing,
Machine translation,
Preprocessing,
Normalization,
Named entity recognition
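Illustrative sketch: the abstract reports that lower-casing the English side while retaining punctuation worked best, and that numbers are mapped to class labels in the second stage. The Python fragment below is a minimal sketch of those two steps only, not the authors' implementation. The label string `<NUM>`, the function names, and the number pattern are assumptions made for illustration; the record does not state the labels or rules actually used. Hindi spelling normalization and named-entity handling are not shown because their rules are not given here.

```python
import re

# Hypothetical class label; the abstract does not name the labels actually used.
NUM_LABEL = "<NUM>"

# Assumed number pattern: runs of ASCII or Devanagari digits (U+0966 to U+096F),
# optionally joined by "." or "," as in 1,250.50.
_DIGIT = r"[0-9\u0966-\u096F]"
_NUMBER = re.compile(rf"{_DIGIT}+(?:[.,]{_DIGIT}+)*")


def preprocess_english(sentence: str) -> str:
    """Lower-case the sentence, keep punctuation, and replace numbers with NUM_LABEL."""
    lowered = sentence.lower()
    # Substitute after lower-casing so the label itself keeps its casing.
    return _NUMBER.sub(NUM_LABEL, lowered)


def preprocess_hindi(sentence: str) -> str:
    """Replace numbers with NUM_LABEL; Devanagari has no letter case to fold."""
    return _NUMBER.sub(NUM_LABEL, sentence)


if __name__ == "__main__":
    print(preprocess_english("The Fare Is 1,250.50 Rupees."))
    print(preprocess_hindi("किराया १२५० रुपये है।"))
```

Using the same label on both sides of the parallel corpus means a number in the source sentence and its counterpart in the target sentence collapse to one shared token, which is the usual reason for class-based handling of non-lexical words.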