External Sandhi and its Relevance to Syntactic Treebanking

Kolachina, Sudheer; Sharma, Dipti Misra; Gadde, Phani; Vijay, Meher; Sangal, Rajeev; Bharati, Akshar


Document title:	External Sandhi and its Relevance to Syntactic Treebanking
Journal:	Polibits
Database:	PERIÓDICA
System number:	000359039
ISSN:	1870-9044
Authors:	Kolachina, Sudheer¹ Sharma, Dipti Misra¹ Gadde, Phani¹ Vijay, Meher¹ Sangal, Rajeev¹ Bharati, Akshar¹
Institutions:	¹Language Technologies Research Centre, Hyderabad, Andhra Pradesh, India
Year:	2011
Period:	Jan-Jun
Issue:	43
Country:	Mexico
Language:	English
Document type:	Article
Approach:	Analytical, descriptive
Abstract:	External sandhi is a linguistic phenomenon which refers to a set of sound changes that occur at word boundaries. These changes are similar to phonological processes such as assimilation and fusion when they apply at the level of prosody, such as in connected speech. External sandhi formation can be orthographically reflected in some languages. In such languages, external sandhi formation causes the occurrence of forms which are morphologically unanalyzable, thus posing a problem for all kinds of NLP applications. In this paper, we discuss the implications that this phenomenon has for the syntactic annotation of sentences in Telugu, an Indian language with agglutinative morphology. We describe in detail how external sandhi formation in Telugu, if not handled prior to dependency annotation, leads either to loss or misrepresentation of syntactic information in the treebank. This phenomenon, we argue, necessitates the introduction of a sandhi splitting stage in the generic annotation pipeline currently being followed for the treebanking of Indian languages. We identify one type of external sandhi widely occurring in the previous version of the Telugu treebank (version 0.2) and manually split all its instances, leading to the development of a new version, 0.5. We also conduct an experiment with a statistical parser to empirically verify the usefulness of the changes made to the treebank. Comparing the parsing accuracies obtained on versions 0.2 and 0.5 of the treebank, we observe that splitting even just one type of external sandhi leads to an increase in the overall parsing accuracies.
Disciplines:	Computer science, Literature and linguistics
Keywords:	Data processing, Applied linguistics, Computational linguistics, Phonology, Syntactic treebank