Character Embedding for Language Identification in Hindi-English Code-mixed Social Media Text

Veena, P. V; Anand Kumar, M; Soman, K. P


Document title:	Character Embedding for Language Identification in Hindi-English Code-mixed Social Media Text
Journal:	Computación y sistemas
Database:	PERIÓDICA
System number:	000423268
ISSN:	1405-5546
Authors:	Veena, P. V¹; Anand Kumar, M¹; Soman, K. P¹
Institutions:	¹Amrita University, Amrita School of Engineering, Coimbatore, Tamil Nadu, India
Year:	2018
Period:	Jan-Mar
Volume:	22
Issue:	1
Country:	Mexico
Language:	English
Document type:	Article
Approach:	Analytical, descriptive
English abstract:	Social media platforms are now widely used by people to express their opinions and interests. Earlier, users on social media wrote purely in English, but code-mixed text, i.e., text that mixes two or more languages, is now common. In code-mixed data, one language is often written in the script of another. Processing such text therefore requires identifying the language of each word. The main objective of this work is to propose a technique for identifying the language of Hindi-English code-mixed data from three social media platforms: Facebook, Twitter, and WhatsApp. Each word was classified as Hindi, English, Named Entity, Acronym, Universal, Mixed (Hindi along with English), or Undefined. Popular word embedding features were used to represent each word. Two kinds of embedding features were considered: word-based embedding features and character-based context features. The proposed method appends context information to the embedding features. A Support Vector Machine, a well-known machine learning classifier, was used to train and test the system. Using character-based embedding for language identification in code-mixed text is a novel approach and shows promising results.
Disciplines:	Literature and linguistics, Computer science
Keywords:	Applied linguistics, Language identification, Code mixing, Word embedding, Support vector machines, Context appending
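The abstract mentions appending context information to each word's embedding before classification. A minimal sketch of one way this could look, assuming a symmetric window of neighbouring words and zero padding at sentence boundaries. The window size, the padding scheme, the toy vectors, and the function name `append_context` are illustrative assumptions, not details taken from the paper.

```python
from typing import List

Vector = List[float]


def append_context(vectors: List[Vector], window: int = 1) -> List[Vector]:
    """Concatenate each word vector with the vectors of its neighbours.

    Hypothetical helper: positions that fall outside the sentence are
    padded with zero vectors (an assumption, not the paper's stated choice).
    """
    if not vectors:
        return []
    dim = len(vectors[0])
    zero = [0.0] * dim
    out: List[Vector] = []
    for i in range(len(vectors)):
        row: Vector = []
        # Walk from the leftmost neighbour to the rightmost neighbour.
        for j in range(i - window, i + window + 1):
            row.extend(vectors[j] if 0 <= j < len(vectors) else zero)
        out.append(row)
    return out


# Toy 2-dimensional vectors standing in for learned word or character embeddings.
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
features = append_context(toy, window=1)
```

Each row of `features` is three times the original dimension (previous word, current word, next word). Rows like these could then be passed, one per word, to a classifier such as a Support Vector Machine.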
