Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model



Document title: Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model
Journal: Computación y sistemas
Database: PERIÓDICA
System number: 000379430
ISSN: 1405-5546
Authors: 1
1
1
2
Institutions: 1 Instituto Politécnico Nacional, Centro de Investigación en Computación, México, Distrito Federal. Mexico
2 Benemérita Universidad Autónoma de Puebla, Facultad de Ciencias de la Computación, Puebla. Mexico
Year:
Period: Jul-Sep
Volume: 18
Issue: 3
Pages: 491-504
Country: Mexico
Language: English
Document type: Article
Approach: Analytical, descriptive
Abstract: We show how to take similarity between features into account when calculating the similarity of objects in the Vector Space Model (VSM), for machine learning algorithms and other classes of methods that involve similarity between objects. Unlike LSA, we assume that the similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed similarity measure soft similarity. Similarity between features is common, for example, in natural language processing: words, n-grams, or syntactic n-grams can be somewhat different (which makes them different features) yet still have much in common; for example, the words "play" and "game" are different but related. When there is no similarity between features, our soft similarity measure is equal to the standard similarity. To this end, we generalize the well-known cosine similarity measure in VSM by introducing what we call the "soft cosine measure". We propose various formulas for exact or approximate calculation of the soft cosine measure. For example, in one of them we consider for VSM a new feature space consisting of pairs of the original features weighted by their similarity. Again, for features that bear no similarity to each other, our formulas reduce to the standard cosine measure. Our experiments show that our soft cosine measure provides better performance in our case study: the entrance exams question answering task at CLEF. In these experiments, we use syntactic n-grams as features and Levenshtein distance as the similarity between n-grams, measured either in characters or in elements of n-grams.
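The abstract does not state the formula itself, so the following is only a minimal sketch of one common way to write a soft cosine: the dot products of the standard cosine are replaced by bilinear forms over a feature similarity matrix, so that an identity matrix (no similarity between distinct features) gives back the standard cosine. The function names, the example features, and in particular the normalization of Levenshtein distance into a similarity (1 minus distance divided by the longer length) are assumptions made for illustration and are not taken from the record. The sketch also assumes non-negative feature weights and similarities.

```python
from math import sqrt


def levenshtein(s, t):
    """Edit distance between two sequences (characters or n-gram elements)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def feature_similarity(f, g):
    """Assumed normalization (hypothetical): 1 - distance / longer length."""
    longest = max(len(f), len(g))
    return 1.0 if longest == 0 else 1.0 - levenshtein(f, g) / longest


def soft_cosine(a, b, sim):
    """Cosine of a and b with dot products weighted by the matrix sim."""
    def form(x, y):
        return sum(sim[i][j] * x[i] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    denom = sqrt(form(a, a)) * sqrt(form(b, b))
    return form(a, b) / denom if denom else 0.0


# Illustrative features; character-level distance is used here.
features = ["player", "played", "game"]
n = len(features)
sim = [[feature_similarity(f, g) for g in features] for f in features]
identity = [[float(i == j) for j in range(n)] for i in range(n)]

a = [1.0, 0.0, 0.0]  # object containing only "player"
b = [0.0, 1.0, 0.0]  # object containing only "played"

hard = soft_cosine(a, b, identity)  # identity matrix: standard cosine
soft = soft_cosine(a, b, sim)       # similar features now contribute
print(hard, soft)
```

With the identity matrix the two objects share no feature and the measure is zero; with the Levenshtein-based matrix the near-identical features "player" and "played" make the objects similar. Passing tuples of n-gram elements instead of strings to `levenshtein` would measure distance in elements rather than characters.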
Disciplines: Computer science,
Literature and linguistics
Keywords: Computer science,
Literature and linguistics,
Artificial intelligence,
Applied linguistics,
Soft similarity,
Vector space model,
Cosine,
Data mining