Improving the Boilerpipe Algorithm for Boilerplate Removal in News Articles Using HTML Tree Structure

Viveros Jiménez, Francisco; Sanchez Perez, Miguel A; Gómez Adorno, Helena; Posadas Durán, Juan Pablo; Sidorov, Grigori; Gelbukh, Alexander


Document title:	Improving the Boilerpipe Algorithm for Boilerplate Removal in News Articles Using HTML Tree Structure
Journal:	Computación y Sistemas
Database:
System number:	000560162
ISSN:	1405-5546
Authors:	Viveros Jiménez, Francisco¹; Sanchez Perez, Miguel A¹; Gómez Adorno, Helena¹; Posadas Durán, Juan Pablo²; Sidorov, Grigori¹; Gelbukh, Alexander¹
Institutions:	¹Instituto Politécnico Nacional, Centro de Investigación en Computación, Ciudad de México, México; ²Instituto Politécnico Nacional, Escuela Superior de Ingeniería Mecánica y Eléctrica, Ciudad de México, México
Year:	2018
Period:	Apr-Jun
Volume:	22
Issue:	2
Pages:	483-489
Country:	Mexico
Language:	English
Document type:	Article
English abstract:	It is well known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded at random every time a page is requested, so the HTML document differs between requests and hashing the content is not possible. Therefore, the non-relevant text of documents needs to be filtered out. Automatically extracting the relevant text from on-line text (news articles, etc.) is not a trivial task. Many algorithms for this purpose are described in the literature. One of the most popular is Boilerpipe, and its performance is among the best. In this paper, we present a method that improves the precision of the Boilerpipe algorithm by using the HTML tree to select the relevant content. Our filter greatly increases precision (by at least 15%) at the cost of some recall, resulting in an overall F1-measure improvement (around 5%). We run the experiments on news articles using our own corpus of 2,400 news articles in Spanish and 1,000 in English.
Disciplines:	Computer science
Keywords (Spanish):	Programación, Limpieza de datos, Precisión, Árbol HTML, Extracción de noticias, Boilerpipe, Algoritmos
Keywords (English):	Data cleaning, Precision, News extraction, HTML tree structure, Boilerpipe, Programming, Algorithms

