More Effective Boilerplate Removal-the GoldMiner Algorithm

Endredy, Istvan; Novak, Attila


Document title:	More Effective Boilerplate Removal-the GoldMiner Algorithm
Journal:	Polibits
Database:	PERIÓDICA
System number:	000374546
ISSN:	1870-9044
Authors:	Endredy, Istvan¹; Novak, Attila¹
Institutions:	¹Pazmany Peter Catholic University, Faculty of Information Technology and Bionics, Budapest, Hungary
Year:	2013
Period:	Jul-Dec
Issue:	48
Pages:	79-83
Country:	Mexico
Language:	English
Document type:	Article
Approach:	Applied, descriptive
Abstract:	The ever-growing web is an important source for building large-scale corpora. However, dynamically generated web pages often contain much irrelevant and duplicated text, which impairs the quality of the corpus. To ensure the high quality of web-based corpora, a good boilerplate removal algorithm is needed to extract only the relevant content from web pages. In this article, we present GoldMiner, an automatic text extraction procedure that enhances a previously published boilerplate removal algorithm, minimizes the occurrence of irrelevant duplicated content in corpora, and keeps the text more coherent than previous tools do. The algorithm exploits similarities in the HTML structure of pages coming from the same domain. We also present a new evaluation document set, CleanPortalEval, which can be used to demonstrate the power of boilerplate removal algorithms on web portal pages.
Disciplines:	Computer science
Keywords:	Computer science, Data processing, Information technology, Text mining, Text analysis, Content extraction, Algorithms
