A New Proposal for Evaluating Web Page Cleaning Tools



Document title: A New Proposal for Evaluating Web Page Cleaning Tools
Journal: Computación y Sistemas
Database:
System number: 000560382
ISSN: 1405-5546
Authors: 1
2
Institutions: 1 Sorbonne University, Paris, France
2 Paris XIII University, Villetaneuse, France
Year:
Period: Oct-Dec
Volume: 22
Issue: 4
Pages: 1249-1258
Country: Mexico
Language: English
Abstract: In this article, we address the evaluation of web content extraction tools. This task is seldom studied in the literature, although it has important consequences for the linguistic processing of web-based corpora. We compare two types of evaluation. The first is an intrinsic (content-based) evaluation carried out in a multilingual setting (five languages). The second is an extrinsic (task-based) evaluation on the same corpus, which studies the effect of the cleaning step on the performance of an NLP pipeline. We show that the intrinsic evaluation results are not consistent with the extrinsic evaluation results, and that the results differ greatly across the languages studied. We conclude that a web page cleaning tool should be chosen with respect to the task at hand rather than the performance observed under the intrinsic evaluation scheme.
Disciplines: Computer science
Keywords (from Spanish): Artificial intelligence
Keyword: Corpus,
Multilingual corpora,
Web content extraction,
Web page cleaning,
Evaluation,
Classification,
Artificial intelligence