Output details
11 - Computer Science and Informatics
University of Aberdeen
An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems
This paper presents an empirical investigation into the validity of corpus-based evaluation metrics such as BLEU for evaluating Natural Language Generation (NLG) systems. It is the most careful and detailed such study yet performed, and it is helping to shape the NLG community’s perspective on corpus-based evaluation metrics, especially in the context of the Generation Challenges series of shared NLG tasks. In addition, the experimental design presented in the paper for human ratings-based evaluations of NLG systems has been adapted and used by other NLG researchers seeking a rigorous design for such evaluations.