Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining
There is an increasing focus on model-based dialog evaluation metrics such as ADEM, RUBER, and the more recent BERT-based metrics. These models aim to assign a high score to all relevant responses and a low score to all …
