Tag: semantic metrics
Beyond BLEU and ROUGE: Semantic Metrics for LLM Output Quality
Traditional metrics like BLEU and ROUGE are a poor fit for evaluating modern LLM output: they score n-gram overlap with a reference text, so they penalize valid paraphrases. Semantic metrics like BERTScore and BLEURT compare meaning rather than surface wording and correlate better with human judgment, at a higher computational cost.
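
To make the paraphrase penalty concrete, here is a minimal sketch of a unigram-overlap F1 score. It is a simplified stand-in in the spirit of ROUGE-1, not the official BLEU or ROUGE implementation, and the function name `unigram_f1` and the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 based on clipped unigram overlap.

    Illustrative only: whitespace tokenization, lowercasing, no stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences.
reference = "the cat sat on the mat"
verbatim = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"

verbatim_score = unigram_f1(verbatim, reference)
paraphrase_score = unigram_f1(paraphrase, reference)

print(f"verbatim:   {verbatim_score:.3f}")
print(f"paraphrase: {paraphrase_score:.3f}")
```

The verbatim copy scores 1.0, while the paraphrase shares only the token "the" with the reference and scores about 0.167, even though a human reader would likely judge the two sentences close in meaning. An embedding-based metric is designed to narrow that gap, which is where the extra compute goes.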