Can We Trust AI Quality Scores?

A multi-model evaluation of RECEIPT's usefulness assessment with Constitutional AI self-critique.

Evaluation Harness

Run the eval to generate a multi-model quality-scoring report that includes Constitutional AI self-critique.
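
The harness itself is not shown here. As a rough illustration of what comparing quality scores across models can involve, the Python sketch below flags items on which the models disagree. Everything in it is an assumption made for illustration: the function name `summarize_scores`, the input shape, the model and item names, and the threshold. It does not reproduce RECEIPT's scoring or the Constitutional AI self-critique step.

```python
from statistics import mean, pstdev


def summarize_scores(scores_by_model, disagreement_threshold=1.0):
    """Summarize per-item scores from several models (hypothetical helper).

    scores_by_model maps a model name to {item_id: numeric score}.
    Only items scored by every model are summarized.
    """
    item_sets = [set(scores) for scores in scores_by_model.values()]
    shared_items = set.intersection(*item_sets) if item_sets else set()

    report = {}
    for item_id in sorted(shared_items):
        values = [scores_by_model[model][item_id] for model in scores_by_model]
        spread = pstdev(values)  # population standard deviation across models
        report[item_id] = {
            "mean": mean(values),
            "spread": spread,
            # A large spread means the models disagree, so the mean is less trustworthy.
            "disputed": spread > disagreement_threshold,
        }
    return report


# Made-up scores on an assumed 1-5 scale; the model and item names are placeholders.
example_scores = {
    "model_a": {"item_1": 4, "item_2": 1},
    "model_b": {"item_1": 4, "item_2": 5},
    "model_c": {"item_1": 5, "item_2": 3},
}

report = summarize_scores(example_scores)
for item_id, summary in report.items():
    print(item_id, summary)
```

In this made-up data the models roughly agree on `item_1` and diverge on `item_2`, so only `item_2` is marked as disputed. A threshold on the spread is one simple choice; a real harness might prefer a rank correlation or a formal agreement statistic.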