AI Feedback Evaluation

Spellbook offers LLM-assisted feedback for evaluating your app variants, assigning a score and a detailed explanation to each row of data it evaluates. It is a great way to quickly evaluate generative models without human labelers in the loop. This approach is based on Vicuna's evaluation pipeline, which you can read more about here.
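
Conceptually, the pipeline resembles the LLM-as-judge loop sketched below. This is a minimal illustration, not Spellbook's actual implementation; `run_variant` and `call_judge_llm` are hypothetical placeholders for your app variant and the evaluator model.

```python
# Minimal sketch of an LLM-as-judge loop. run_variant and call_judge_llm
# are hypothetical placeholders, not part of Spellbook's API.

def run_variant(prompt: str) -> str:
    """Run the app variant under test; swap in your model call."""
    raise NotImplementedError

def call_judge_llm(judge_prompt: str) -> str:
    """Query the evaluator LLM; swap in your model call."""
    raise NotImplementedError

def evaluate_row(prompt: str, criteria: str) -> tuple[int, str]:
    """Score one dataset row and explain the score."""
    output = run_variant(prompt)
    judge_prompt = (
        "Rate the response below from 1 to 10 against these criteria:\n"
        f"{criteria}\n\nResponse:\n{output}\n\n"
        "Reply exactly as 'score: <n>\\nexplanation: <why>'."
    )
    reply = call_judge_llm(judge_prompt)
    # Parsing assumes the judge followed the requested reply format.
    score_line, explanation_line = reply.split("\n", 1)
    score = int(score_line.removeprefix("score:").strip())
    explanation = explanation_line.removeprefix("explanation:").strip()
    return score, explanation
```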

After selecting the variant and dataset to evaluate, you will be taken to a more detailed configuration page. Here, you can set the specific criteria to evaluate your app against. These criteria will guide the LLM in its evaluation of your app variant.
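
Criteria are typically plain-text instructions. Building on the sketch above, you might pass them like this; the criteria text and prompt are illustrative only:

```python
# Illustrative criteria only; write whatever fits your app.
criteria = (
    "- The answer is factually accurate.\n"
    "- The tone is concise and friendly.\n"
    "- The response stays on topic."
)
score, explanation = evaluate_row("Summarize our refund policy.", criteria)
```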

Before starting the evaluation, you can preview the effect of your custom criteria on the results. Clicking Preview AI Feedback runs inference with your variant on a random subset of your dataset, then runs evaluation on those outputs using the criteria you defined. For each data row, you will see your variant's output, the score the LLM evaluator assigned to it, and an explanation of how that score was assigned.
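
A rough equivalent of that preview step, reusing `evaluate_row` from the sketch above (the sample size and printed format are assumptions, not Spellbook's):

```python
import random

def preview_ai_feedback(dataset: list[str], criteria: str, sample_size: int = 5) -> None:
    """Evaluate a random subset of rows before committing to a full run."""
    for prompt in random.sample(dataset, min(sample_size, len(dataset))):
        score, explanation = evaluate_row(prompt, criteria)
        print(f"{prompt!r} -> score {score}: {explanation}")
```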

After the evaluation finishes, you can download the full set of results, including scores and explanations, with the Download Full Results button.
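
If the download is a CSV, you could triage it with a few lines like the following; the filename and column names here are assumptions, so check the header of your actual file:

```python
import csv

# Filename and column names are assumptions; check your downloaded file's header.
with open("evaluation_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        if int(row["score"]) < 5:  # surface the low-scoring rows first
            print(row["output"], "->", row["explanation"])
```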