Automatically assess the quality of your AI outputs using Basalt Evaluations.
Evaluations allow you to automatically assess the quality and characteristics of your AI outputs. By integrating evaluators into your workflows, you can monitor for issues, gather metrics, and ensure your AI-generated content meets your standards.
Basalt’s evaluation system works by attaching evaluators to your traces and generations. These evaluators run automatically when generations complete, analyzing the input and output to produce objective metrics about the content.
Evaluations help you answer questions like: “Is this content factually accurate?”, “Does it contain harmful content?”, “Is it relevant to the original query?”, or any other quality metrics important to your application.
To use evaluations in Basalt, you create your own evaluators in the Basalt app, designing each one to measure the aspects of your AI-generated content that matter for your use case. Evaluators are created and managed through the app; once created, they can be referenced in your code by their slug.
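As a rough illustration, attaching evaluators by slug when creating a trace might look like the sketch below. This is a minimal sketch, not the confirmed SDK surface: the package name, the createTrace / createGeneration calls, the evaluators option, and the evaluator slugs are all assumptions.

```typescript
// Minimal sketch — package name, method names (createTrace, createGeneration)
// and the evaluators option are assumptions, not the confirmed Basalt SDK API.
import { Basalt } from "@basalt-ai/sdk"; // assumed package name

const basalt = new Basalt({ apiKey: process.env.BASALT_API_KEY ?? "" });

// Create a trace and attach evaluators by their slugs (hypothetical slugs).
const trace = basalt.monitor.createTrace("support-answer", {
  evaluators: [{ slug: "factual-accuracy" }, { slug: "harmful-content" }],
});

const generation = trace.createGeneration({
  name: "draft-answer",
  input: "How do I reset my password?",
});

// ... call your LLM here ...
generation.end("You can reset it from the account settings page.");
trace.end();
```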
The sampleRate parameter controls how often evaluations are run:
Value range: 0.0 to 1.0 (0% to 100%)
Default value: 0.1 (10%)
Sample rates allow you to balance evaluation coverage with cost efficiency. For example:
sampleRate: 1.0 - Evaluate every trace (100%)
sampleRate: 0.5 - Evaluate approximately half of all traces (50%)
sampleRate: 0.1 - Evaluate approximately one in ten traces (10%)
sampleRate: 0.01 - Evaluate approximately one in a hundred traces (1%)
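If the sample rate is set where evaluators are attached (an assumption; the exact option name and placement may differ), the configuration could look like the following, reusing the hypothetical client from the sketch above:

```typescript
// Hypothetical: sampleRate set alongside the evaluators on trace creation.
const sampledTrace = basalt.monitor.createTrace("support-answer", {
  evaluators: [{ slug: "relevance" }],
  // Evaluate roughly one in ten traces; raise toward 1.0 for critical paths.
  sampleRate: 0.1,
});
```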
Sampling is applied at the trace level. When a trace is selected for evaluation, all evaluators assigned to that trace and its generations run. This gives you complete evaluation data for the sampled traces rather than partial data scattered across all traces.
In experiment mode (when a trace is attached to an experiment), the sample rate is always set to 1.0 (100%) regardless of the configured value, ensuring complete evaluation coverage for experimental workflows.
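As an illustration, if a trace is attached to an experiment (the experiment option and its shape below are assumptions), any configured sampleRate would effectively be treated as 1.0:

```typescript
// Hypothetical sketch: a trace attached to an experiment is always evaluated,
// so the configured sampleRate is effectively overridden to 1.0.
const experimentTrace = basalt.monitor.createTrace("support-answer", {
  evaluators: [{ slug: "relevance" }],
  sampleRate: 0.1, // ignored while the trace belongs to an experiment
  experiment: { feature: "support-answer", name: "prompt-v2" }, // shape assumed
});
```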
Start with a baseline: Begin with a modest set of evaluators to establish baseline metrics before adding more.
Choose appropriate sample rates: High-volume applications may only need a small sample rate (1-10%), while critical applications might warrant higher rates (50-100%).
Combine evaluators strategically: Different evaluators measure different aspects of quality; use combinations that address your specific concerns.
Use contextual evaluators: Different content types may need different evaluations; tailor your evaluator selection to the specific content.