How to Add a New Agent to CI/CD Evaluations
This guide walks you through the process of adding a new LLM agent to your automated evaluation CI/CD pipeline using GitHub Actions.
Prerequisites
Before you begin, make sure you have:
- A working LLM agent implementation integrated with Braintrust
- An evaluation dataset created in Braintrust for your agent
Step 1: Review Your Existing Workflow Structure
First, understand the structure of your current evaluation workflow. Your GitHub Actions workflow likely follows this pattern:
- Trigger conditions: When the workflow runs (PRs, pushes, etc.)
- Path filtering: Determines which agent evaluations to run based on changed files
- Job definitions: Separate jobs for each agent's evaluation
- Evaluation steps: The actual steps that run the evaluations
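For orientation, the overall shape looks roughly like the sketch below. The trigger configuration and the run-evals job name are illustrative assumptions; read them against your actual workflow file.
# Skeleton of .github/workflows/braintrust-eval.yaml -- names and triggers are illustrative
name: Braintrust Evals
on:
  pull_request:            # trigger conditions
  push:
    branches: [main]
jobs:
  detect-changes:          # path filtering: which agent files changed?
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # - flag changed agents with dorny/paths-filter (Step 2)
      # - build the dynamic job matrix from those flags (Step 3)
  run-evals:               # evaluation jobs, fanned out per changed agent
    needs: detect-changes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # - install dependencies and run each agent's eval script (Step 4)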
Step 2: Update the Path Filters
Modify your .github/workflows/braintrust-eval.yaml file to add path filters for your new agent:
jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      patient_matcher: ${{ steps.filter.outputs.patient_matcher }}
      protocol_parser: ${{ steps.filter.outputs.protocol_parser }}
      your_new_agent: ${{ steps.filter.outputs.your_new_agent }}  # Add this line
    steps:
      - uses: actions/checkout@v3
      - uses: dorny/paths-filter@v2
        id: filter
        with:
          filters: |
            patient_matcher:
              - 'trially_agents/trially_agents/evals/scores/patient_matcher.py'
              - 'trially_agents/trially_agents/prompts/patient_matcher.py'
              - 'trially_agents/trially_agents/patient_matcher.py'
            protocol_parser:
              - 'trially_agents/trially_agents/evals/scores/protocol_parser.py'
              - 'trially_agents/trially_agents/prompts/protocol_parser.py'
              - 'trially_agents/trially_agents/protocol_parser.py'
            your_new_agent:  # Add this section
              - 'trially_agents/trially_agents/evals/scores/your_new_agent.py'
              - 'trially_agents/trially_agents/prompts/your_new_agent.py'
              - 'trially_agents/trially_agents/your_new_agent.py'
This configuration ensures that your new agent's evaluation will only run when relevant files are changed.
Step 3: Add Your Agent to the Evaluation Matrix
The evaluation jobs are driven by a dynamic matrix, so instead of defining a separate job you add a branch for your new agent to the Set matrix step in the detect-changes job:
      - name: Set matrix
        id: set-matrix
        run: |
          # Build a dynamic matrix containing only the agents with changes
          MATRIX="{\"include\":["
          HAS_CHANGES=false
          if [[ "${{ steps.filter.outputs.patient_matcher }}" == "true" ]]; then
            MATRIX="${MATRIX}{\"agent\":\"patient_matcher\",\"has_eval_script\":true},"
            HAS_CHANGES=true
          fi
          if [[ "${{ steps.filter.outputs.protocol_parser }}" == "true" ]]; then
            MATRIX="${MATRIX}{\"agent\":\"protocol_parser\",\"has_eval_script\":true},"
            HAS_CHANGES=true
          fi
          # Add this block for your new agent
          if [[ "${{ steps.filter.outputs.your_new_agent }}" == "true" ]]; then
            MATRIX="${MATRIX}{\"agent\":\"your_new_agent\",\"has_eval_script\":true},"
            HAS_CHANGES=true
          fi
          # Remove the trailing comma if it exists
          MATRIX="${MATRIX%,}"
          MATRIX="${MATRIX}]}"
          echo "matrix=${MATRIX}" >> $GITHUB_OUTPUT
          echo "has_changes=${HAS_CHANGES}" >> $GITHUB_OUTPUT
Step 4: Create Your Eval Script
Create a new eval script at trially_agents/trially_agents/evals/your_new_agent/ci/your_new_agent_eval.py:
from braintrust import Eval, init_dataset

from trially_agents.base import BraintrustProjectName
from trially_agents.config.general import general_config
from trially_agents.prompts.your_new_agent import your_task_prompt
from trially_agents.evals.scores.your_new_agent import (
    # Import your scoring functions here, for example:
    accuracy_scorer,
    quality_judge,
    completeness_scorer,
)

project_name = BraintrustProjectName.YOUR_NEW_AGENT.from_environment(
    general_config.environment
)
experiment_name = "your_task_name"

Eval(
    project_name,
    experiment_name=experiment_name,
    data=init_dataset(project_name, "your-task-dataset-v1"),
    task=your_task_prompt.invoke,
    scores=[
        # Add your scoring functions here
        accuracy_scorer,
        quality_judge,
        completeness_scorer,
    ],
)
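If your evaluation job invokes each agent's script by path, the new script slots into the same pattern. Below is a minimal sketch of such a step, assuming the Braintrust CLI (braintrust eval) is how your other agents are run and that BRAINTRUST_API_KEY is already configured as a repository secret; adjust it to match your existing job.
      - name: Run Braintrust eval for ${{ matrix.agent }}
        env:
          BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
        run: |
          # Assumes eval scripts follow the evals/<agent>/ci/<agent>_eval.py convention
          braintrust eval "trially_agents/trially_agents/evals/${{ matrix.agent }}/ci/${{ matrix.agent }}_eval.py"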
Step 5: Test Your Workflow
Test your workflow by making changes to your new agent's files:
- Create a new branch: git checkout -b test-new-agent-eval
- Make a small change to one of your agent files
- Commit and push the change
- Open a PR against your main branch
- Check that the GitHub Actions workflow is triggered and runs your agent's evaluation
Step 6: Review the PR Comment
When the workflow completes, you should see a comment on your PR with:
- A summary of the evaluation results
- Comparison with the baseline (if available)
- Links to the detailed results in Braintrust
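Most setups post this comment from the evaluation job itself. If you need to wire it up for the new agent, here is a minimal sketch using actions/github-script; the comment body is a placeholder, and your workflow likely builds a richer summary from the eval output:
      - name: Comment eval summary on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: "Braintrust eval finished for ${{ matrix.agent }}. See the experiment link in the job logs.",
            });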
Troubleshooting
Common Issues
- Job not running: Check that your path filters match the actual file paths of your agent
- Script errors: Look at the job logs for Python errors
Best Practices
- Maintain Path Filters: Update path filters when you reorganize code
- Version Your Datasets: When making significant changes to your agent, create a new dataset version
- Review Performance Changes: Always carefully review performance changes before merging PRs
- Document Thresholds: Keep documentation of your metric thresholds and what they mean
- Monitor Job Duration: Keep an eye on workflow duration and optimize if evaluations take too long