How to Add a New Agent to CI/CD Evaluations

This guide walks you through the process of adding a new LLM agent to your automated evaluation CI/CD pipeline using GitHub Actions.

Prerequisites

Before you begin, make sure you have:

  1. A working LLM agent implementation integrated with Braintrust
  2. An evaluation dataset created in Braintrust for your agent

Step 1: Review Your Existing Workflow Structure

First, understand the structure of your current evaluation workflow. Your GitHub Actions workflow likely follows this pattern:

  1. Trigger conditions: When the workflow runs (PRs, pushes, etc.)
  2. Path filtering: Determines which agent evaluations to run based on changed files
  3. Job definitions: Separate jobs for each agent's evaluation
  4. Evaluation steps: The actual steps that run the evaluations
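For orientation, a heavily stripped-down skeleton of such a workflow might look like the following. The trigger, job names, and echo placeholder are illustrative; your real workflow will have more steps:

name: Braintrust Evals

on:
  pull_request:
    branches: [main]

jobs:
  # Path filtering: decide which agent evaluations need to run
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      patient_matcher: ${{ steps.filter.outputs.patient_matcher }}
    steps:
      - uses: actions/checkout@v3
      - uses: dorny/paths-filter@v2
        id: filter
        with:
          filters: |
            patient_matcher:
              - 'trially_agents/**'

  # Job definition: one evaluation job per agent (or a matrix, as in Step 3)
  eval-patient-matcher:
    needs: detect-changes
    if: needs.detect-changes.outputs.patient_matcher == 'true'
    runs-on: ubuntu-latest
    steps:
      # Evaluation steps: check out code, install dependencies, run the eval script
      - uses: actions/checkout@v3
      - run: echo "run the patient_matcher eval here"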

Step 2: Update the Path Filters

Modify your .github/workflows/braintrust-eval.yaml file to add path filters for your new agent:

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      patient_matcher: ${{ steps.filter.outputs.patient_matcher }}
      protocol_parser: ${{ steps.filter.outputs.protocol_parser }}
      your_new_agent: ${{ steps.filter.outputs.your_new_agent }}  # Add this line
    steps:
      - uses: actions/checkout@v3
      - uses: dorny/paths-filter@v2
        id: filter
        with:
          filters: |
            patient_matcher:
              - 'trially_agents/trially_agents/evals/scores/patient_matcher.py'
              - 'trially_agents/trially_agents/prompts/patient_matcher.py'
              - 'trially_agents/trially_agents/patient_matcher.py'
            protocol_parser:
              - 'trially_agents/trially_agents/evals/scores/protocol_parser.py'
              - 'trially_agents/trially_agents/prompts/protocol_parser.py'
              - 'trially_agents/trially_agents/protocol_parser.py'
            your_new_agent:  # Add this section
              - 'trially_agents/trially_agents/evals/scores/your_new_agent.py'
              - 'trially_agents/trially_agents/prompts/your_new_agent.py'
              - 'trially_agents/trially_agents/your_new_agent.py'

This configuration ensures that your new agent's evaluation will only run when relevant files are changed.
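If your agent's code spans more files than the three listed above, dorny/paths-filter also accepts glob patterns, so a broader (hypothetical) filter entry could be written as:

            your_new_agent:
              - 'trially_agents/trially_agents/**/your_new_agent*.py'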

Step 3: Add a New Evaluation Job

Next, register your new agent in the dynamic matrix that determines which evaluation jobs run. In the Set matrix step (which sits in the detect-changes job, where the filter outputs are available), add a block for your agent:

  - name: Set matrix
    id: set-matrix
    run: |
      # Build dynamic matrix based on which agents have changes
      MATRIX="{\"include\":["
      HAS_CHANGES=false

      if [[ "${{ steps.filter.outputs.patient_matcher }}" == "true" ]]; then
        MATRIX="${MATRIX}{\"agent\":\"patient_matcher\",\"has_eval_script\":true},"
        HAS_CHANGES=true
      fi

      if [[ "${{ steps.filter.outputs.protocol_parser }}" == "true" ]]; then
        MATRIX="${MATRIX}{\"agent\":\"protocol_parser\",\"has_eval_script\":true},"
        HAS_CHANGES=true
      fi

      if [[ "${{ steps.filter.outputs.your_new_agent }}" == "true" ]]; then
        MATRIX="${MATRIX}{\"agent\":\"your_new_agent\",\"has_eval_script\":true},"
        HAS_CHANGES=true
      fi

      # Remove the trailing comma, if present
      MATRIX="${MATRIX%,}"

      MATRIX="${MATRIX}]}"
      echo "matrix=${MATRIX}" >> "$GITHUB_OUTPUT"
      echo "has_changes=${HAS_CHANGES}" >> "$GITHUB_OUTPUT"

Step 4: Create Your Eval Script

Create a new eval script at trially_agents/trially_agents/evals/your_new_agent/ci/your_new_agent_eval.py:

from braintrust import Eval, init_dataset
from trially_agents.base import BraintrustProjectName
from trially_agents.config.general import general_config
from trially_agents.prompts.your_new_agent import your_task_prompt
from trially_agents.evals.scores.your_new_agent import (
    # Import your scoring functions here, for example:
    accuracy_scorer,
    quality_judge,
    completeness_scorer,
)

project_name = BraintrustProjectName.YOUR_NEW_AGENT.from_environment(
    general_config.environment
)
experiment_name = "your_task_name"

Eval(
    project_name,
    experiment_name=experiment_name,
    data=init_dataset(project_name, "your-task-dataset-v1"),
    task=your_task_prompt.invoke,
    scores=[
        # Add your scoring functions here
        accuracy_scorer,
        quality_judge,
        completeness_scorer,
    ],
)
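Before wiring this into CI, it can be worth running the script locally (for example with the Braintrust CLI, braintrust eval <path-to-your-script>, with BRAINTRUST_API_KEY set) to confirm that the dataset name, prompt, and scorers resolve correctly.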

Step 5: Test Your Workflow

Test your workflow by making changes to your new agent's files:

  1. Create a new branch: git checkout -b test-new-agent-eval
  2. Make a small change to one of your agent files
  3. Commit and push:
    git add .
    git commit -m "Test new agent evaluation"
    git push origin test-new-agent-eval
    
  4. Create a PR to your main branch
  5. Check that the GitHub Actions workflow is triggered and runs your agent's evaluation

Step 6: Review the PR Comment

When the workflow completes, you should see a comment on your PR with:

  1. A summary of the evaluation results
  2. Comparison with the baseline (if available)
  3. Links to the detailed results in Braintrust
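How this comment gets posted depends on your workflow; some setups rely on Braintrust's own CI tooling, others post it themselves. If you need to add such a step, a rough sketch using actions/github-script could look like this (the comment body is a placeholder; in practice you would build it from the experiment summary produced by the evaluation step):

      - name: Comment evaluation summary on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            // Placeholder body; replace with the real evaluation summary
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: 'Braintrust evaluation for ${{ matrix.agent }} finished. See the job logs for the experiment link.',
            });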

Troubleshooting

Common Issues

  1. Job not running: Check that your path filters match the actual file paths of your agent
  2. Script errors: Look at the job logs for Python errors
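For the first issue, a quick way to see what the filter actually detected is to drop a temporary debug step into the detect-changes job, right after the paths-filter step:

      - name: Debug filter outputs
        run: |
          echo "patient_matcher changed: ${{ steps.filter.outputs.patient_matcher }}"
          echo "your_new_agent changed: ${{ steps.filter.outputs.your_new_agent }}"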

Best Practices

  1. Maintain Path Filters: Update path filters when you reorganize code
  2. Version Your Datasets: When making significant changes to your agent, create a new dataset version
  3. Review Performance Changes: Always carefully review performance changes before merging PRs
  4. Document Thresholds: Keep documentation of your metric thresholds and what they mean
  5. Monitor Job Duration: Keep an eye on workflow duration and optimize if evaluations take too long