How to Debug LLM Agents Using Braintrust

This guide explains how to effectively debug LLM agents using Braintrust's tracing and monitoring capabilities.

When to Use Braintrust for Debugging

Use Braintrust when:

  • You notice performance issues in evaluation metrics
  • You need to understand why a specific test case is failing
  • You want to trace the full lifecycle of an LLM interaction
  • You need to analyze token usage patterns or latency issues
  • You're comparing behavior between different agent versions

Prerequisites

  • Braintrust account with access to your project
  • BRAINTRUST_API_KEY environment variable set
  • Basic familiarity with the LLM agent architecture
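
As a quick sanity check, confirm the key is visible to the process that will run the evaluation. A minimal sketch using only the Python standard library:

import os

# Fail fast if the API key is missing (the env var name comes from the prerequisites above)
if not os.environ.get("BRAINTRUST_API_KEY"):
    raise SystemExit("BRAINTRUST_API_KEY is not set; export it before running evaluations")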

Accessing Traces

Step 1: Run Evaluation with Tracing Enabled

To debug an agent, first run an evaluation with tracing enabled:

# For running a specific evaluation with tracing enabled
export BRAINTRUST_TRACING=true
python -m trially_agents.patient_matcher
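
If you prefer to drive the evaluation from Python, the Braintrust SDK exposes an Eval entry point. The sketch below is illustrative only: the dataset, agent function, and scorer are placeholders, not the actual patient_matcher wiring.

from braintrust import Eval
from autoevals import Levenshtein  # stock string-similarity scorer from Braintrust's autoevals package

def my_agent(query: str) -> str:
    # Hypothetical stand-in for your agent's entry point
    return "example answer"

Eval(
    "patient_matcher_development",  # your Braintrust project name
    data=lambda: [{"input": "example query", "expected": "example answer"}],
    task=my_agent,
    scores=[Levenshtein],
)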

Step 2: Navigate to Experiment Results

  1. Visit the Braintrust dashboard: https://www.braintrust.dev
  2. Select your project (e.g., patient_matcher_development)
  3. Find your experiment in the list and click on it

Step 3: View Experiment Details

In the experiment view, you'll see:

  1. Overall metrics and scores
  2. A list of test cases with inputs, outputs, and scores
  3. Trace information for each test case

Step 4: Analyze Trace Data

For a specific test case, go to the "Logs" tab:

  1. Click on the test case row to expand it
  2. Look for the "Trace" tab in the expanded view
  3. The "Trace" tab shows every span from the agent's LLM API calls, with detailed information for each
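
Spans only appear for code that is instrumented, so if a step is missing from the trace it may simply not be wrapped. A minimal sketch of Braintrust's traced decorator, with a hypothetical agent step:

from braintrust import init_logger, traced

init_logger(project="patient_matcher_development")  # spans are sent to this project

@traced  # each call to this function is recorded as a span
def rank_candidates(query: str) -> list[str]:
    # Hypothetical agent step; nested traced calls show up as child spans
    return ["candidate-1", "candidate-2"]

rank_candidates("example query")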

Understanding Span Information

Each span contains:

  • Input: the prompt or messages sent to the LLM
  • Output: the response returned by the LLM
  • Expected: the expected output for the test case, if one was provided
  • Metadata: contextual information about the request (for example, model and parameters)
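
These fields map directly onto what the SDK records. A minimal sketch of logging them by hand with start_span (the span name and values are illustrative):

from braintrust import init_logger, start_span

init_logger(project="patient_matcher_development")

with start_span(name="llm_call") as span:
    output = "matched patient 123"  # stand-in for a real LLM response
    span.log(
        input={"query": "find eligible patients"},
        output=output,
        expected="matched patient 123",
        metadata={"model": "gpt-4o", "temperature": 0},
    )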

Troubleshooting Common Issues

Missing Trace Data

If trace data is missing:

  1. Verify BRAINTRUST_API_KEY is correctly set
  2. Ensure BRAINTRUST_TRACING=true is set in the environment the agent runs in
  3. Check that tracing is enabled on the client making the calls:
    # Tracing enabled (spans are captured)
    response = prompt.invoke({"input": "query"})
    
    # Tracing disabled (spans are not captured)
    response = prompt.invoke({"input": "query"}, trace=False)
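
If the agent calls OpenAI directly rather than through a prompt object, Braintrust's wrap_openai helper instruments the client so each completion is captured as a span automatically. A sketch, assuming the openai package is installed and using the project name from earlier:

from braintrust import init_logger, wrap_openai
from openai import OpenAI

init_logger(project="patient_matcher_development")

# The wrapped client records a span per request; a bare OpenAI() client would not
client = wrap_openai(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "example query"}],
)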