How to Debug LLM Agents Using Braintrust

This guide explains how to effectively debug LLM agents using Braintrust's tracing and monitoring capabilities.

When to Use Braintrust for Debugging

Use Braintrust when:

  • You notice performance issues in evaluation metrics
  • You need to understand why a specific test case is failing
  • You want to trace the full lifecycle of an LLM interaction
  • You need to analyze token usage patterns or latency issues
  • You're comparing behavior between different agent versions

Prerequisites

  • Braintrust account with access to your project
  • BRAINTRUST_API_KEY environment variable set
  • Basic familiarity with the LLM agent architecture
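
As a quick sanity check, confirm the key is visible to the process that will run the evaluation. A minimal sketch using only the Python standard library:

import os

# Fail fast if the API key is missing (the env var name comes from the prerequisites above)
if not os.environ.get("BRAINTRUST_API_KEY"):
    raise SystemExit("BRAINTRUST_API_KEY is not set; export it before running evaluations")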

Accessing Traces

Step 1: Run Evaluation with Tracing Enabled

To debug an agent, first run an evaluation with tracing enabled:

# For running a specific evaluation with tracing enabled
export BRAINTRUST_TRACING=true
python -m trially_agents.patient_matcher
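
If you prefer to drive the evaluation from Python, the Braintrust SDK exposes an Eval entry point. The sketch below is illustrative only: the dataset, agent function, and scorer are placeholders, not the actual patient_matcher wiring.

from braintrust import Eval
from autoevals import Levenshtein  # stock string-similarity scorer from Braintrust's autoevals package

def my_agent(query: str) -> str:
    # Hypothetical stand-in for your agent's entry point
    return "example answer"

Eval(
    "patient_matcher_development",  # your Braintrust project name
    data=lambda: [{"input": "example query", "expected": "example answer"}],
    task=my_agent,
    scores=[Levenshtein],
)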

Step 2: Navigate to Experiment Results

  1. Visit the Braintrust dashboard: https://www.braintrust.dev
  2. Select your project (e.g., patient_matcher_development)
  3. Find your experiment in the list and click on it

Step 3: View Experiment Details

In the experiment view, you'll see:

  1. Overall metrics and scores
  2. A list of test cases with inputs, outputs, and scores
  3. Trace information for each test case

Step 4: Analyze Trace Data

For a specific test case, go to the "Logs" tab:

  1. Click on the test case row to expand it
  2. Look for the "Trace" tab in the expanded view
  3. The "Trace" tab shows every span from the agent's LLM API calls, with detailed information for each
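
Spans only appear for code that is instrumented, so if a step is missing from the trace it may simply not be wrapped. A minimal sketch of Braintrust's traced decorator, with a hypothetical agent step:

from braintrust import init_logger, traced

init_logger(project="patient_matcher_development")  # spans are sent to this project

@traced  # each call to this function is recorded as a span
def rank_candidates(query: str) -> list[str]:
    # Hypothetical agent step; nested traced calls show up as child spans
    return ["candidate-1", "candidate-2"]

rank_candidates("example query")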

Understanding Span Information

Each span contains:

  • Input: the prompt or messages sent to the LLM
  • Output: the response returned by the LLM
  • Expected: the expected output for the test case, if one was provided
  • Metadata: contextual information about the request (for example, model and parameters)
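
These fields map directly onto what the SDK records. A minimal sketch of logging them by hand with start_span (the span name and values are illustrative):

from braintrust import init_logger, start_span

init_logger(project="patient_matcher_development")

with start_span(name="llm_call") as span:
    output = "matched patient 123"  # stand-in for a real LLM response
    span.log(
        input={"query": "find eligible patients"},
        output=output,
        expected="matched patient 123",
        metadata={"model": "gpt-4o", "temperature": 0},
    )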

Troubleshooting Common Issues

Missing Trace Data

If trace data is missing:

  1. Verify BRAINTRUST_API_KEY is correctly set
  2. Ensure BRAINTRUST_TRACING=true is set in the environment the agent runs in
  3. Check that tracing is enabled on the client making the calls:
    # Tracing enabled (spans are captured)
    response = prompt.invoke({"input": "query"})
    
    # Tracing disabled (spans are not captured)
    response = prompt.invoke({"input": "query"}, trace=False)
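
If the agent calls OpenAI directly rather than through a prompt object, Braintrust's wrap_openai helper instruments the client so each completion is captured as a span automatically. A sketch, assuming the openai package is installed and using the project name from earlier:

from braintrust import init_logger, wrap_openai
from openai import OpenAI

init_logger(project="patient_matcher_development")

# The wrapped client records a span per request; a bare OpenAI() client would not
client = wrap_openai(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "example query"}],
)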