
Model Evaluation

Introduction

We use model evaluation to determine which models deliver good performance and how they compare to one another. In the context of LLMs, model evaluation lets us assess how well a specific model or agent produces accurate, coherent, and consistent answers. The process involves systematically testing our models against a series of benchmarks to ensure they meet our standards in both development and production.
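To make this concrete, here is a minimal sketch of a benchmark loop, assuming each model is exposed as a callable mapping a prompt string to a response string. The model names, benchmark cases, and exact-match grading rule are illustrative assumptions, not our actual setup.

```python
from typing import Callable, Dict, List, Tuple

def evaluate(model: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Return the fraction of cases where the model's answer matches the expected one."""
    correct = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in cases
    )
    return correct / len(cases)

# Hypothetical benchmark cases: (prompt, expected answer).
benchmark = [
    ("Is aspirin an NSAID? Answer yes or no.", "yes"),
    ("Is insulin taken orally? Answer yes or no.", "no"),
]

# Stand-in models; in practice each would wrap a real LLM API call.
models: Dict[str, Callable[[str], str]] = {
    "model-a": lambda prompt: "yes",
    "model-b": lambda prompt: "no",
}

for name, model in models.items():
    print(f"{name}: accuracy = {evaluate(model, benchmark):.2f}")
```

Running the same cases against every candidate model keeps comparisons objective: each score comes from identical inputs and an identical grading rule.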

Why is Model Evaluation Important?

Performance Benchmarking: Benchmarking provides a framework for objectively assessing new and existing models, ensuring that our evaluations are based on data and performance metrics rather than subjective criteria.

Quality Assurance: Model evaluation ensures that the models produce accurate, relevant, and coherent responses. It's vital to maintaining high standards in our models' outputs.

Model Improvement and Optimization: Regular evaluation helps in identifying areas where models can be improved, leading to continuous enhancement of their performance.

How We Evaluate Models

Model-Level Tests

Initially, we assess how well a model performs specific tasks crucial to our objectives, such as determining a patient's eligibility for a clinical trial. This step filters out models that don't meet baseline requirements, as shown in the sketch below.
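Below is a minimal sketch of such a baseline filter for the eligibility task, assuming a labeled set of (patient summary, trial criteria, gold label) cases. The prompt template, answer format, and 0.8 accuracy threshold are assumptions for illustration, not our production values.

```python
from typing import Callable, Dict, List

# Hypothetical prompt template for the eligibility classification task.
PROMPT_TEMPLATE = (
    "Patient: {patient}\n"
    "Trial criteria: {criteria}\n"
    "Is the patient eligible? Answer 'eligible' or 'ineligible'."
)

def passes_baseline(
    model: Callable[[str], str],
    cases: List[Dict[str, str]],
    threshold: float = 0.8,  # assumed cutoff; tuned per task in practice
) -> bool:
    """Return True if the model's eligibility accuracy meets the threshold."""
    correct = 0
    for case in cases:
        prompt = PROMPT_TEMPLATE.format(patient=case["patient"], criteria=case["criteria"])
        answer = model(prompt).strip().lower()
        correct += answer == case["label"]
    return correct / len(cases) >= threshold

# Illustrative labeled cases and a stand-in model.
cases = [
    {"patient": "58-year-old with type 2 diabetes",
     "criteria": "Adults 40-65 with type 2 diabetes", "label": "eligible"},
    {"patient": "34-year-old, no chronic conditions",
     "criteria": "Adults 40-65 with type 2 diabetes", "label": "ineligible"},
]
stub_model = lambda prompt: "eligible"  # placeholder for a real model call
print("passes baseline:", passes_baseline(stub_model, cases))
```

Models that fall below the threshold are dropped before the more expensive agent-level tests run.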

Agent-Level Tests

After the initial screening, we run agent-specific tests. These are designed to evaluate the nuanced capabilities of each model, particularly its ability to digest the provided context, accurately classify patient-trial pairs, and explain the reasoning behind each classification (see the sketch below).
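As a rough illustration, the check below assumes the agent returns a structured result containing both a classification and a free-text explanation for a patient-trial pair. The AgentResult shape and the grounding heuristic (the explanation must mention a term from the trial criteria) are simplified assumptions; a real harness would use a stronger grounding check.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentResult:
    label: str        # "eligible" or "ineligible"
    explanation: str  # free-text rationale for the decision

def check_agent(
    agent: Callable[[str, str], AgentResult],
    patient: str,
    criteria: str,
    gold_label: str,
) -> bool:
    """Pass only if the label matches gold AND the explanation references the criteria."""
    result = agent(patient, criteria)
    label_ok = result.label == gold_label
    # Crude grounding check: the explanation should mention at least one
    # word from the trial criteria (a placeholder for a real grounding judge).
    grounded = any(
        word.lower() in result.explanation.lower() for word in criteria.split()
    )
    return label_ok and grounded

# Stand-in agent; in practice this would run the full context-digestion loop.
def stub_agent(patient: str, criteria: str) -> AgentResult:
    return AgentResult("eligible", "Patient meets the diabetes criterion and the age range.")

print(check_agent(stub_agent, "58-year-old with type 2 diabetes",
                  "Adults 40-65 with diabetes", "eligible"))
```

Scoring the explanation alongside the label is the key difference from the model-level tests: an agent that classifies correctly but cannot justify its decision from the context still fails.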