> AI agent evaluation framework

Evaluate agents with terminal precision

No server. No signup. Multi-objective scoring from YAML specs. Deterministic code judges + customizable LLM judges, version-controlled in Git.

agentv

$ agentv eval ./evals/math.yaml

Running 3 tests...

PASS addition score: 1.0

PASS multiplication score: 1.0

FAIL division score: 0.4

Results: 2 passed 1 failed

$ agentv compare run-a run-b

Comparing 2 runs...

correctness +12.5% (0.72 -> 0.81)

latency -340ms (1.2s -> 0.86s)

cost +$0.02 ($0.05 -> $0.07)

Overall: improved
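The deltas in the sample comparison above can be reproduced by hand. The sketch below is only an illustration of that arithmetic, not AgentV's implementation: the helper names `relative_change` and `absolute_change` are hypothetical. It also assumes that correctness is reported as a percentage change relative to the baseline, while latency and cost are reported as absolute differences, which is consistent with the numbers shown.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change relative to the baseline value (hypothetical helper)."""
    return (after - before) / before * 100.0


def absolute_change(before: float, after: float) -> float:
    """Plain difference between the new value and the baseline (hypothetical helper)."""
    return after - before


# Values taken from the sample `agentv compare` output above.
correctness_pct = relative_change(0.72, 0.81)  # score, run-a -> run-b
latency_ms = absolute_change(1200, 860)        # 1.2s -> 0.86s, in milliseconds
cost_usd = absolute_change(0.05, 0.07)         # dollars per run

print(f"correctness {correctness_pct:+.1f}%")
print(f"latency {latency_ms:+.0f}ms")
print(f"cost {cost_usd:+.2f} USD")
```

Read this way, a higher correctness score and lower latency count as improvements, while the cost increase is a trade-off to weigh against them.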

Built for your workflow

Local Execution

No cloud dependency. All data stays on your machine. Zero overhead to get started.

Multi-Objective Scoring

Correctness, latency, cost, and safety measured in a single evaluation run.

Code + LLM Judges

Deterministic code validators and customizable LLM judges, composable and extensible.
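To make "deterministic code judge" concrete, here is a minimal sketch of the kind of check such a judge might perform. It is an assumption for illustration only: the function name `judge_contains_number` and its signature are hypothetical and do not reflect AgentV's actual judge interface. The idea is simply that the same output always yields the same score.

```python
import re

# Matches standalone integers: not glued to a word or decimal point,
# and not followed by a fractional part (so "42.5" does not count as 42).
_INTEGER = re.compile(r"(?<![\w.])-?\d+(?!\.?\d)")


def judge_contains_number(output: str, expected: int) -> float:
    """Return 1.0 if `expected` appears as a standalone integer in `output`, else 0.0.

    Hypothetical judge for illustration; not part of AgentV's API.
    """
    found = _INTEGER.findall(output)
    return 1.0 if str(expected) in found else 0.0


# The same input always produces the same score.
print(judge_contains_number("15 + 27 = 42", 42))
print(judge_contains_number("The answer is 41.", 42))
```

A check like this suits answers with a single verifiable value. Open-ended criteria are where an LLM judge is the more natural fit.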

LLM & Agent Targets

Direct LLM providers, plus agent targets: Claude Code, Codex, Pi, Copilot, and OpenCode.

Rubric Grading

Structured criteria with weights and auto-generation. Google ADK-style object rubrics.
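One common way to combine weighted rubric criteria into a single score is a weighted average. The sketch below assumes that approach; the source does not say how AgentV aggregates rubric scores, and the `Criterion` class and `weighted_score` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line: what is checked, how much it matters, how it scored."""
    description: str
    weight: float
    score: float  # expected to lie in [0.0, 1.0]


def weighted_score(criteria: list[Criterion]) -> float:
    """Weighted average of criterion scores (assumed aggregation, not AgentV's)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs at least one criterion with positive weight")
    return sum(c.weight * c.score for c in criteria) / total_weight


rubric = [
    Criterion("States the correct result", weight=2.0, score=1.0),
    Criterion("Shows the intermediate steps", weight=1.0, score=0.5),
    Criterion("Mentions units where relevant", weight=1.0, score=0.0),
]
print(weighted_score(rubric))  # (2*1.0 + 1*0.5 + 1*0.0) / 4 = 0.625
```

Normalizing by the total weight keeps the result in the same 0 to 1 range as the individual scores, whatever scale the weights use.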

A/B Comparison

Compare evaluation runs side-by-side with statistical deltas and regression detection.

Quick Start

Install

npm install -g agentv

Initialize

agentv init

Configure

Copy .env.example to .env and add your API keys.

Create an eval

description: Math evaluation
execution:
  target: default

tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?

Run

agentv eval ./evals/example.yaml

How AgentV Compares

Feature	AgentV	LangWatch	LangSmith	LangFuse
Setup	`npm install`	Cloud account + API key	Cloud account + API key	Cloud account + API key
Server	None (local)	Managed cloud	Managed cloud	Managed cloud
Privacy	All local	Cloud-hosted	Cloud-hosted	Cloud-hosted
CLI-first	✓	✗	Limited	Limited
CI/CD ready	✓	Requires API calls	Requires API calls	Requires API calls
Version control	✓ YAML in Git	✗	✗	✗
Evaluators	Code + LLM + Custom	LLM only	LLM + Code	LLM only
