Vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research. Vivaria is a web application with which users can interact using a web UI and a command-line interface.
Demo
Getting started
See here for a tutorial on running Vivaria on your own computer using Docker Compose.
Features
- Start task environments based on METR Task Standard task definitions
- Run AI agents inside these task environments
- Powerful tools for performing agent elicitation research
- View LLM API requests and responses, agent actions and observations, etc.
- Add tags and comments to important points in a run's trajectory, for later analysis
- Quick feedback loop for "run agent on task, observe issue, make change to agent or reconfigure it, repeat"
- Run results are stored in a PostgreSQL database, making it easy to perform data analysis on them
- Sync run data to Airtable to easily build dashboards and workflows
- Built-in playground for testing arbitrary prompts against LLMs
- Authentication and authorization using Auth0
Screenshots
The Vivaria runs page, displaying a list of recent runs.
A Vivaria run page, showing details for a particular run.
The Vivaria playground, where users can test arbitrary prompts against LLMs.
Security issues
If you discover a security issue in Vivaria, please email vivaria-security@metr.org.
Versioning
The METR Task Standard and pyhooks follow Semantic Versioning.
The Vivaria server's HTTP API, the Vivaria UI, and the viv CLI don't have versions. Their interfaces are unstable and can change at any time.
Contact us
We encourage you to either file an issue on the Vivaria GitHub repo or email vivaria@metr.org.