Comparison with Inspect

Inspect is a framework for large language model evaluations created by the UK AI Safety Institute.

Both Vivaria and Inspect are under active development, so this page may be out-of-date. This page was last updated on August 3, 2024.

Similarities

  • Vivaria and Inspect both allow running AI agents on tasks.
  • Both allow defining configurable agents that can take arbitrary actions to solve problems, e.g. using tools and making complex chains of LLM API calls.
  • Both provide web UIs for viewing the trajectory of an agent when run on a task.
  • Both provide a way for humans to use an agent's scaffolding to solve a task.
    • Vivaria has built-in support for humans choosing an agent's next action.
    • An Inspect user could write a Solver that prompts a human to choose the agent's next action; see the sketch after this list.
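
For illustration, here's a minimal sketch of such a solver using the inspect_ai solver API. The human_next_action name is ours, and a real implementation would want a sturdier interface than input():

```python
from inspect_ai.model import ChatMessageAssistant
from inspect_ai.solver import Generate, TaskState, solver

@solver
def human_next_action():
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # Show the human the conversation so far.
        for message in state.messages:
            print(f"[{message.role}] {message.text}")
        # Record whatever the human types as if the model had produced it.
        # A real solver would loop, execute tool calls, etc.
        action = input("Agent's next action: ")
        state.messages.append(ChatMessageAssistant(content=action))
        return state

    return solve
```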

Design and philosophical differences

Client-server vs. command-line

Vivaria uses a centralized, client-server model. At METR, Vivaria users interact with one of a few dedicated Vivaria server instances using the Vivaria UI and CLI. By contrast, Inspect is a command-line tool first and foremost. Inspect users collaborate by sharing run logs with each other, rather than by using the same centralized web server. In this way, Inspect is more similar to the METR Task Standard workbench than to Vivaria.

Note: The UK AI Safety Institute does have internal tools for orchestrating multiple instances of Inspect.

Agent sandboxing

Vivaria runs agents inside Docker containers, which serve as the primary machines of METR Task Standard task environments. Vivaria agents can run arbitrary code inside the task environment, and Vivaria relies on Docker to prevent them from running arbitrary code on the machine hosting the containers. The METR Task Standard also lets task authors prohibit the agent from accessing the internet, and Vivaria can run such no-internet tasks. Vivaria has code to support no-internet tasks that use aux VMs, but this code is untested.

By contrast, Inspect doesn't intend for agents to run arbitrary code on the machine where Inspect is running. Instead, Inspect provides sandbox environments where agents can run arbitrary code. (Inspect does have a local sandbox environment for writing agents that run arbitrary code on the machine where Inspect is running; it's up to agent authors to sandbox such agents, e.g. by running them in a Docker container.) Inspect supports constructing sandbox environments that are isolated from the internet.
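
For illustration, here's a minimal sketch of an Inspect task whose solver runs code inside a Docker sandbox environment rather than on the host. The task and solver names are invented; sandbox().exec and the sandbox="docker" task option are inspect_ai APIs:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import Generate, TaskState, solver
from inspect_ai.util import sandbox

@solver
def run_in_sandbox():
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # This command executes inside the sandbox container, not on the
        # machine where Inspect itself is running.
        result = await sandbox().exec(["python3", "-c", "print(6 * 7)"])
        state.metadata["sandbox_output"] = result.stdout
        return state

    return solve

@task
def sandbox_demo():
    return Task(
        dataset=[Sample(input="Compute 6 * 7.")],
        solver=run_in_sandbox(),
        sandbox="docker",  # arbitrary code is confined to this sandbox
    )
```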

Task format differences

Terminology

The METR Task Standard refers to "tasks" and "task families", which are groups of tasks. Inspect refers to "samples" and "tasks", which are groups of samples.

Another way to put it: an Inspect sample is roughly equivalent to a Task Standard task, and an Inspect task is roughly equivalent to a Task Standard task family.
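
To make the mapping concrete, here are minimal invented examples of each format. The TaskFamily methods come from the METR Task Standard; Task and Sample come from inspect_ai:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact

# METR Task Standard: a "task family" groups related "tasks".
class TaskFamily:
    standard_version = "0.3.0"  # pin to whichever version you target

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        return {
            "easy": {"a": 1, "b": 1},
            "hard": {"a": 123, "b": 456},
        }

    @staticmethod
    def get_instructions(t: dict) -> str:
        return f"What is {t['a']} + {t['b']}?"

# Inspect: a "task" groups "samples".
@task
def addition():
    return Task(
        dataset=[
            Sample(input="What is 1 + 1?", target="2"),
            Sample(input="What is 123 + 456?", target="579"),
        ],
        scorer=exact(),
    )
```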

Support for non-agentic evaluations

Inspect has native support for evaluating LLMs on one-shot benchmarks that don't involve tool use. By contrast, agents are integrated deeply into the METR Task Standard.

It's possible to write a Vivaria agent that supports evaluating LLMs on such benchmarks. For example, you could write an agent that takes a task's instructions, prompts an LLM with them, extracts a solution from the completion, and returns the solution to Vivaria. The "agent" wouldn't use multi-turn dialog or tools. However, Vivaria isn't built for this purpose.
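
Here's a minimal sketch of that idea. It assumes Vivaria's pyhooks agent library for the getTask and submit calls; the extract_answer helper and the exact shapes of these calls are illustrative assumptions, not a documented API:

```python
import re

from openai import AsyncOpenAI
from pyhooks import Hooks  # Vivaria's agent-side hooks library

hooks = Hooks()

def extract_answer(completion: str) -> str:
    # Hypothetical extraction rule: take whatever follows "ANSWER:".
    match = re.search(r"ANSWER:\s*(.*)", completion, re.DOTALL)
    return match.group(1).strip() if match else completion.strip()

async def main():
    task = await hooks.getTask()  # fetch the task's instructions
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": task.instructions}],
    )
    # One prompt, one completion: no tools, no multi-turn dialog.
    answer = extract_answer(response.choices[0].message.content)
    await hooks.submit(answer)  # return the solution to Vivaria
```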

In the future, METR may rework Vivaria into a tool for orchestrating instances of Inspect. If this happens, Vivaria would support all tasks that Inspect supports, including one-shot benchmarks without tool use.

Features Vivaria has that Inspect doesn't

  • Constructing task environments from task definitions that conform to the METR Task Standard.
    • The UK AI Safety Institute has an internal library that allows exposing Task Standard task environments as sandbox environments. They may publish this library in the future.
  • Built-in support for agents that generate multiple possible next actions, then use an LLM to choose between them.
  • The ability to request that an agent generate more possible actions at one of these decision points.
  • Human oversight: Before an agent chooses its next action, a human can decide whether or not to execute the action.
    • Inspect will soon support this kind of human oversight.
  • The ability to return to a previous point in an agent's trajectory, modify the agent's state or settings, and rerun the agent from there.
  • Agent usage limits based on tokens used, token cost, time, and actions taken.
    • Inspect will soon have support for token limits.
  • The ability for humans to rate, tag, and comment on both trace entries and the agent's possible next actions at a decision point, for later analysis.
  • A set of CLI commands for creating task environments with no agent present. Users can perform manual quality assurance on these task environments. METR also hires human experts to solve tasks within these task environments.
    • It's possible to write an Inspect solver that sleeps forever (see the sketch after this list). This would allow a human to access the task's sandbox environments and either perform QA or solve the task manually.
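
For illustration, a sketch of such a solver:

```python
import asyncio

from inspect_ai.solver import Generate, TaskState, solver

@solver
def sleep_forever():
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # Never return, so the task's sandbox environments stay up and a
        # human can attach to them (e.g. via docker exec) to perform QA
        # or solve the task manually.
        while True:
            await asyncio.sleep(60)

    return solve
```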

Features Inspect has that Vivaria doesn't

  • Evaluating LLMs on tasks defined in the Inspect format.
  • Exporting evaluation log files in a standardized format. Users can share these log files with anyone, not just people with access to the same centralized web server. (See the log-reading sketch after this list.)
  • A VS Code extension for running and debugging tasks, getting task metadata, and viewing evaluation log files.
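
For example, here's a minimal sketch of reading a shared log file with inspect_ai's log API. The file path is illustrative:

```python
from inspect_ai.log import read_eval_log

# Inspect writes logs under ./logs by default; this path is illustrative.
log = read_eval_log("./logs/example-eval.json")
print(log.status)      # e.g. "success"
print(log.eval.model)  # which model was evaluated
for sample in log.samples or []:
    print(sample.id, sample.output.completion)
```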