Peven
Peven is how I'm thinking about building structured evaluations for agents on long-horizon tasks. The core is a colored Petri net: places hold tokens, transitions move them, and the topology of the net defines who sees what, in what order, and what kind of interactions the evaluation permits.
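To make the places/tokens/transitions vocabulary concrete, here is a minimal sketch of a colored Petri net in plain Python. It is not Peven's actual API: the `Net` class, its method names, and the `obs`/`action` places are all hypothetical, chosen only for illustration. Tokens are dicts whose `color` field carries their type, and a transition can only read the places wired to it as inputs, which is how the topology decides who sees what.

```python
from collections import deque


class Net:
    """Toy colored Petri net (hypothetical, not Peven's API)."""

    def __init__(self):
        self.places = {}       # place name -> FIFO queue of tokens
        self.transitions = []  # (name, inputs, outputs, guard, action)

    def add_place(self, name, tokens=()):
        self.places[name] = deque(tokens)

    def add_transition(self, name, inputs, outputs, action,
                       guard=lambda *tokens: True):
        self.transitions.append((name, inputs, outputs, guard, action))

    def enabled(self, transition):
        _, inputs, _, guard, _ = transition
        # Every input place needs at least one token...
        if any(not self.places[p] for p in inputs):
            return False
        # ...and the guard must accept the tokens at the head of each queue.
        return guard(*(self.places[p][0] for p in inputs))

    def step(self):
        """Fire the first enabled transition; return its name, or None."""
        for transition in self.transitions:
            if self.enabled(transition):
                name, inputs, outputs, _, action = transition
                consumed = [self.places[p].popleft() for p in inputs]
                produced = action(*consumed)  # one token per output place
                for place, token in zip(outputs, produced):
                    self.places[place].append(token)
                return name
        return None


net = Net()
net.add_place("obs", [{"color": "obs", "step": 0}])
net.add_place("action")

# The agent transition only has an input arc from "obs", so it can never
# read tokens held in any other place.
net.add_transition(
    "agent", ["obs"], ["action"],
    action=lambda o: [{"color": "action", "step": o["step"], "name": "forward"}],
)
# The executor turns an action token back into the next observation.
net.add_transition(
    "executor", ["action"], ["obs"],
    action=lambda a: [{"color": "obs", "step": a["step"] + 1}],
    guard=lambda a: a["color"] == "action",
)

fired = [net.step() for _ in range(4)]
print(fired)
```

In this toy net the order of interaction falls out of the wiring: the agent cannot fire again until the executor has put a fresh observation token back into `obs`.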
Here's a rollout of gemma4:e4b navigating a MiniGrid DoorKey environment with a 5x5 egocentric view of the grid. To finish the task, Gemma has to pick up a key, pass through a door, and reach the goal. At each step, Gemma can turn, move forward, pick up the key, open the door, or ask a planner model, deepseek-r1:7b, for help. The left column shows the current observation and rollout state; the net on the right shows the same rollout moving through state, model, tool, and executor nodes. See the example code here.
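The control flow of such a rollout can be sketched without any real models or environment. Everything below is an assumption made for illustration: the action names, the `route` and `rollout` functions, and the scripted stubs that stand in for Gemma, the planner, and MiniGrid. The only point is the routing: a request for help goes to a tool node (the planner), while every other action goes to the executor.

```python
# Hypothetical action names; the real rollout may label them differently.
ENV_ACTIONS = {"turn_left", "turn_right", "forward", "pickup", "open_door"}


def route(action):
    """Decide which node handles the agent's chosen action."""
    if action == "ask_planner":
        return "tool"
    if action in ENV_ACTIONS:
        return "executor"
    raise ValueError(f"unknown action: {action!r}")


def rollout(policy, planner, execute, obs, max_steps=20):
    """Run the agent loop and record which node handled each step."""
    trace = []
    hint = None  # latest planner advice, visible only to the policy
    for _ in range(max_steps):
        action = policy(obs, hint)
        node = route(action)
        trace.append((node, action))
        if node == "tool":
            hint = planner(obs)          # the planner sees the same observation
        else:
            obs, done = execute(action)  # the environment advances one step
            hint = None                  # advice is consumed once acted on
            if done:
                break
    return trace


# Scripted stand-ins for the agent model, the planner model, and the environment.
script = iter(["ask_planner", "forward", "pickup"])


def policy(obs, hint):
    return next(script)


def planner(obs):
    return "get the key first"


def execute(action):
    # Pretend the episode ends once the key is picked up.
    return f"obs after {action}", action == "pickup"


trace = rollout(policy, planner, execute, obs="start")
print(trace)
```

Keeping the routing decision in one small function mirrors what the net does with arcs: the set of places an action token can flow to is fixed by the structure, not by the model's output.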