Verifiable Systems for Agents

Zakariasson's most underappreciated component of the software-factory: how the agent checks its own work. Without verification, agents drift and humans become the bottleneck again.

Layers of verification

  1. Unit + integration tests — the agent writes and runs them. In Zakariasson's music-agent demo, the agent self-authored Playwright end-to-end tests that spawn browsers, click by data-testid, and validate after every change (first sketch after this list).
  2. Computer-use verification — agent controls the screen, records a video of itself testing, returns the video. Cursor launched this for cloud agents; Zakariasson called it "an AGI moment" when it worked internally.
  3. Automated review — Cursor's Bugbot reviews PRs on GitHub and comments back. Plus "ask the agent to just review the changes it made."
  4. Screenshot sweeps — for UI-consistency checks, the agent opens every page referencing a changed concept, screenshots all of them, and returns them for human visual diff (second sketch below).
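
A minimal sketch of the self-authored check in item 1, assuming a Playwright setup; the route, data-testid values, and expected text are hypothetical stand-ins, not taken from the demo.

```ts
import { test, expect } from '@playwright/test';

// Hypothetical flow the agent might regenerate after each change.
// The route and data-testid selectors are illustrative, not from the demo.
test('playlist page loads and play button responds', async ({ page }) => {
  await page.goto('http://localhost:3000/playlists/demo');

  // Click by data-testid, the selector style the demo's tests used.
  await page.getByTestId('play-button').click();

  // Validate the UI actually reacted: spinner first, then playing state.
  await expect(page.getByTestId('loading-spinner')).toBeVisible();
  await expect(page.getByTestId('now-playing')).toContainText('demo');
});
```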

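A sketch of the screenshot sweep in item 4, again with hypothetical routes: open every page that references the changed concept, capture each one, and hand the images to a human for visual diff.

```ts
import { chromium } from 'playwright';

// Hypothetical list of pages that reference the changed concept.
const ROUTES = ['/playlists', '/playlists/demo', '/search?q=playlist', '/settings/library'];

async function screenshotSweep(baseUrl: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  for (const route of ROUTES) {
    await page.goto(baseUrl + route, { waitUntil: 'networkidle' });
    // File names double as the index the human reviewer diffs against.
    const name = route.replace(/[/?=]/g, '_');
    await page.screenshot({ path: `sweep${name}.png`, fullPage: true });
  }
  await browser.close();
}

screenshotSweep('http://localhost:3000').catch(console.error);
```
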
Backend vs frontend asymmetry

  • Backend: clear contracts, easy to verify. API returns expected shape, tests pass, done (sketch after this list).
  • Frontend / UI: much harder. "You actually need to click around and make sure things work — the buttons actually have a loading spinner, etc." This is why computer-use verification unlocked a step-change.
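
A sketch of why the backend case is the easy one, assuming a hypothetical endpoint and response shape: the contract is deterministic, so the agent can assert it without human judgment.

```ts
import assert from 'node:assert/strict';

// Hypothetical endpoint and response shape. Requires Node 18+ for global fetch.
async function verifyPlaylistContract(baseUrl: string) {
  const res = await fetch(`${baseUrl}/api/playlists/demo`);
  assert.equal(res.status, 200);

  const body = await res.json();
  // Shape check: expected fields with expected types, nothing left to eyeball.
  assert.equal(typeof body.id, 'string');
  assert.equal(typeof body.title, 'string');
  assert.ok(Array.isArray(body.tracks));
}

verifyPlaylistContract('http://localhost:3000')
  .then(() => console.log('contract ok'))
  .catch((err) => { console.error(err); process.exit(1); });
```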

This asymmetry explains why levels L5–L6 on levels-of-autonomy-shapiro are reached first in backend work.

Why "unsolved"

User-acceptance testing ("does this look right, feel right, stay consistent across pages?") is not reducible to deterministic checks. Zakariasson's answer: use cloud agents like QA consultants. Give them navigation instructions the way you'd brief a human QA, let them execute, and review the video (a sketch of such a brief follows).
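
To make "brief it like a human QA" concrete, one hypothetical shape for such a brief; none of this is a real Cursor API, just one way to write the instructions down.

```ts
// Hypothetical types: a QA brief for a computer-use agent, structured the way
// you would brief a human tester. Not a real Cursor (or any vendor) API.
interface QaBrief {
  startUrl: string;
  steps: string[];              // plain-language navigation instructions
  acceptanceCriteria: string[]; // what "looks right" means for this change
  recordVideo: boolean;         // the artifact a human reviews afterward
}

const brief: QaBrief = {
  startUrl: 'https://staging.example.com/playlists',
  steps: [
    'Open the first playlist and press play',
    'Rename the playlist and confirm the new name appears in the sidebar',
    'Visit every page that shows the playlist title and check it updated',
  ],
  acceptanceCriteria: [
    'Play button shows a loading spinner before audio starts',
    'Renamed title is consistent across all pages',
  ],
  recordVideo: true,
};

// Placeholder for whatever runner executes the brief and returns the recording.
declare function dispatchToCloudAgent(brief: QaBrief): Promise<{ videoUrl: string }>;
```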

Cross-references

Connects to