Verifiable Systems for Agents
The component of the software-factory that Zakariasson considers most underappreciated: how the agent checks its own work. Without verification, agents drift and humans become the bottleneck again.
Layers of verification
- Unit + integration tests: the agent writes and runs them. In Zakariasson's music-agent demo, the agent self-authored Playwright end-to-end tests that spawn browsers, click by `data-testid`, and validate after every change.
- Computer-use verification: the agent controls the screen, records a video of itself testing, and returns the video. Cursor launched this for cloud agents; Zakariasson called it "an AGI moment" when it worked internally.
- Automated review: Cursor's `bugbot` reviews PRs on GitHub and comments back. Plus "ask the agent to just review the changes it made."
- Screenshot sweeps: for UI-consistency checks, the agent opens every page referencing a changed concept, screenshots all of them, and returns them for a human visual diff.
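The first and last layers can be sketched with Playwright's Python API. This is a minimal sketch, not the demo's actual test code: the `Step` type, the function names, the test ids and the routes are all hypothetical, and `verify_in_browser` assumes Playwright and a Chromium build are installed.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Step:
    click: str   # data-testid of the element to click (hypothetical ids)
    expect: str  # data-testid that should be visible afterwards


def run_flow(page, base_url, steps):
    """Click through steps on a Playwright page; return the steps that failed."""
    page.goto(base_url)
    failures = []
    for step in steps:
        page.get_by_test_id(step.click).click()
        # is_visible() does not wait; a real suite would likely prefer
        # Playwright's auto-waiting expect(...).to_be_visible().
        if not page.get_by_test_id(step.expect).is_visible():
            failures.append(step)
    return failures


def screenshot_sweep(page, base_url, routes, out_dir):
    """Screenshot every route so a human can compare the pages side by side."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for route in routes:
        page.goto(base_url + route)
        name = route.strip("/").replace("/", "_") or "index"
        target = out / f"{name}.png"
        page.screenshot(path=str(target), full_page=True)
        saved.append(target)
    return saved


def verify_in_browser(base_url, steps, routes, out_dir="screens"):
    """Run both checks in headless Chromium (assumes Playwright is installed)."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        failures = run_flow(page, base_url, steps)
        shots = screenshot_sweep(page, base_url, routes, out_dir)
        browser.close()
    return failures, shots
```

An agent could call something like `verify_in_browser("http://localhost:3000", [Step("play-button", "now-playing")], ["/", "/playlists"])` after every change, treating a non-empty failure list as a signal to keep working and handing the screenshots back for human review.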
Backend vs frontend asymmetry
- Backend: clear contracts, easy to verify. API returns expected shape, tests pass, done.
- Frontend / UI: much harder. "You actually need to click around and make sure things work — the buttons actually have a loading spinner, etc." This is why computer-use verification unlocked a step-change.
This asymmetry explains why levels L5–L6 on levels-of-autonomy-shapiro are reached first in backend work.
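The backend side of the asymmetry is easy to illustrate, because "API returns expected shape" reduces to a deterministic check. A minimal sketch, assuming a hypothetical playlist endpoint; the field names and `PLAYLIST_SHAPE` are invented for illustration.

```python
import json

# Hypothetical response contract: nested dicts, one-element lists, leaf types.
PLAYLIST_SHAPE = {
    "id": int,
    "title": str,
    "tracks": [{"id": int, "title": str}],
}


def shape_errors(payload, expected, path="$"):
    """Return human-readable mismatches between a payload and an expected shape."""
    errors = []
    if isinstance(expected, dict):
        if not isinstance(payload, dict):
            return [f"{path}: expected object, got {type(payload).__name__}"]
        for key, sub in expected.items():
            if key not in payload:
                errors.append(f"{path}.{key}: missing")
            else:
                errors.extend(shape_errors(payload[key], sub, f"{path}.{key}"))
    elif isinstance(expected, list):
        if not isinstance(payload, list):
            return [f"{path}: expected array, got {type(payload).__name__}"]
        # Every item must match the single element given in the shape.
        for i, item in enumerate(payload):
            errors.extend(shape_errors(item, expected[0], f"{path}[{i}]"))
    elif not isinstance(payload, expected):
        # Note: Python treats bool as an int, so True passes an int check.
        errors.append(
            f"{path}: expected {expected.__name__}, got {type(payload).__name__}"
        )
    return errors


body = json.loads('{"id": "1", "tracks": [{"id": 7}]}')
for problem in shape_errors(body, PLAYLIST_SHAPE):
    print(problem)
```

An empty result means the contract holds. No comparable one-function check exists for "the buttons actually have a loading spinner", which is the frontend half of the asymmetry.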
Why "unsolved"
User-acceptance testing ("does this look right, feel right, stay consistent across pages?") is not reducible to deterministic checks. Zakariasson's answer: use cloud agents like QA consultants — give them navigation instructions the way you'd brief a human QA, let them execute, review the video.
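One way to make the QA-consultant brief repeatable is to keep it as structured data and render it to plain instructions. This is only a sketch under assumptions: `QABrief` and its fields are hypothetical, and how the rendered text reaches a cloud agent is left out because the source does not describe that interface.

```python
from dataclasses import dataclass, field


@dataclass
class QABrief:
    goal: str
    start_url: str
    steps: list[str]
    look_for: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the brief as the plain text you would hand to a human QA."""
        lines = [f"Goal: {self.goal}", f"Start at: {self.start_url}", "Steps:"]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.steps, 1)]
        if self.look_for:
            lines.append("While testing, check:")
            lines += [f"  - {item}" for item in self.look_for]
        lines.append("Record the session and return the video.")
        return "\n".join(lines)


brief = QABrief(
    goal="Check loading states on the playlist page",
    start_url="http://localhost:3000",
    steps=["Open a playlist", "Press play"],
    look_for=["Buttons show a loading spinner", "Layout matches other pages"],
)
print(brief.render())
```

The judgment calls ("does this look right?") stay with the human who reviews the video; the brief only fixes where the agent goes and what it should pay attention to.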
Cross-references
- software-factory — verification is one of four components
- parallel-agent-competitions — tests are how you pick the winner
- driving-into-mud — what happens when verification is absent on long runs
- jagged-intelligence — why backend reaches autonomy first
Connects to
- pipeline-as-verifier — system-side complement to agent-side self-check.