Verifiable Systems for Agents

Zakariasson's most underappreciated component of the software-factory: how the agent checks its own work. Without verification, agents drift and humans become the bottleneck again.

Layers of verification

  1. Unit + integration tests — the agent writes and runs them. In Zakariasson's music-agent demo, the agent self-authored Playwright end-to-end tests that spawn browsers, click by data-testid, and validate after every change (first sketch after this list).
  2. Computer-use verification — agent controls the screen, records a video of itself testing, returns the video. Cursor launched this for cloud agents; Zakariasson called it "an AGI moment" when it worked internally.
  3. Automated review — Cursor's Bugbot reviews PRs on GitHub and comments back. Plus "ask the agent to just review the changes it made."
  4. Screenshot sweeps — for UI-consistency checks, the agent opens every page referencing a changed concept, screenshots all of them, and returns them for human visual diff (second sketch below).
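
A minimal sketch of the self-authored check in item 1, assuming a Playwright setup; the route, data-testid values, and expected text are hypothetical stand-ins, not taken from the demo.

```ts
import { test, expect } from '@playwright/test';

// Hypothetical flow the agent might regenerate after each change.
// The route and data-testid selectors are illustrative, not from the demo.
test('playlist page loads and play button responds', async ({ page }) => {
  await page.goto('http://localhost:3000/playlists/demo');

  // Click by data-testid, the selector style the demo's tests used.
  await page.getByTestId('play-button').click();

  // Validate the UI actually reacted: spinner first, then playing state.
  await expect(page.getByTestId('loading-spinner')).toBeVisible();
  await expect(page.getByTestId('now-playing')).toContainText('demo');
});
```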

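A sketch of the screenshot sweep in item 4, again with hypothetical routes: open every page that references the changed concept, capture each one, and hand the images to a human for visual diff.

```ts
import { chromium } from 'playwright';

// Hypothetical list of pages that reference the changed concept.
const ROUTES = ['/playlists', '/playlists/demo', '/search?q=playlist', '/settings/library'];

async function screenshotSweep(baseUrl: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  for (const route of ROUTES) {
    await page.goto(baseUrl + route, { waitUntil: 'networkidle' });
    // File names double as the index the human reviewer diffs against.
    const name = route.replace(/[/?=]/g, '_');
    await page.screenshot({ path: `sweep${name}.png`, fullPage: true });
  }
  await browser.close();
}

screenshotSweep('http://localhost:3000').catch(console.error);
```
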
Backend vs frontend asymmetry

  • Backend: clear contracts, easy to verify. API returns expected shape, tests pass, done (sketch after this list).
  • Frontend / UI: much harder. "You actually need to click around and make sure things work — the buttons actually have a loading spinner, etc." This is why computer-use verification unlocked a step-change.
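
A sketch of why the backend case is the easy one, assuming a hypothetical endpoint and response shape: the contract is deterministic, so the agent can assert it without human judgment.

```ts
import assert from 'node:assert/strict';

// Hypothetical endpoint and response shape. Requires Node 18+ for global fetch.
async function verifyPlaylistContract(baseUrl: string) {
  const res = await fetch(`${baseUrl}/api/playlists/demo`);
  assert.equal(res.status, 200);

  const body = await res.json();
  // Shape check: expected fields with expected types, nothing left to eyeball.
  assert.equal(typeof body.id, 'string');
  assert.equal(typeof body.title, 'string');
  assert.ok(Array.isArray(body.tracks));
}

verifyPlaylistContract('http://localhost:3000')
  .then(() => console.log('contract ok'))
  .catch((err) => { console.error(err); process.exit(1); });
```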

This asymmetry explains why levels L5–L6 on levels-of-autonomy-shapiro are reached first in backend work.

Why "unsolved"

User-acceptance testing ("does this look right, feel right, stay consistent across pages?") is not reducible to deterministic checks. Zakariasson's answer: use cloud agents like QA consultants. Give them navigation instructions the way you'd brief a human QA, let them execute, and review the video (a sketch of such a brief follows).
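
To make "brief it like a human QA" concrete, one hypothetical shape for such a brief; none of this is a real Cursor API, just one way to write the instructions down.

```ts
// Hypothetical types: a QA brief for a computer-use agent, structured the way
// you would brief a human tester. Not a real Cursor (or any vendor) API.
interface QaBrief {
  startUrl: string;
  steps: string[];              // plain-language navigation instructions
  acceptanceCriteria: string[]; // what "looks right" means for this change
  recordVideo: boolean;         // the artifact a human reviews afterward
}

const brief: QaBrief = {
  startUrl: 'https://staging.example.com/playlists',
  steps: [
    'Open the first playlist and press play',
    'Rename the playlist and confirm the new name appears in the sidebar',
    'Visit every page that shows the playlist title and check it updated',
  ],
  acceptanceCriteria: [
    'Play button shows a loading spinner before audio starts',
    'Renamed title is consistent across all pages',
  ],
  recordVideo: true,
};

// Placeholder for whatever runner executes the brief and returns the recording.
declare function dispatchToCloudAgent(brief: QaBrief): Promise<{ videoUrl: string }>;
```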

Cross-references

Connects to