Niels Bantilan

Chief ML Engineer at Union.ai and a core maintainer of Flyte (open-source workflow orchestration) for 5+ years. Author of UnionML and creator of Pandera. Customers include LinkedIn, Stripe, Spotify, and Mistral.

Grounded thesis

"What is the orchestration stack for observable, debuggable, and durable agents?" The question is framed through his own agent-adoption timeline: 2022 prototyping → 2023 fine-tuning → 2024 LangChain apps → 2025 Cursor/Claude Code → 2026 "year of agents". Flyte 2 is Union's answer: "durable, fully dynamic, crash-proof… infrastructure-aware orchestration."

Grounded quotes

  • On the failure surface: "Containers go boom. Your nodes are killed by the scheduler, spot instances are preempted. Then ultimately these kinds of infra failures lead to agent memory loss or corruption, and this wipes out precious context."
  • On the insight: "Agents can actually recover reliably from all the errors in this stack that I mentioned earlier, even infrastructure level ones, but only if you give them the right hooks."
  • On replay logs: "A replay log is essentially a service that records the state of an agent and its subtasks at a super granular level at each step… you can avoid re-executing tasks, prevents memory and context loss."
  • Case study (agentic deep research system): "Checkpoint-based recovery… they use spot instances heavily to reduce costs, and it became basically a non-issue. Spot instances would go away, they would come back a few seconds later."
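
The replay-log idea in the quotes above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Flyte 2's API or the speaker's implementation: `ReplayLog`, its `run` method, the JSON file on local disk, and the step names `search`, `summarize`, and `report` are all hypothetical, and a real system would presumably record state in a durable remote service rather than a local file. The sketch records each step's result as soon as it completes, so a restarted agent reads the log and skips steps that already finished.

```python
import json
import os
import tempfile


class ReplayLog:
    """Hypothetical file-backed replay log: maps step keys to recorded results."""

    def __init__(self, path):
        self.path = path
        self.records = {}
        # On restart, reload whatever was recorded before the crash.
        if os.path.exists(path):
            with open(path) as f:
                self.records = json.load(f)

    def run(self, key, fn, *args):
        # Replay: a step that already completed is not executed again.
        if key in self.records:
            return self.records[key]
        result = fn(*args)  # must be JSON-serializable in this sketch
        self.records[key] = result
        self._flush()
        return result

    def _flush(self):
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written log behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.records, f)
        os.replace(tmp, self.path)


calls = []  # tracks which steps actually executed, across both attempts


def step(name, fail=False):
    calls.append(name)
    if fail:
        raise RuntimeError("simulated spot-instance preemption")
    return f"{name}-done"


def agent(log, fail_at=None):
    # Each subtask goes through the log, keyed by its (assumed unique) name.
    return [
        log.run(name, step, name, name == fail_at)
        for name in ["search", "summarize", "report"]
    ]


log_path = os.path.join(tempfile.mkdtemp(), "replay.json")

# First attempt: the process "dies" during the last step.
try:
    agent(ReplayLog(log_path), fail_at="report")
except RuntimeError:
    pass

# Second attempt: a fresh ReplayLog reloads the file and resumes.
final = agent(ReplayLog(log_path))
print(calls)  # ['search', 'summarize', 'report', 'report']
print(final)  # ['search-done', 'summarize-done', 'report-done']
```

On the second attempt only `report` runs again; `search` and `summarize` are served from the log, which is the "avoid re-executing tasks" behavior the quote describes. Keying by step name is a simplification: it assumes steps are deterministic and uniquely named, which a production replay log would need to enforce more carefully.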

See also