Problem it solves
When one agent in a network fails, users see either the whole run fail or no failure at all. Neither reflects the actual state of the run.
When to use
Whenever a subagent fails during a multi-agent run and other agents in the network are still running.
When not to use
For failures in a single-agent run, where there is no network to isolate the failed agent from.
Governing principle
One agent's failure does not automatically invalidate the run, but it does make the run's validity conditional on whether the run can stand without that agent's output. The human, not the orchestrator, decides whether to continue.
Required components
Interaction flow
Subagent failure detected
Agent N fails during execution.
Failure isolated
The orchestrator isolates the failed agent from the running network. Downstream agents that depend on it are paused.
Network Degraded State surfaces
The UI transitions to a degraded state, showing which agent failed, what it was working on, and which downstream agents are affected.
Blast radius assessment
The surface shows the failure's blast radius: which agents are blocked, what outputs will be missing, and whether the run is still valid without this agent's output.
Human decides
The user chooses to continue (accepting the missing output), attempt recovery on the failed agent, or abort the run.
Decision logged
The failure, blast radius assessment, and human decision are logged.
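The isolation and blast radius steps above can be sketched as a walk over the agent dependency graph. This is a minimal illustration, not a prescribed implementation: `AgentNode`, `BlastRadius`, and `assessBlastRadius` are hypothetical names, and the sketch assumes the orchestrator knows each agent's direct dependencies and expected outputs. It also assumes that blocked agents stay paused, so their outputs count as missing.

```typescript
// All names here (AgentNode, BlastRadius, assessBlastRadius) are hypothetical.
type AgentId = string;

interface AgentNode {
  id: AgentId;
  dependsOn: AgentId[]; // direct upstream agents
  outputs: string[];    // outputs this agent is expected to produce
}

interface BlastRadius {
  failedAgent: AgentId;
  blockedAgents: AgentId[]; // downstream agents to pause
  missingOutputs: string[]; // outputs lost if the run continues as-is
}

function assessBlastRadius(agents: AgentNode[], failedAgent: AgentId): BlastRadius {
  const blocked = new Set<AgentId>();
  const queue: AgentId[] = [failedAgent];

  // Breadth-first walk: any agent that depends on a failed or blocked agent is blocked.
  while (queue.length > 0) {
    const current = queue.shift() as AgentId;
    for (const agent of agents) {
      const dependsOnCurrent = agent.dependsOn.indexOf(current) !== -1;
      if (dependsOnCurrent && agent.id !== failedAgent && !blocked.has(agent.id)) {
        blocked.add(agent.id);
        queue.push(agent.id);
      }
    }
  }

  // Assumption: blocked agents stay paused, so their outputs are missing too.
  const missingOutputs: string[] = [];
  for (const agent of agents) {
    if (agent.id === failedAgent || blocked.has(agent.id)) {
      for (const output of agent.outputs) missingOutputs.push(output);
    }
  }

  return { failedAgent, blockedAgents: Array.from(blocked), missingOutputs };
}
```

Agents that do not depend on the failed agent, directly or transitively, are left out of `blockedAgents` and can keep running. Whether the run is still valid without the missing outputs is deliberately not computed here; that judgment is left to the human.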
Governance requirements
Partial failures are governance events. The failure, its blast radius, and the human decision must be logged. "Continue without this agent" decisions must document what output will be missing.
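One way to make these logging requirements checkable is to give the log entry a fixed shape and validate it before it is written. The sketch below is an assumption about what such a record might contain; `HumanDecision`, `PartialFailureLogEntry`, and `validateLogEntry` are hypothetical names, not part of any existing API.

```typescript
// Hypothetical log record for a partial failure; field names are illustrative.
type HumanDecision =
  | { kind: "continue"; acknowledgedMissingOutputs: string[] }
  | { kind: "recover" }
  | { kind: "abort" };

interface PartialFailureLogEntry {
  runId: string;
  failedAgent: string;
  blockedAgents: string[];  // blast radius: agents paused by the failure
  missingOutputs: string[]; // blast radius: outputs that will not be produced
  decision: HumanDecision;
  decidedBy: string;        // the human who made the decision
  decidedAt: string;        // ISO 8601 timestamp
}

// Returns a list of problems; an empty list means the entry can be logged.
function validateLogEntry(entry: PartialFailureLogEntry): string[] {
  const problems: string[] = [];

  if (entry.decidedBy.trim() === "") {
    problems.push("decision must be attributed to a human");
  }

  // "Continue" decisions must document every output that will be missing.
  const decision = entry.decision;
  if (decision.kind === "continue") {
    for (const output of entry.missingOutputs) {
      if (decision.acknowledgedMissingOutputs.indexOf(output) === -1) {
        problems.push("continue decision does not document missing output: " + output);
      }
    }
  }

  return problems;
}
```

Modeling the decision as a tagged union means a "continue" decision cannot be constructed without a list of acknowledged missing outputs, while "recover" and "abort" carry no such list.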
Accessibility notes
The degraded state must be announced to assistive technology via role="alert". The transition from normal to degraded state must not rely solely on visual changes such as color. Each affected agent's state must be surfaced individually, not only as an aggregate.
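As a rough sketch of these notes, the function below builds markup for the degraded state: a short summary inside a role="alert" container, followed by a list that states each agent's status in text. `AgentStatus` and `renderDegradedState` are hypothetical names, and keeping the per-agent list outside the alert is a design assumption made here so the announcement stays brief.

```typescript
// Hypothetical rendering helper; names and markup structure are illustrative.
type AgentState = "failed" | "paused" | "running";

interface AgentStatus {
  name: string;
  state: AgentState;
  detail: string; // e.g. what the agent was working on
}

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function renderDegradedState(statuses: AgentStatus[]): string {
  const failedCount = statuses.filter((s) => s.state === "failed").length;
  const pausedCount = statuses.filter((s) => s.state === "paused").length;

  // The summary is what a screen reader announces when the alert appears.
  const summary =
    "Run degraded: " + failedCount + " agent(s) failed, " +
    pausedCount + " downstream agent(s) paused.";

  // Each agent's state is written out as text, so it does not depend on color or icons.
  const items = statuses
    .map((s) => "<li>" + escapeHtml(s.name) + ": " + s.state + ". " + escapeHtml(s.detail) + "</li>")
    .join("");

  return (
    '<div role="alert">' + escapeHtml(summary) + "</div>" +
    '<ul aria-label="Agent states">' + items + "</ul>"
  );
}
```

Screen readers generally announce an alert when it is added to the page or when its content changes, so this markup would need to be inserted at the moment the run becomes degraded rather than being present from the initial page load.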