Partial Failure Isolation · Multi-Agent Governance

Problem it solves

When one agent in a network fails, users typically see one of two outcomes: the whole run is reported as failed, or the failure is hidden entirely. Neither reflects the actual state of the run.

When to use

Whenever a subagent fails during a multi-agent run and other agents in the network are still running.

When not to use

For single-agent runs, where there is no network to isolate the failed agent from.

Governing principle

One agent's failure does not automatically invalidate the run, but it does make the run's validity conditional. The human, not the orchestrator, decides whether to continue.

Required Components

HC10 Network Degraded State (new)

HC06 Recovery & Override

Interaction Flow

Subagent failure detected

Agent N fails during execution.

Failure isolated

The orchestrator isolates the failed agent from the running network. Downstream agents that depend on it are paused.
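
The isolation step can be read as a graph traversal: mark the failed agent, then pause everything that transitively consumes its output. The following is a minimal sketch under stated assumptions, not a prescribed implementation. The dependency map shape, the AgentState values, and the agent names are all hypothetical.

```python
from collections import deque
from enum import Enum


class AgentState(Enum):
    RUNNING = "running"
    FAILED = "failed"
    PAUSED = "paused"


def isolate_failure(failed, dependencies, states):
    """Mark `failed` as FAILED and pause every agent downstream of it.

    `dependencies` maps each agent to the agents whose output it consumes
    (an assumed shape). Returns the paused agents in discovery order.
    """
    # Invert the map: upstream agent -> agents that consume its output.
    dependents = {}
    for agent, upstreams in dependencies.items():
        for upstream in upstreams:
            dependents.setdefault(upstream, []).append(agent)

    states[failed] = AgentState.FAILED
    paused = []
    seen = {failed}
    queue = deque([failed])
    # Breadth-first walk so indirect dependents are paused as well.
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                states[child] = AgentState.PAUSED
                paused.append(child)
                queue.append(child)
    return paused


# Hypothetical four-agent network; "translate" is independent of "research".
deps = {
    "research": [],
    "summarize": ["research"],
    "review": ["summarize"],
    "translate": [],
}
states = {name: AgentState.RUNNING for name in deps}
paused = isolate_failure("research", deps, states)
```

Agents with no path from the failed agent, such as "translate" in this example, keep running, which is the point of isolating rather than aborting.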

Network Degraded State surfaces

The UI transitions to a degraded state, showing which agent failed, what it was working on, and which downstream agents are affected.

Blast radius assessment

The surface shows the failure's blast radius: which agents are blocked, what outputs will be missing, and whether the run is still valid without this agent's output.
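
One way to derive the three parts of the assessment (blocked agents, missing outputs, conditional validity) is sketched below. The BlastRadius record, the outputs map, and the notion of a required-outputs set are assumptions made for illustration. How a real run defines validity is not specified by this pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    failed_agent: str
    blocked_agents: tuple
    missing_outputs: tuple
    run_valid_without_agent: bool


def assess_blast_radius(failed, blocked, outputs, required_outputs):
    """Summarize what the run loses if it continues without `failed`.

    `outputs` maps agent -> names of the outputs it produces.
    `required_outputs` is the set the run cannot be valid without.
    Both are hypothetical inputs.
    """
    missing = []
    for agent in [failed, *blocked]:
        missing.extend(outputs.get(agent, []))
    # The run stays valid only if nothing required is among the losses.
    still_valid = not any(name in required_outputs for name in missing)
    return BlastRadius(failed, tuple(blocked), tuple(missing), still_valid)


outputs = {
    "research": ["sources"],
    "summarize": ["summary"],
    "translate": ["translation"],
}
optional_loss = assess_blast_radius(
    "research", ["summarize"], outputs, required_outputs={"translation"}
)
critical_loss = assess_blast_radius(
    "research", ["summarize"], outputs, required_outputs={"summary"}
)
```

The validity flag is advisory input for the person making the call. It does not trigger a continue or abort on its own.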

Human decides

The user chooses to continue (accepting the missing output), attempt recovery on the failed agent, or abort the run.
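
A small sketch of how an orchestrator might enforce that the choice comes from a person: it refuses to proceed until a decision is supplied, and it only maps that decision to an action. The Decision values mirror the three options above. The action names and the DecisionRequired exception are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    RECOVER = "recover"
    ABORT = "abort"


class DecisionRequired(Exception):
    """Raised when the orchestrator is asked to proceed without human input."""


def resolve_degraded_run(human_choice):
    """Translate a human decision into the orchestrator's next action.

    The orchestrator never picks a default: a missing choice is an error.
    """
    if human_choice is None:
        raise DecisionRequired("A degraded run needs an explicit human decision.")
    decision = Decision(human_choice)  # raises ValueError on unknown input
    actions = {
        Decision.CONTINUE: "continue_without_failed_agent",
        Decision.RECOVER: "retry_failed_agent",
        Decision.ABORT: "stop_all_agents",
    }
    return actions[decision]


next_action = resolve_degraded_run("recover")
```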

Decision logged

The failure, blast radius assessment, and human decision are logged.

Governance requirements

Partial failures are governance events. The failure, its blast radius, and the human decision must be logged. "Continue without this agent" decisions must document what output will be missing.
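
As an illustration of these requirements, the sketch below builds a log record and rejects a "continue" decision that does not list the missing outputs. The field names and the decided_by parameter are assumptions. The pattern itself does not prescribe a log schema.

```python
import json
from datetime import datetime, timezone


def build_governance_record(
    failed_agent, blocked_agents, decision, missing_outputs=None, decided_by="unknown"
):
    """Build one log entry covering failure, blast radius, and decision."""
    # A "continue" decision must document what output will be missing.
    if decision == "continue" and not missing_outputs:
        raise ValueError("'continue' decisions must list the missing outputs")
    return {
        "event": "partial_failure",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "failed_agent": failed_agent,
        "blocked_agents": list(blocked_agents),
        "missing_outputs": list(missing_outputs or []),
        "decision": decision,
        "decided_by": decided_by,
    }


record = build_governance_record(
    "research",
    ["summarize"],
    "continue",
    missing_outputs=["sources", "summary"],
    decided_by="alice",
)
print(json.dumps(record, indent=2))
```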

Accessibility notes

Degraded state must be announced via role="alert". The transition from normal to degraded state must not rely solely on visual changes. Each affected agent's state must be individually surfaced.
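
A hedged sketch of markup that would satisfy these notes: one container with role="alert", a text description of the failure, and one list item per affected agent with its state spelled out in words rather than conveyed by color or icon alone. The function name and markup structure are illustrative, not part of the HC10 component specification.

```python
from html import escape


def render_degraded_alert(failed_agent, task, affected):
    """Return alert markup naming the failure and each affected agent.

    `affected` is a list of (agent_name, state_text) pairs.
    """
    items = "".join(
        f"<li>{escape(name)}: {escape(state)}</li>" for name, state in affected
    )
    return (
        '<div role="alert">'
        f"<p>Agent {escape(failed_agent)} failed while {escape(task)}.</p>"
        f"<ul>{items}</ul>"
        "</div>"
    )


markup = render_degraded_alert(
    "research",
    "collecting sources",
    [("summarize", "paused"), ("review", "paused")],
)
```

Assistive technology generally announces an alert when its content is inserted or changes, so this markup would be added to the page at the moment the network enters the degraded state rather than being present from the start.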