Risk and Failure Analysis

Overview

The Risk and Failure Analysis framework (T7) handles the case where the question is specifically how something could fail — a plan, a system, a design, a strategy. The user is not exploring the future generally (that is T6), not modeling an adversarial actor trying to defeat the artifact (that is T15 Red Team), and not investigating a failure that has already happened (that is T4 Causal Investigation). T7 takes a current artifact and examines its structural fragilities, vulnerabilities, and tail risks before they fire. The framework’s distinctive discipline is the refusal of generic system-failure tropes; the fragilities surfaced must be specific to this structure’s components and dependencies, not boilerplate language about single points of failure.

The framework runs two primary modes. Pre-Mortem Fragility operates in adversarial-future stance on a system, design, or institution. (It was parsed from the legacy Pre-Mortem mode per Decision D; the action-plan variant, pre-mortem-action, lives in T6.) The mode adopts prospective hindsight: it writes as though the breakage has already occurred and narrates it backward. It identifies structural fragilities specific to this structure’s components, traces load pathways from operating-envelope stresses to the specific structural elements that yield, identifies leading indicators for each fragility, and proposes structural mitigations, which it distinguishes from operational workarounds because fragility is a property of the structure rather than of its operation. Fragility / Antifragility Audit is the thorough Talebian mode. It classifies the system per Taleb’s three-class framework: fragile (concave response to a stressor, where losses dominate), robust (linear response, where the system survives but does not gain), or antifragile (convex response, where the system gains from volatility within a range). The audit identifies convex exposures, surfaces concave exposures explicitly (where small frequent gains hide rare catastrophic losses), distinguishes variance under normal conditions from tail-event response, applies via negativa (subtracting fragility-creating elements rather than adding robustness-creating compensating mechanisms), and holds the analyst’s own Talebian assumptions lightly rather than applying them mechanically.

The framework’s load-bearing intellectual content is the prospective hindsight stance, the structural-vs-operational distinction, the convex-vs-concave classification, and the via negativa discipline. Prospective hindsight is the methodological move behind Klein’s pre-mortem: adopting the stance that the breakage has already occurred and narrating it backward defeats the optimism bias that hedged forward projection inherits. The structural-vs-operational distinction refuses to treat an operational workaround (a temporary fix that does not change the underlying structure) as a remedy for fragility (a property of the structure); a structural fragility addressed only operationally remains a structural fragility. The convex-vs-concave classification is Taleb’s central methodological contribution: the three-class system (fragile / robust / antifragile) is response-based, not forecast-based. The diagnostic question is not “what will happen” but “what does the system look like after a stressor relative to before.” The via negativa discipline holds that when the goal is robustness, subtracting fragility-creating elements often beats adding robustness-creating compensating mechanisms, because the compensating mechanism may itself be fragile or introduce concave exposure not previously present.
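The response-based classification can be illustrated with a second-difference test: compare the outcome of two symmetric shocks against the outcome at the midpoint. This is a minimal sketch, not part of the framework specification. The function name, the tolerance, and the three response curves are hypothetical and chosen only to show the three classes.

```python
from typing import Callable


def classify_response(outcome: Callable[[float], float],
                      stress: float, delta: float, tol: float = 1e-9) -> str:
    """Classify a response curve around `stress` by the sign of its second difference."""
    # Sum of the outcomes under a larger and a smaller shock, minus twice the midpoint.
    second_diff = (outcome(stress + delta) + outcome(stress - delta)
                   - 2.0 * outcome(stress))
    if second_diff < -tol:
        return "fragile"       # concave: losses accelerate with stress
    if second_diff > tol:
        return "antifragile"   # convex: gains accelerate with stress
    return "robust"            # approximately linear: no acceleration either way


# Hypothetical response curves (assumptions for illustration only).
def accelerating_loss(load: float) -> float:
    return -(load ** 2)


def proportional_loss(load: float) -> float:
    return -2.0 * load


def accelerating_gain(load: float) -> float:
    return max(0.0, load - 1.0) ** 2


print(classify_response(accelerating_loss, 1.0, 0.5))   # fragile
print(classify_response(proportional_loss, 1.0, 0.5))   # robust
print(classify_response(accelerating_gain, 2.0, 0.5))   # antifragile
```

The test asks nothing about how likely the stressor is, which is the sense in which the classification is response-based rather than forecast-based.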

The framework resists four patterns. Stance slippage, where prospective hindsight drifts into hedged forward projection, is counteracted by the imagined-breakage-narrative requirement. Generic fragility tropes, named without structural specificity, are counteracted by the structural-specificity requirement. Antifragility collapse, where antifragility is read as merely robustness, is counteracted by the convex-response discipline. Talebian orthodoxy, where the analyst’s priors are applied mechanically, is counteracted by the lightness-of-assumption discipline.

The framework answers questions like: Where could this system design break? Before we ship this strategy, what fragilities exist? I want to know how this responds to volatility, not just whether it survives normal conditions. What’s the asymmetric exposure I’m not seeing? If this fails, where would the failure most likely be?

Systemic context

Risk and Failure Analysis sits in the decision-future-and-risk cluster of the Ora analytical-territory architecture. T7 is structurally distinct from T6 (Future Exploration), which operates more broadly than failure; from T15 (Red Team), which models an adversarial actor trying to defeat the artifact rather than asking where it could break under any pressure; from T4 (Causal Investigation), which engages with failures that have already happened; and from T18 (Strategic Interaction), which handles failures of strategic equilibrium rather than structural fragility. The Pre-Mortem operation is parsed across T6 and T7 per Decision D: the two parsed siblings (pre-mortem-action in T6; pre-mortem-fragility in T7) share the klein-pre-mortem lens but differ in object (action plan vs. system/design). T7 hooks sideways to T15 when an adversary is genuinely in the picture, to T4 when failure has already occurred, and to T18 when the failure is one of strategic interaction.
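The territory boundaries described above can be sketched as a small routing function. This is an illustration only: the `Question` fields, the returned labels, and in particular the order in which the checks run are assumptions, since the text names the distinctions but does not specify a precedence.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical summary of a user's question (field names are assumptions)."""
    about_failure: bool = True
    failure_already_happened: bool = False
    adversary_present: bool = False
    strategic_interaction: bool = False
    object_kind: str = "system"  # "system" or "action_plan"


def route(q: Question) -> str:
    """Pick a territory; the check order is an assumed precedence, not a documented one."""
    if q.failure_already_happened:
        return "T4 Causal Investigation"
    if q.adversary_present:
        return "T15 Red Team"
    if q.strategic_interaction:
        return "T18 Strategic Interaction"
    if not q.about_failure:
        return "T6 Future Exploration"
    if q.object_kind == "action_plan":
        return "T6 pre-mortem-action"
    return "T7 Risk and Failure Analysis"


print(route(Question()))                           # T7 Risk and Failure Analysis
print(route(Question(object_kind="action_plan")))  # T6 pre-mortem-action
print(route(Question(adversary_present=True)))     # T15 Red Team
```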

Ora’s output

Example question: I’m about to ship a new architecture for our notification system. It’s a microservices design with a Redis-backed queue, three worker pools, and a Postgres backing store. Before we ship, I want a fragility/antifragility audit. Where could this break? What’s my concave exposure?

Ora’s response (running fragility-antifragility-audit):

The seed names a system (notification architecture), a question shape (fragility/antifragility audit), and an explicit ask for concave exposure identification. T7 mode is fragility-antifragility-audit. I’ll work through the nine required sections.

System or strategy locked. Notification system: microservices design with Redis-backed queue (the message broker between producers and worker pools), three worker pools (one per notification class — transactional, marketing, system-alerts), Postgres backing store (notification log, retry state, deduplication keys). Operating envelope: typical load 10K notifications/minute, peak 50K, with daily marketing-campaign spikes that can reach 200K over 5 minutes.

Stressor inventory. Small frequent: Redis connection drops; transient Postgres latency; worker restarts; network partitions. Rare large: Redis primary failure; Postgres failover; full-region cloud outage; coordinated marketing spike beyond peak. Regime changes: notification volume growth; regulatory shifts requiring synchronous delivery; provider-side rate-limit changes. Slow drift: Redis memory pressure; Postgres index bloat; worker memory leaks.

Convex exposures identified. None. There is no mechanism by which notification-system stress produces capability gains. Normal for operational infrastructure; antifragility is more commonly available in adaptive systems than in delivery infrastructure.

Concave exposures identified (where small frequent gains hide rare catastrophic losses). The load-bearing finding.

CE1 — Redis as single substrate for all queue traffic. Small frequent gain: simple architecture, low latency. Rare catastrophic loss: Redis primary failure produces complete notification stoppage across all three classes simultaneously. Any deduplication keys held in Redis are also lost on primary failure if persistence is not configured. Concavity: weeks of clean operation hide a single-event total stoppage.

CE2 — Postgres as backing store for all retry state. Rare catastrophic loss: Postgres failover that takes longer than worker pool retry timeout produces failed-notification cascade. Years of clean failovers hide one bad failover.
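CE2 turns on a comparison between two durations: how long a failover takes and how long the workers keep retrying. A minimal sketch of that comparison follows, assuming exponential backoff. The backoff policy and all numbers are hypothetical, since the text does not state the workers' retry settings.

```python
def retry_window_seconds(base_delay: float, factor: float, attempts: int) -> float:
    """Total time covered by `attempts` retries with exponential backoff."""
    return sum(base_delay * factor ** i for i in range(attempts))


def survives_failover(failover_seconds: float, base_delay: float,
                      factor: float, attempts: int) -> bool:
    """True if the retry window is at least as long as the failover."""
    return retry_window_seconds(base_delay, factor, attempts) >= failover_seconds


# Hypothetical policy: 5 retries starting at 1 s and doubling (1 + 2 + 4 + 8 + 16 = 31 s).
print(retry_window_seconds(1.0, 2.0, 5))      # 31.0
print(survives_failover(20.0, 1.0, 2.0, 5))   # True: a typical failover fits
print(survives_failover(90.0, 1.0, 2.0, 5))   # False: one slow failover exceeds the window
```

Under these assumed numbers every failover shorter than 31 seconds looks clean, which is how a long run of good failovers can hide the one that cascades.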

CE3 — Marketing-campaign spike as quasi-DDoS. Rare catastrophic loss: a coordinated campaign exceeding the 200K/5min envelope by an order of magnitude saturates Redis memory, blocks transactional notifications behind marketing notifications in the shared queue, and produces a transactional-delivery outage. Every successful campaign reinforces the assumption that the envelope is sufficient.

CE4 — Provider-side rate-limit changes. Rare catastrophic loss: a provider unilaterally tightens rate limits; worker pools queue faster than they drain; Redis memory pressure builds; the queue substrate fails. Years of stable rate limits hide the dependency on a unilateral decision the system does not control.
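The load pathway in CE4 is simple arithmetic: once arrivals exceed the drain rate, the backlog grows linearly until the queue substrate runs out of room. The sketch below uses the 10K notifications/minute typical load from the system description. The post-change drain rate and the queue capacity are hypothetical numbers chosen for illustration.

```python
from typing import Optional


def minutes_until_queue_full(arrivals_per_min: float, drained_per_min: float,
                             backlog: float, capacity: float) -> Optional[float]:
    """Minutes until the backlog reaches capacity, or None if the workers keep up."""
    net_growth = arrivals_per_min - drained_per_min
    if net_growth <= 0:
        return None
    return (capacity - backlog) / net_growth


# Assumed: the provider tightens limits so workers drain 4,000/min,
# and Redis can hold roughly 600,000 queued messages.
print(minutes_until_queue_full(10_000, 4_000, 0, 600_000))   # 100.0
print(minutes_until_queue_full(10_000, 12_000, 0, 600_000))  # None
```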

Fragility / robustness / antifragility classification.

Fragile. CE1–CE4 each represent fragile exposures — concave response to specific stressor classes, loss tail underweighted by recent operational success.

Robust. The three-worker-pool design provides robustness against load-class-specific failures; a marketing-pool crash does not stop transactional notifications. Postgres deduplication and retry-state design provides robustness against worker-pool-level errors. Linear response — survives without gain.

Antifragile. None. The system does not improve from stress.

Tail-risk assessment. Dominant tail risks: CE1 (Redis primary failure during a marketing spike) and CE3 (marketing-envelope breach affecting transactional notifications). Both share the structural feature that the operations history would not predict the failure: clean operation makes the fragility invisible until the specific stressor arrives.

Asymmetric-payoff findings. The small-frequent-gain profile (low latency, simple operations, easy debugging) is paid for by concave exposures. The team wants clean operations with rare predictable failures; what they have is clean operations with rare catastrophic failures.
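The asymmetry can be made concrete with a two-outcome model: a small gain on most days and a large loss on rare days. This is a sketch under invented numbers. The gain, loss, and probability below are hypothetical and not measurements of the notification system.

```python
def expected_daily_payoff(gain: float, loss: float, p_loss: float) -> float:
    """Expected payoff per day when a loss of `loss` occurs with probability `p_loss`."""
    return (1.0 - p_loss) * gain - p_loss * loss


def prob_clean_streak(p_loss: float, days: int) -> float:
    """Probability of observing no loss at all across `days` independent days."""
    return (1.0 - p_loss) ** days


# Assumed: gain of 1 unit per clean day, loss of 2,000 units on a bad day, 0.1% daily chance.
print(expected_daily_payoff(1.0, 2000.0, 0.001))  # negative on average
print(prob_clean_streak(0.001, 365))              # roughly 0.69
```

With these assumed values the exposure loses money in expectation, yet a full year with no loss at all is the more likely observation, so the operating record alone would suggest the profile is sound.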

Via negativa recommendations (subtraction-first).

V1. Subtract the shared queue between marketing and transactional notifications. The robustness gain is not to add load-shedding logic on top of sharing; it is to remove the sharing so spikes structurally cannot affect transactional latency.
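The difference between load-shedding on a shared queue and removing the sharing can be seen in a toy FIFO model of CE3. This is a sketch with hypothetical numbers, where one "tick" is an arbitrary unit of worker time. It is not a model of Redis itself.

```python
from collections import deque


def transactional_wait_shared(marketing_backlog: int, drain_per_tick: int) -> int:
    """Ticks until a transactional message queued behind a marketing backlog is served."""
    queue = deque(["marketing"] * marketing_backlog + ["transactional"])
    ticks = 0
    while queue:
        ticks += 1
        for _ in range(min(drain_per_tick, len(queue))):
            if queue.popleft() == "transactional":
                return ticks
    return ticks


def transactional_wait_separate(marketing_backlog: int, drain_per_tick: int) -> int:
    """Same message on its own queue; `marketing_backlog` is deliberately unused."""
    return transactional_wait_shared(0, drain_per_tick)


print(transactional_wait_shared(1_000, 100))    # 11
print(transactional_wait_separate(1_000, 100))  # 1
```

In the separate-queue version the marketing backlog does not appear in the computation at all, which is the structural point of V1: there is no parameter left to tune or get wrong.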

V2. Subtract the assumption that deduplication state must live in Redis. Move deduplication state to Postgres (already in use) and let Redis be a pure ephemeral queue. This removes the deduplication-key loss named in CE1 as a failure mode.
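One way deduplication in a relational store could look is an insert that is ignored when the key already exists. The sketch below uses Python's built-in sqlite3 as a stand-in so it runs anywhere; in Postgres the comparable statement is `INSERT ... ON CONFLICT DO NOTHING`. The table name, column name, and key format are hypothetical.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """In-memory stand-in for the backing store (assumption: one table of seen keys)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dedup_keys (key TEXT PRIMARY KEY)")
    return conn


def claim(conn: sqlite3.Connection, key: str) -> bool:
    """Return True the first time `key` is seen, False for any duplicate."""
    cur = conn.execute("INSERT OR IGNORE INTO dedup_keys (key) VALUES (?)", (key,))
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1


store = make_store()
print(claim(store, "order-42:shipped"))  # True: first delivery attempt proceeds
print(claim(store, "order-42:shipped"))  # False: duplicate is suppressed
```

Because the primary-key constraint does the deduplication, a queue restart loses only in-flight messages, not the record of what has already been sent.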

V3. Subtract the assumption that any single provider is the substrate by adding a second provider per notification class. (Addition rather than pure subtraction; the via-negativa discipline accepts this when the alternative is a fragile compensating mechanism.)

Confidence per finding. High on CE1 and CE2 (dependency structures explicit). Moderate on CE3 (depends on operational assumptions). Lower on CE4 (provider-side behavior is a known unknown). Classification is high confidence; via negativa recommendations are working hypotheses requiring engineering review.

That is what fragility-antifragility-audit produces. The system is classified per the three-class framework; concave exposures are surfaced explicitly; via negativa recommendations are subtraction-first; the framework’s own Talebian priors are visible but not mechanically applied.

How to use this framework

You can run the Risk and Failure Analysis pattern with any AI of your choice. The composition is single-pass for either mode.

The prompt:

[Paste the framework specification]

Run [pre-mortem-fragility / fragility-antifragility-audit].

System or design: [Plain-language description; include the architecture, the dependencies, the operating envelope.]

Operational history (optional): [What has run cleanly; what has stressed the system; what surprised you.]

Specific concern (optional): [If you have a particular fragility hypothesis, declare it up front so the analysis can either confirm or surface alternatives.]
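The prompt template above can be assembled programmatically. The helper below is a hypothetical convenience, not part of the framework: the function name and argument names are assumptions, and it only concatenates the fields in the order the template lists them.

```python
MODES = ("pre-mortem-fragility", "fragility-antifragility-audit")


def build_prompt(spec: str, mode: str, system: str,
                 history: str = "", concern: str = "") -> str:
    """Assemble the prompt; optional fields are omitted when left empty."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    parts = [spec, f"Run {mode}.", f"System or design: {system}"]
    if history:
        parts.append(f"Operational history: {history}")
    if concern:
        parts.append(f"Specific concern: {concern}")
    return "\n\n".join(parts)


prompt = build_prompt(
    spec="[Paste the framework specification]",
    mode="fragility-antifragility-audit",
    system="Notification system: Redis-backed queue, three worker pools, Postgres store.",
    concern="Shared queue between marketing and transactional traffic.",
)
print(prompt)
```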

The AI returns the mode-appropriate output: for pre-mortem-fragility, seven sections (imagined breakage narrative; structural fragility inventory; load pathways; leading indicators; structural mitigations; residual unmitigated fragilities; confidence); for fragility-antifragility-audit, nine sections (system locked; stressor inventory; convex exposures; concave exposures; classification; tail-risk assessment; asymmetric-payoff findings; via negativa recommendations; confidence).
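A returned analysis can be checked against the section lists above. This is a minimal sketch: the matching rule (every word of the section name must appear in some heading) is an assumption chosen so that a heading such as "System or strategy locked" still counts as "system locked", and a real response may word its headings differently.

```python
import re
from typing import Dict, List, Set

REQUIRED_SECTIONS: Dict[str, List[str]] = {
    "pre-mortem-fragility": [
        "imagined breakage narrative", "structural fragility inventory",
        "load pathways", "leading indicators", "structural mitigations",
        "residual unmitigated fragilities", "confidence",
    ],
    "fragility-antifragility-audit": [
        "system locked", "stressor inventory", "convex exposures",
        "concave exposures", "classification", "tail-risk assessment",
        "asymmetric-payoff findings", "via negativa recommendations", "confidence",
    ],
}


def _words(text: str) -> Set[str]:
    """Lowercase alphabetic words in `text`, ignoring punctuation and hyphens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def missing_sections(mode: str, headings: List[str]) -> List[str]:
    """Required sections for `mode` that no supplied heading covers."""
    heading_words = [_words(h) for h in headings]
    return [section for section in REQUIRED_SECTIONS[mode]
            if not any(_words(section) <= hw for hw in heading_words)]


headings = ["System or strategy locked.", "Stressor inventory.",
            "Convex exposures identified.", "Concave exposures identified"]
print(missing_sections("fragility-antifragility-audit", headings))
```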

For best results:

Disambiguate object — system or action plan. A pre-mortem on a system uses T7’s pre-mortem-fragility; a pre-mortem on an action plan uses T6’s pre-mortem-action. The two share the same lens but produce different output contracts.
Disambiguate stance — adversary or any pressure. If an adversary is genuinely in the picture (a competitor, an attacker, a regulator), Red Team in T15 is the right mode. T7 is the right mode when the question is structural fragility under any pressure, no adversary required.
Resist the compensating-mechanism temptation. When the framework recommends via negativa (subtract the fragility-creating element), resist the urge to add a compensating mechanism instead. Compensating mechanisms can themselves be fragile and often introduce new concave exposures.
Ask explicitly for concave exposures. Convex exposures (gains from volatility, which show up during normal operation) are easier to find than concave exposures (hidden tail risk masked by clean recent operations). The framework’s load-bearing finding is often in the concave-exposure section.

The framework is deliberately tool-agnostic. The prospective hindsight stance, the structural-vs-operational distinction, the convex-vs-concave classification, and the via negativa discipline are conceptual disciplines that survive the lift to any environment.

Other examples

Pre-Mortem Fragility on an organizational structure. A team is restructuring around a hub-and-spoke model. The framework adopts prospective hindsight, surfaces structural fragilities (single-leader bottleneck; lateral information blocked by hub bandwidth; concentrated succession risk; functional leaders compete for hub attention rather than coordinate), traces load pathways (the hub’s calendar becomes the lateral-coordination constraint), and proposes structural mitigations (lateral coordination forums) distinguished from operational workarounds (the hub working longer hours).
Fragility/Antifragility Audit on a personal financial portfolio. A portfolio that has performed well in a low-volatility regime. Fragile exposures (concentrated equity; leverage; unhedged currency); robust elements (cash, treasuries); potentially antifragile elements (long-volatility positions; capped-downside optionality). Concave exposures surfaced (steady equity appreciation hides tail risk); via negativa recommendations (subtract leverage; subtract currency exposure where not load-bearing).
Pre-Mortem Fragility feeding into Decision Clarity Analysis. A user is considering a major infrastructure decision. T7 surfaces the structural fragilities of each candidate; DCA then handles the wicked-problem multi-stakeholder tradeoffs. This is the canonical T7-then-DCA sequence when the decision involves both structural risk and stakeholder values.

Citations

The Risk and Failure Analysis framework draws on three convergent traditions. Klein’s “Performing a Project Premortem” (2007) supplies the prospective hindsight method — imagine the system has broken at some future point; narrate the breakage backward. Klein’s Sources of Power (1998) provides the broader naturalistic-decision-making context. Reason’s Human Error (1990) and the Swiss-cheese model supply the defensive-layer failure analysis used optionally when fragility crosses multiple layers. Perrow’s Normal Accidents (1984) and Petroski’s To Engineer Is Human (1985) supply the failure-engineering tradition that treats failure as informative rather than as deviation from a perfect plan.

Taleb’s The Black Swan (2007), Antifragile (2012), and Skin in the Game (2018) supply the convex/concave/antifragile vocabulary, the via negativa discipline, the barbell strategy, and the Lindy effect. Mandelbrot and Hudson’s The (Mis)behavior of Markets (2004) supplies the fat-tailed-distribution underpinning. The framework holds Taleb’s priors lightly rather than mechanically — markets-are-fat-tailed and expert-prediction-is-poor are working assumptions that have evidentiary support but are not facts; the lightness-of-assumption discipline is the framework’s own contribution.

The Pre-Mortem parse (Decision D, 2026-05-01 architecture lock) split the legacy Pre-Mortem mode into two parsed siblings — pre-mortem-action in T6 (operating on the action plan) and pre-mortem-fragility in T7 (operating on the system or design). Both share the klein-pre-mortem lens but produce different output contracts; routing distinguishes by what is being examined. The framework is currently at v1.0 (compiled 2026-05-01) with two resident modes (pre-mortem-fragility atomic; fragility-antifragility-audit atomic). The Failure Mode Scan and Fault Tree modes are deferred per CR-6.
