Hypothesis Evaluation

Overview

The Hypothesis Evaluation Framework (T5 in Ora’s territory map) is the framework for adjudicating among multiple competing explanations of a body of evidence. It does not test single hypotheses (that is causal investigation territory); it does not adjudicate inter-paradigm disagreement where the dispute is really about how to frame the issue (that is paradigm examination territory); it does not choose between action alternatives (that is decision-making territory). It is the framework that activates when there are two or more plausible explanations on the table and the question is which fits the evidence best.

The framework runs in three modes.

Differential-diagnosis is the lightest mode: a quick weigh-in among 2–5 candidate explanations, organized around diagnosticity (what would distinguish them) rather than surface plausibility (which sounds most likely). The mode names disconfirming tests per top candidate so the user can act to narrow further; it deliberately includes rare-but-serious “zebra” candidates so common-case explanations do not eclipse them.

Competing-hypotheses is the canonical Heuer mode, Analysis of Competing Hypotheses (ACH). It is a thorough analysis with at least three hypotheses (including at least one analyst-generated alternative beyond the user’s set, plus a null/“something else” hypothesis), at least three evidence items with credibility and relevance ratings, and the evidence-by-hypothesis matrix populated cell-by-cell using Heuer vocabulary (CC = strongly consistent, C = consistent, N = neutral, I = inconsistent, II = strongly inconsistent, NA = not applicable).

Bayesian-hypothesis-network is the molecular composition that builds an explicit Bayesian network with hypotheses as nodes carrying priors, evidence-items as nodes with likelihoods, and conditional dependencies between hypotheses named explicitly.
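As a rough illustration of the competing-hypotheses minimums described above, the cell vocabulary and the entry requirements can be written down as plain data types. This is a minimal sketch, not part of the framework specification: the class names, field names, and the `check_minimums` helper are hypothetical, and the credibility and relevance scales are left as free-form strings because the text does not fix them.

```python
from dataclasses import dataclass
from enum import Enum


class Cell(Enum):
    """Heuer cell vocabulary as listed in the text."""
    CC = "strongly consistent"
    C = "consistent"
    N = "neutral"
    I = "inconsistent"
    II = "strongly inconsistent"
    NA = "not applicable"


@dataclass(frozen=True)
class Hypothesis:
    id: str                          # stable ID, e.g. "H1"
    text: str
    analyst_generated: bool = False  # generated beyond the user's own set
    is_null: bool = False            # the null / "something else" hypothesis


@dataclass(frozen=True)
class Evidence:
    id: str            # stable ID, e.g. "E1"
    text: str
    credibility: str   # rating scale is an assumption, e.g. "high" / "low"
    relevance: str     # rating scale is an assumption


def check_minimums(hypotheses, evidence):
    """Return the unmet competing-hypotheses minimums (empty list if all are met)."""
    problems = []
    if len(hypotheses) < 3:
        problems.append("need at least three hypotheses")
    if not any(h.analyst_generated for h in hypotheses):
        problems.append("need at least one analyst-generated alternative")
    if not any(h.is_null for h in hypotheses):
        problems.append("need a null / 'something else' hypothesis")
    if len(evidence) < 3:
        problems.append("need at least three evidence items")
    return problems
```

An empty list from `check_minimums` means the inputs are large enough to populate a matrix; it says nothing about the quality of the hypotheses or the evidence.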

The framework’s load-bearing intellectual content is the work-across-the-matrix discipline, the conclusion-by-elimination rule, the diagnosticity-vs-consistency distinction, and the deception scan for adversarial contexts. The work-across-the-matrix discipline is Heuer’s central reversal. The human tendency is to work down the matrix: collect evidence for the favoured hypothesis, accumulate confirmations, ignore evidence that doesn’t fit. ACH inverts this: for each evidence item, ask what it implies for every hypothesis. The conclusion-by-elimination rule says the surviving hypothesis is the one with the fewest I+II cells (ties broken by II count); the conclusion comes from eliminating the least-consistent hypotheses, not from confirming the favoured one. The diagnosticity-vs-consistency distinction says evidence that rules out a hypothesis is high-diagnostic, while evidence that is merely consistent with a hypothesis is low-diagnostic because it may be equally consistent with the rivals; treating consistency as confirmation is a central failure mode.
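The conclusion-by-elimination rule and the diagnosticity test can be sketched in a few lines. The matrix below is invented for illustration, the function names are hypothetical, and the tie-break is read here as "fewer II cells wins", which is an interpretation of the rule rather than something the text spells out.

```python
from collections import Counter

# Hypothetical matrix: rows are evidence items, columns are hypotheses,
# cells use the Heuer vocabulary (CC, C, N, I, II, NA).
MATRIX = {
    "E1": {"H1": "C", "H2": "I", "H3": "II"},
    "E2": {"H1": "I", "H2": "C", "H3": "I"},
    "E3": {"H1": "N", "H2": "N", "H3": "N"},
    "E4": {"H1": "II", "H2": "I", "H3": "II"},
}


def inconsistency_tally(matrix):
    """Work across the matrix and return {hypothesis: (I + II count, II count)}."""
    counts = {}
    for row in matrix.values():
        for hyp, cell in row.items():
            counts.setdefault(hyp, Counter())[cell] += 1
    return {h: (c["I"] + c["II"], c["II"]) for h, c in counts.items()}


def rank_by_elimination(matrix):
    """Order hypotheses from least to most inconsistent.

    Sort key is (I + II total, II count); treating a lower II count as the
    tie-break winner is an assumption of this sketch.
    """
    tally = inconsistency_tally(matrix)
    return sorted(tally, key=lambda h: tally[h])


def is_diagnostic(row):
    """An evidence item discriminates only if its applicable cells differ."""
    ratings = {cell for cell in row.values() if cell != "NA"}
    return len(ratings) > 1


print(rank_by_elimination(MATRIX))   # ['H2', 'H1', 'H3']
print(is_diagnostic(MATRIX["E3"]))   # False: E3 rates every hypothesis N
```

H1 and H2 both have two I+II cells; H2 leads because none of its cells is II. E3 contributes nothing to the ranking, which is the point of the diagnosticity-vs-consistency distinction: an item that rates every hypothesis the same cannot distinguish among them.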

The framework is honest about what makes the matrix work. The matrix is the load-bearing structure: rows are evidence (with stable IDs E1, E2…), columns are hypotheses (with stable IDs H1, H2…), and cells use Heuer vocabulary. Custom vocabulary (“supports,” “refutes,” “weakly indicates”) gets converted to Heuer vocabulary so that the prose verdict can be checked by recounting the cells. The arithmetic must match the prose; if it doesn’t, recount or revise the prose. NA is explicit (use it when evidence does not bear on a hypothesis); never leave cells blank. The discipline is unglamorous, but the matrix’s value depends on it.
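The matrix hygiene rules above (convert custom vocabulary, no blank cells, arithmetic matches prose) lend themselves to a mechanical check. The sketch below is one way to do it under stated assumptions: the text says custom terms get converted but gives no conversion table, so `CUSTOM_TO_HEUER` is a hypothetical mapping, and all function names are illustrative.

```python
HEUER = {"CC", "C", "N", "I", "II", "NA"}

# Hypothetical conversion table; the source does not specify one.
CUSTOM_TO_HEUER = {
    "supports": "C",
    "weakly indicates": "C",
    "refutes": "II",
}


def normalize_cell(raw):
    """Map one raw cell to Heuer vocabulary; reject blanks and unknown terms."""
    if raw is None or not raw.strip():
        raise ValueError("blank cell: write NA explicitly")
    token = raw.strip()
    if token.upper() in HEUER:
        return token.upper()
    mapped = CUSTOM_TO_HEUER.get(token.lower())
    if mapped is None:
        raise ValueError(f"unknown cell value: {raw!r}")
    return mapped


def normalize_matrix(matrix, hypothesis_ids):
    """Return a cleaned copy in which every evidence row has a cell per hypothesis."""
    clean = {}
    for e_id, row in matrix.items():
        missing = [h for h in hypothesis_ids if h not in row]
        if missing:
            raise ValueError(f"{e_id} has no cell for {missing}")
        clean[e_id] = {h: normalize_cell(row[h]) for h in hypothesis_ids}
    return clean


def tally_matches_prose(clean, hypothesis_id, claimed_inconsistent):
    """Recount I + II cells for one hypothesis and compare with the prose claim."""
    actual = sum(row[hypothesis_id] in ("I", "II") for row in clean.values())
    return actual == claimed_inconsistent
```

If `tally_matches_prose` returns False, the two options are the ones the text names: recount the cells or revise the prose.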

The framework includes a deception scan for contexts where adversarial actors are plausible. Where high-diagnosticity evidence could have been manufactured, the analysis assesses whether it was — the central failure mode of intelligence-style ACH is rewarding the deceiver who plants high-diagnostic evidence supporting their preferred hypothesis. The deception check is not paranoid; it is calibrated to whether the context contains parties with motive and means to manufacture evidence.
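One possible reading of the deception scan is a filter over the evidence list: flag an item only when it is diagnostic, it points toward some party’s preferred hypothesis, and that party has the means to manufacture it. The sketch below encodes that reading. The record fields and the `deception_flags` function are hypothetical, and real motive-and-means judgments are qualitative rather than boolean.

```python
from dataclasses import dataclass


@dataclass
class EvidenceItem:
    id: str
    diagnostic: bool   # does the item discriminate among hypotheses?
    favours: str       # ID of the hypothesis the item points toward


@dataclass
class Party:
    name: str
    preferred_hypothesis: str   # motive: the explanation this party wants accepted
    can_fabricate: set          # means: evidence IDs this party could manufacture


def deception_flags(evidence, parties):
    """Return (evidence_id, party_name) pairs that merit a manufacture check."""
    flags = []
    for item in evidence:
        if not item.diagnostic:
            continue  # low-diagnostic evidence is not worth planting
        for party in parties:
            has_motive = party.preferred_hypothesis == item.favours
            has_means = item.id in party.can_fabricate
            if has_motive and has_means:
                flags.append((item.id, party.name))
    return flags
```

With an empty `parties` list the function flags nothing, which mirrors the calibration point: the scan fires only when the context contains someone with both motive and means.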

The framework answers questions like: “I have three candidate explanations for this and I want to weigh them quickly.” “Multiple plausible explanations for the same evidence: which fits best?” “I have a favoured theory but I want it stress-tested against alternatives.” “The evidence is ambiguous and I want a probabilistic read with priors.” “The hypotheses depend on each other and I need to see how priors propagate through the network.”

Systemic context

Hypothesis Evaluation is the explanation-adjudication territory of Ora’s territory map (T5, super-cluster B — Causation, Hypothesis, and Mechanism). It sits adjacent to T1 (Argumentative Artifact, where each hypothesis might be a complete argument needing audit), T4 (Causal Investigation, where the hypotheses are causal hypotheses about a single historical event), and T9 (Paradigm Examination, where the disagreement is really about how to frame the issue rather than which hypothesis fits the evidence). Cross-territory dispatch from T5 typically routes to T1 when each hypothesis is itself a complete argument-as-artifact (audit each), to T4’s process-tracing when competing hypotheses are causal hypotheses about a single historical event (process-tracing uses Bennett-Checkel evidence-tests as its diagnosticity apparatus), and to T9 when the doctor-vs-homeopath case applies (different paradigms entirely, not within-frame hypothesis weighing). The bayesian-hypothesis-network mode composes both differential-diagnosis (as a breadth-seed fragment) and competing-hypotheses (full) and synthesizes via a Bayesian network construction stage.

Ora’s output

Example question: A user reports that their meditation app stopped sending notifications about a week ago. I have three candidate explanations: (1) the app’s notification permission got revoked by an iOS update; (2) the user accidentally disabled notifications themselves; (3) a server-side bug in our notification scheduler is dropping their account specifically. Help me weigh these.

Ora’s response (running Hypothesis Evaluation in differential-diagnosis mode):

The question fits the differential-diagnosis frame — three candidate hypotheses, one body of evidence (a single user’s report), time-pressured (you presumably want to act on this within the day, not run a multi-day investigation). I’ll work the case with the four-step differential-diagnosis schema, with one zebra added.

Step 1 — Candidate hypotheses with one-line characterizations.

H1 (iOS permission revoked). An iOS update or system reset revoked the notification permission for the app; the app cannot deliver notifications until the user re-grants it. Common after iOS major-version updates.

H2 (user-side disable). The user disabled notifications themselves — either explicitly in iOS settings, by tapping “don’t notify me” on a notification, or via the app’s own preferences. Common when notifications became annoying enough that the user silenced them and forgot.

H3 (server-side scheduler bug). The notification scheduler on your backend has a bug that is dropping this specific user’s notifications. Common when sharded scheduling code has user-specific edge cases.

H4 (zebra — silent-mode + Do Not Disturb interaction). The user has Do Not Disturb scheduled for hours that overlap your scheduled notification windows; notifications are being delivered but routed to the silent-mode notification center where the user does not see them. Less common but worth surfacing because it produces the same surface symptom.

Step 2 — Evidence observed and what each piece implies.

E1. “Stopped about a week ago.” → Discrete change point. H1 likely (iOS update timing); H2 possible; H3 possible (deploy timing); H4 weaker.

E2. No other users have reported the same problem. → Cause is user-specific. H1 less likely; H2 likely; H3 possible if sharded; H4 likely.

E3. The user can still open the app normally. → App otherwise functional. Doesn’t discriminate.

Step 3 — Diagnosticity assessment for top two. Top two by surface plausibility: H2 (E2’s user-specificity) and H1 (E1’s change point). The iOS-settings notification-permission state rules out H1 if notifications are still permitted; combined with checking the app’s own preferences, the two checks decisively rule H1 and H2 in or out.

Step 4 — One disconfirming test per top candidate. For H2: check the app’s own notification preferences. For H1: check iOS Settings → Notifications → [your app] — if permitted there, H1 is ruled out. Zebra check (H4): ask whether the user has Do Not Disturb scheduled overlapping your typical notification windows. Single question; if yes, the diagnosis collapses to H4 directly.

Ranking with confidence.

H2 (user-side disable) is most likely given E2’s user-specificity, but the ranking is fragile because E1’s discrete change point is more consistent with H1. H1 is second-most-likely. H3 is least-likely (would typically affect more than one user, and you would have other reports). H4 is unlikely but worth ruling out cheaply.

Confidence: low-to-medium. The evidence base is thin (one user, three observations); a confident ranking is not warranted. The two disconfirming tests plus the single zebra check resolve the ranking decisively at the cost of three user-side observations. Run them before acting.

That is what differential-diagnosis produces: four candidate hypotheses including one zebra, three evidence items with what-each-implies analysis, diagnosticity assessment for the top two, one disconfirming test per top candidate, and an explicit zebra check. The ranking is honest about its sparseness; the action recommendation is to gather the three observations rather than to act on the current ranking. The conclusion is elimination of least-consistent on observation, not confirmation of favoured by current evidence.

Commercial AI comparison

Comparison content auto-populates when the comparison-refresh framework runs against this question. Drafters do not author this section.

Brief comparison commentary

Auto-populates with the comparison content above.

How to use this framework

You can run the Hypothesis Evaluation pattern with any AI of your choice. The composition is single-pass for any of the three modes.

The prompt:

[Paste the framework specification]

Run Hypothesis Evaluation on this question.

Question: [The thing you want explained.]

Mode (optional): [Differential-diagnosis / competing-hypotheses / bayesian-hypothesis-network. If not specified, the framework infers from the depth requested.]

Candidate hypotheses (optional): [Any hypotheses you already have. The framework will generate at least one analyst-generated alternative beyond your set in any case.]

Evidence: [What you’ve observed.]

The AI runs the within-territory disambiguation first if the mode wasn’t specified: Q1 (depth) asks whether you want a “quick,” “systematic,” or “probabilistic with priors” read, and the answer routes to the appropriate mode. The output is mode-shaped: a six-section ranking with disconfirming tests for differential-diagnosis, a populated matrix with sensitivity analysis and monitoring priorities for competing-hypotheses, and a Bayesian network with posterior distribution and sensitivity ranking for bayesian-hypothesis-network.
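The routing step can be pictured as a small lookup. In this sketch the pairing of depth cues with modes (quick, systematic, probabilistic with priors, in the same order as the three modes) is inferred from the mode descriptions rather than stated outright, and `route_mode` is a hypothetical name.

```python
# Depth cue -> mode. The pairing is inferred from the text, not quoted from it.
DEPTH_TO_MODE = {
    "quick": "differential-diagnosis",
    "systematic": "competing-hypotheses",
    "probabilistic with priors": "bayesian-hypothesis-network",
}

# Output shape per mode, as described in the text.
OUTPUT_SHAPE = {
    "differential-diagnosis": "six-section ranking with disconfirming tests",
    "competing-hypotheses": "populated matrix, sensitivity analysis, monitoring priorities",
    "bayesian-hypothesis-network": "Bayesian network, posterior distribution, sensitivity ranking",
}


def route_mode(mode=None, depth=None):
    """Return the mode to run: an explicit mode wins, otherwise route on depth."""
    if mode is not None:
        key = mode.strip().lower()
        if key not in OUTPUT_SHAPE:
            raise ValueError(f"unknown mode: {mode!r}")
        return key
    if depth is None:
        raise ValueError("mode not specified: ask Q1 (depth) before routing")
    key = depth.strip().lower()
    if key not in DEPTH_TO_MODE:
        raise ValueError(f"unknown depth cue: {depth!r}")
    return DEPTH_TO_MODE[key]


chosen = route_mode(depth="quick")
print(chosen, "->", OUTPUT_SHAPE[chosen])
```

A real disambiguation step reads the depth from free-form wording rather than an exact keyword; the lookup only shows where that judgment feeds in.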

For best results:

Provide the evidence honestly, including evidence that doesn’t fit your favoured hypothesis. The work-across-the-matrix discipline depends on the evidence base being honest. If you suppress evidence that disconfirms your favoured hypothesis, the matrix will reproduce the suppression and the ranking will be wrong.
Let the framework generate at least one hypothesis you didn’t propose. The analyst-generated alternative is the structural defence against missing-hypothesis bias. If you constrain the framework to your candidate set only, you forfeit the defence.
In competing-hypotheses, accept the conclusion-by-elimination framing. If the matrix says H2 is the leader because it has the fewest I+II cells, do not ask the framework to rewrite the prose to confirm H1 because H1 is your favoured hypothesis. The arithmetic is the verdict; the prose serves the arithmetic.
For Bayesian mode, don’t fabricate priors. When priors cannot be elicited from defensible base-rate sources, document a flat-prior assumption rather than asserting point estimates. Fabricated priors propagate through the posterior calculation as if they were real, producing false precision.
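To make the flat-prior advice concrete, here is a minimal Bayes update over a flat prior. It is a deliberately simplified sketch, not the full network the Bayesian mode builds: it assumes the hypotheses are mutually exclusive and exhaustive and that evidence items are conditionally independent given each hypothesis. The function names and the likelihood numbers are invented for illustration.

```python
def flat_prior(hypotheses):
    """Equal probability for every hypothesis; record it as an assumption, not a finding."""
    p = 1.0 / len(hypotheses)
    return {h: p for h in hypotheses}


def posterior(prior, likelihoods):
    """Update a prior with a list of {hypothesis: P(evidence | hypothesis)} dicts.

    Assumes mutually exclusive, exhaustive hypotheses and conditionally
    independent evidence items (both are simplifications of this sketch).
    """
    unnormalized = dict(prior)
    for item in likelihoods:
        for h in unnormalized:
            unnormalized[h] *= item[h]
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence has zero likelihood under every hypothesis")
    return {h: value / total for h, value in unnormalized.items()}


prior = flat_prior(["H1", "H2", "H3"])
# Hypothetical likelihoods for a single evidence item.
e1 = {"H1": 0.8, "H2": 0.4, "H3": 0.4}
print(posterior(prior, [e1]))   # H1 ends near 0.5, H2 and H3 near 0.25 each
```

With a flat prior the posterior is driven entirely by the likelihoods, so any precision in the output traces back to evidence estimates that can be inspected, not to an invented starting point.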

The framework is deliberately tool-agnostic. The matrix vocabulary, the work-across discipline, the conclusion-by-elimination rule, and the diagnosticity-vs-consistency distinction are conceptual disciplines that survive the lift to any environment.

Other examples

competing-hypotheses on an organizational performance question. A team’s quarterly metrics declined; four hypotheses on the table (team morale; product-market mismatch; competitive entry; measurement error). The framework generates a fifth analyst-generated alternative (cumulative effect of policy changes during the quarter) and a sixth null hypothesis (“the decline is within normal quarterly variation”). Eight evidence items with credibility and relevance ratings; the matrix is populated cell-by-cell across all six hypotheses; the surviving hypothesis (fewest I+II cells) is the cumulative-policy-effect alternative the team had not generated. Sensitivity analysis identifies one evidence item whose reversal would swap the leader; monitoring priorities list three indicators to watch over the next quarter. Demonstrates the framework’s defence against missing-hypothesis bias and its commitment to elimination-framing.
bayesian-hypothesis-network on a forensic question. An investigation has three competing hypotheses about how a security breach occurred (insider threat; external compromise via phishing; supply-chain compromise via a vendor library). Priors elicited from base-rate sources (industry breach-cause statistics); evidence-likelihoods estimated per item; conditional dependencies named (a confirmed phishing artifact does not by itself lower the probability of insider threat, because the two could be conjoint rather than exclusive). Posterior distribution computed; sensitivity analysis ranks evidence by posterior-shift magnitude. The deception check fires (an insider with admin access could plant evidence supporting the external-phishing hypothesis); one evidence item’s diagnosticity is downgraded in light of the manufacture risk. Demonstrates the molecular composition’s discipline including the deception scan.
differential-diagnosis with a zebra that turns out to be the diagnosis. A product team has three candidate explanations for a performance regression and one rare-but-serious zebra (a rarely-used database index has been silently dropping queries). The standard hypotheses fail their disconfirmation tests; the zebra check (a quick query plan inspection) confirms the index issue. Demonstrates the zebra-rule’s value — common-case explanations would have eclipsed the serious-but-rare diagnosis if the framework had not deliberately surfaced it.
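The sensitivity analysis mentioned in the competing-hypotheses example (finding an evidence item whose reversal would swap the leader) can be sketched as a loop that flips one row at a time and re-runs the elimination rule. The matrix is invented, the names are hypothetical, and "reversal" is interpreted here as swapping C with I and CC with II, which is an assumption of this sketch.

```python
# Interpretation of "reversal" used in this sketch: mirror each rating.
FLIP = {"CC": "II", "C": "I", "I": "C", "II": "CC", "N": "N", "NA": "NA"}

# Hypothetical two-hypothesis matrix.
MATRIX = {
    "E1": {"H1": "CC", "H2": "II"},
    "E2": {"H1": "C", "H2": "C"},
    "E3": {"H1": "I", "H2": "C"},
    "E4": {"H1": "N", "H2": "I"},
}


def leader(matrix):
    """Hypothesis with the fewest I + II cells, ties broken by fewer II cells."""
    hypotheses = list(next(iter(matrix.values())))

    def key(h):
        cells = [row[h] for row in matrix.values()]
        return (cells.count("I") + cells.count("II"), cells.count("II"))

    return min(hypotheses, key=key)


def leader_swapping_items(matrix):
    """Return the current leader and the evidence IDs whose reversal changes it."""
    base = leader(matrix)
    swaps = []
    for e_id, row in matrix.items():
        flipped = dict(matrix)  # shallow copy; only one row is replaced
        flipped[e_id] = {h: FLIP[cell] for h, cell in row.items()}
        if leader(flipped) != base:
            swaps.append(e_id)
    return base, swaps


print(leader_swapping_items(MATRIX))   # ('H1', ['E1'])
```

Here only E1 carries the verdict: reversing any other single item leaves H1 in the lead. Items flagged this way are natural candidates for the monitoring priorities the example mentions.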

Citations

The framework draws on three source traditions. The intelligence-analysis tradition contributes the Heuer ACH methodology: Heuer’s Psychology of Intelligence Analysis (1999) and Heuer and Pherson’s Structured Analytic Techniques for Intelligence Analysis (2010) are the substrate for the work-across-the-matrix discipline, the conclusion-by-elimination rule, the cell vocabulary, and the deception scan. Elimination-framing rather than confirmation-framing is Heuer’s central methodological move; the null/“something else” hypothesis is Heuer’s defence against missing-hypothesis bias.

The medical differential-diagnosis tradition contributes the lighter mode — Sackett et al.’s Clinical Epidemiology (1991) is the substrate for the diagnosticity-over-surface-plausibility discipline and the zebra-rule. The medical tradition contributes the recognition that in time-pressured contexts with sparse evidence, organizing analysis around what would distinguish candidates produces faster decisive observation than accumulating support for the favoured.

The Bayesian tradition contributes the molecular composition — Pearl’s Probabilistic Reasoning in Intelligent Systems (1988) is the substrate for the explicit Bayesian network. The flat-prior-documentation discipline rather than fabricated point priors draws on the calibration literature (Tetlock 2015). The framework was compiled 2026-05-01 from the territory map’s T5 entry; v1.0 with PFF-conforming structure throughout.

Downloads

Framework specification (PDF) — link to ora-ai.org canonical artifact when published
Framework specification (plain text) — link to ora-ai.org canonical artifact when published
Full white paper (PDF) — link when published