Architectural Invariants of Multi-LLM Deliberation Systems in Uncertain Decision Domains

Executive Conclusion

The proposition of utilizing a multi-agent Large Language Model (LLM) deliberation layer within a hybrid quantitative trading engine is empirically sound and highly credible, provided the architecture is strictly constrained. The overarching conclusion of this investigation is that this direction is real and operationalizable, but only when the LLM council is engineered as an epistemic uncertainty extractor and feature generator, rather than a consensus-seeking directional decision-maker.

Attempts to build autonomous, end-to-end LLM trading agents that predict directional market movements generally fail due to hallucination, correlated reasoning errors, and catastrophic collapse under domain shift. In high-cost, asymmetric-loss domains, relying on LLMs for direct execution is a critical vulnerability. However, a minimal deliberation protocol provides substantial, measurable value when it is deployed to evaluate pre-existing, deterministically generated market setups, specifically to improve the rejection of fragile setups, calibrate downstream risk models, and execute structured abstention.

The strongest design intuition supported by the current literature is that structured disagreement is a feature, not a bug to be smoothed over. Simple ensembling and majority voting discard the most valuable signal an LLM council can produce: the structural depth and semantic geometry of their divergence. By measuring the structure of disagreement (e.g., evidence overlap, divergence depth, minority argument strength) and linguistic hesitation cues, a multi-agent council can reliably flag confident but incorrect responses that single models cannot detect on their own.

Therefore, the minimal useful architecture does not require large swarms of conversational agents engaging in free-form debate. Instead, it requires a highly structured, role-diverse triad operating under strict isolation protocols before any interaction occurs. The output of this deliberative layer must never be a boolean trade decision. It must be a vector of continuous confidence, hesitation, and epistemic disagreement features passed directly to a deterministic risk and execution engine. When constrained to this feature-generation role, the deliberation layer demonstrably improves decision hygiene, reduces the false-positive rate of algorithmic signals, and provides a robust mathematical framework for automated abstention.

Evidence Map

The following matrix synthesizes the empirical evidence regarding multi-agent deliberation, categorized by domain, architectural pattern, and its transferability to a high-noise, asymmetric-loss domain like quantitative finance.

Source	Domain	Architecture Pattern	Core Support / Finding	Evidence Strength & Trading Transferability
`[5]`	Logical Reasoning	Controlled Multi-Agent Debate (MAD)	Intrinsic reasoning strength and group diversity dominate success. Majority pressure suppresses independent correction.	High. Demonstrates that simple voting fails and highlights the danger of sycophancy in LLM consensus. Directly translates to trading risk where false consensus is fatal.
	Uncertainty Quantification	DiscoUQ (Disagreement Structure Analysis)	Replaces vote counting with linguistic and geometric structure of disagreement (divergence depth, evidence overlap).	Very High. Directly applicable to generating pre-trade risk features. Proves that analyzing the "weak disagreement" tier drastically improves calibration.
`[7, 8, 9]`	Forecasting (Metaculus)	AI Crowds with Independent/Shared Info	Deliberation improves forecasting log-loss by ~4% only in diverse model ensembles. Homogeneous ensembles show no benefit or degradation.	High. Forecasting is closely adjacent to trading. Proves that foundational model heterogeneity is an absolute invariant for predictive gains.
	LLM Inference Efficiency	iMAD (Intelligent MAD)	Extracts 41 linguistic/semantic hesitation cues via self-critique to selectively trigger debate, saving compute and preventing degradation.	Moderate-High. The hesitation feature extraction is directly translatable to evaluating the internal fragility of a primary trading thesis.
	Factuality & Reasoning	Wald-SPRT Compute Governor	Uses a sequential probability ratio test on judge consensus scores to halt debate dynamically. Focuses on calibration of consensus.	Moderate. The SPRT methodology is highly rigorous, but the assumption of independent and identically distributed judge scores may not hold in non-stationary financial data.
`[13, 14, 15]`	Human Decision Hygiene	Adversarial Collaboration / Devil's Advocate	Visible consensus creates informational cascades. A dedicated, protected dissenter role prevents premature convergence.	High. Provides theoretical grounding for agent persona design. LLMs heavily mimic human cognitive biases such as sycophancy and confirmation bias.
	Autonomous ML Research	Aris (Adversarial Collaboration)	Executor drives progress; Reviewer (different model family) audits raw artifacts independently. Implements fallback diagnosis.	Very High. The architecture of reviewer independence (auditing raw data, not the executor's summary) is vital to prevent hidden information leakage in financial setups.
`[16]`	Automated Judgment	LLM Judges with Adaptive Stability	Models consensus dynamics via Beta-Binomial mixture and stops adaptively using Kolmogorov-Smirnov test.	Moderate. Advanced statistical treatment of consensus, though heavily optimized for standard natural language processing reasoning benchmarks rather than time-series data.
	Logical/Math Reasoning	Memory Masking	Masks erroneous memories from previous rounds to prevent compounding hallucinations in subsequent debate rounds.	Moderate-High. Critical for multi-round debate hygiene, preventing agents from anchoring on early logical flaws as established ground truth.

Where Value Arises in Debate Systems

To architect a robust multi-agent system, one must isolate exactly where the value of deliberation originates. The literature across human organizational psychology and machine epistemics points to a singular, counter-intuitive truth: value does not arise from the eventual consensus, but from the friction generated during the process of structured disagreement. Understanding this mechanism requires examining the human baselines that LLMs mimic and the specific architectural interventions that extract signal from noise.

The Human Baseline: Cognitive Hygiene and Informational Cascades

In human decision-making arenas characterized by uncertainty and asymmetric losses—such as investment committees, intelligence analysis, and medical diagnosis—the primary failure mode is pluralistic ignorance and the formation of informational cascades. When faced with deep uncertainty, individuals rationally look to the group for behavioral and informational cues. However, once a cascade begins, individuals suppress their private, contradictory information and conform to what they perceive as the consensus. This visible consensus becomes a heuristic shortcut for truth. The resulting group polarization leads to an escalation of commitment, moral certainty, and a refusal to update beliefs even when confronted with disconfirming evidence. In financial contexts, this manifests as doubling down on failing trades or ignoring structural macro shifts because the internal committee narrative has calcified.

To combat this, human decision hygiene relies on engineering structured friction into the process. The "Devil's Advocate" approach requires assigning a protected dissenter role, explicitly tasked with surfacing disconfirming evidence, offering alternative explanations, and challenging the base rate assumptions without fear of professional punishment. Similarly, adversarial collaboration forces proponents of competing theories to mutually agree on the empirical tests that would falsify their respective models before the data is gathered. The mathematical payoff for these mechanisms is derived from artificially lowering the "cost of dissenting," thereby allowing true private information to enter the public sphere.

Translating Human Hygiene to Machine Protocols

Large Language Models, having been trained on vast corpora of human text and fine-tuned via Reinforcement Learning from Human Feedback (RLHF), exhibit highly analogous failure modes. They suffer from sycophancy, positional bias, and verbosity bias, frequently collapsing into premature agreement when exposed to the outputs of other models. A homogeneous group of LLM agents will amplify their shared biases, creating an echo chamber that overrides correct single-agent answers with eloquent but mathematically unsound logic, entirely simulating human pluralistic ignorance. Therefore, the value in LLM debate arises exclusively from engineering the system to resist these natural linguistic attractors.

Value is generated through the rigorous application of independent first-pass reasoning. This is the foundational step of epistemic extraction. If agents share context during their initial reasoning phase, the system suffers immediate information leakage, and the diverse perspectives collapse into a unified, often flawed, narrative. The Aris framework for autonomous machine learning research demonstrates this principle clearly: an adversarial reviewer must access the raw, unsummarized artifacts. If the reviewer relies on the primary agent's framing or summary of the data, the reviewer silently inherits the primary agent's blind spots and confirmation bias. In trading, this means all agents in a council must independently process the raw deterministic setup features (e.g., order book imbalance, moving average crossovers) before seeing any other agent's interpretation.

Further value is unlocked through explicit role specialization and adversarial critique. Generalist agents instructed simply to "discuss" a problem fail to generate rigorous critique, often falling into agreeable platitudes. Value arises when agents are highly parameterized with adversarial personas. For instance, prompting one agent as a "Thesis Builder" and another as a "Systematic Verifier" enforces semantic divergence. This mirrors human adversarial collaboration by forcing the models to explore disparate areas of the probability space, ensuring that the final evaluation accounts for both the bull case and the structural vulnerabilities.

The Extraction of Epistemic Uncertainty

A single LLM's self-reported confidence is notoriously poorly calibrated and prone to severe overconfidence, rendering it useless as a standalone risk metric. Uncertainty in predictive models can be decomposed into aleatoric uncertainty (inherent input ambiguity or data noise) and epistemic uncertainty (knowledge gaps, parametric limitations, or model ignorance). The multi-agent council excels specifically at extracting this epistemic uncertainty.

When multiple scale-matched, diverse models are given the same setup, their semantic disagreement serves as a direct proxy for epistemic uncertainty. The DiscoUQ (Disagreement-Structure Confidence for Uncertainty Quantification) framework shows that simple vote counting (e.g., 3 out of 5 agents say "Trade") discards critical information. Instead, measuring the linguistic and geometric structure of the disagreement yields better-calibrated uncertainty estimates. For example, if two agents arrive at the same conclusion but rely on entirely different, non-overlapping evidence, this low "evidence overlap" is a powerful fragility indicator. Similarly, measuring "divergence depth" (whether the agents disagree on the fundamental premise of the market regime, i.e., early divergence, or merely on the optimal execution timing, i.e., late divergence) provides a granular risk signal that is invisible to a single model.
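As a rough illustration of the two disagreement features described above, the following sketch computes an evidence overlap score and a divergence depth label. It is not the DiscoUQ implementation: the use of mean pairwise Jaccard overlap, the representation of each agent's reasoning as a list of coarse step labels, and the function names are all assumptions made for this example.

```python
from itertools import combinations


def evidence_overlap(evidence_sets):
    """Mean pairwise Jaccard overlap of the evidence items each agent cited."""
    pairs = list(combinations(evidence_sets, 2))
    if not pairs:
        return 1.0  # a single agent trivially overlaps with itself
    scores = []
    for a, b in pairs:
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def divergence_depth(step_labels):
    """Classify where the agents' reasoning chains first differ.

    step_labels holds one list per agent of coarse labels for each
    reasoning step, ordered from market-regime premise to execution detail.
    """
    n_steps = min(len(steps) for steps in step_labels)
    for i in range(n_steps):
        if len({steps[i] for steps in step_labels}) > 1:
            position = i / n_steps  # 0.0 = first step, approaching 1.0 = last
            if position < 1 / 3:
                return "early"
            return "mid" if position < 2 / 3 else "late"
    return "none"  # no divergence within the compared steps
```

Two agents that reach the same conclusion from disjoint evidence score an overlap of 0.0, which is the fragility case the paragraph describes.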

Finally, value is maximized when the council's output is structured rather than presented as free-form prose. Unconstrained multi-agent debate introduces "persuasion override," where the most verbose, authoritative-sounding agent sways the council, regardless of factual accuracy. An Arbiter role should not output a narrative summary; it must output a structured vector of features. The iMAD protocol demonstrates this by extracting 41 specific linguistic and semantic features representing "hesitation cues" (e.g., hedging, contradictions, shallow reasoning) directly from the agents' self-critique. Structured outputs strip away the rhetorical varnish, leaving only the mathematical skeleton of the disagreement for downstream processing.

Architectural Invariants

Based on empirical evidence across reasoning benchmarks, forecasting tournaments, and uncertainty quantification frameworks, several design elements emerge as strict invariants for a high-stakes trading deliberation system. These are categorized by their degree of necessity to ensure system robustness and reliability.

1. Likely Essential Invariants

The most critical invariant is that heterogeneous model diversity matters far more than agent count. Adding more instances of the same model (e.g., deploying three instances of GPT-4o) provides zero or negative marginal gain in complex forecasting and reasoning tasks. A study utilizing 202 resolved binary questions from the Metaculus Q2 2025 AI Forecasting Tournament found that deliberation improves accuracy (a ~4% relative reduction in Log Loss) only when the council consists of diverse foundation models (e.g., mixing GPT, Claude, and Gemini). Homogeneous groups suffer from highly correlated errors, shared parametric blind spots, and an amplification of the same failure modes. On this evidence, a triad of highly diverse models should be expected to outperform a homogeneous swarm of a dozen agents.

Strict independence before interaction is equally essential. Agents must formulate their initial thesis, reasoning chain, and confidence scores in complete isolation. If the designated Devil's Advocate agent observes the Thesis Builder's conclusion before forming its own baseline read of the market setup, anchoring bias irrevocably corrupts the process. The raw market data must be fed in parallel to all agents, ensuring that their initial analytical vectors are untainted by peer influence.

The architecture must treat disagreement as a first-class output signal, rather than a problem to be solved via consensus. The system must not force a unanimous verdict. The DiscoUQ framework highlights that the "weak disagreement" tier—where vote counting fails to capture the nuance of the debate—contains the highest informational value. If the council is forced into a binary consensus, this rich geometric and linguistic divergence is lost. The preservation of disagreement metrics is essential for generating downstream risk features.

Crucially, the final action must remain outside the council. The LLM council must never emit a direct execution command. Its mandate is strictly epistemic. It analyzes a proposed setup and outputs a multidimensional risk vector. The downstream deterministic risk engine weights these features (e.g., Epistemic Uncertainty Score: 0.85, Hesitation Index: 0.7) against portfolio constraints, volatility targeting, and execution costs to make the final "ACCEPT," "REJECT," or "ABSTAIN" decision. Abstain and veto mechanisms should be triggered deterministically by the risk engine based on the severity of the features generated by the council.

2. Likely Useful Invariants

Implementing hesitation cue extraction via self-critique is a highly efficient mechanism for early filtering. Before engaging multiple agents in a computationally expensive debate, forcing a single agent to output a structured self-critique—arguing for a plausible alternative to its own thesis—generates high-signal hesitation features. The iMAD framework uses this self-critique to extract 41 semantic cues to determine if multi-agent debate is even necessary, saving substantial compute. In trading, high internal hesitation on a first-pass analysis is a strong early indicator for an immediate "ABSTAIN" recommendation, bypassing the need for full council deliberation.

Memory masking for multi-round debate is another highly useful design element. If the council engages in more than one round of critique, it becomes vulnerable to "erroneous memory cascades," where a factual hallucination or logical flaw introduced in Round 1 is accepted as established fact in Round 2. The MAD-M² protocol demonstrates that explicitly masking or discarding erroneous context between rounds significantly improves the robustness of the debate, polishing the contextual information and discarding fallacious memories before the next interaction.

Compute governance via a Sequential Probability Ratio Test (SPRT) offers a rigorous method for resource allocation. Not all trading setups require deep deliberation; simple, obvious rejections can be handled efficiently. Adapting Wald's SPRT as a compute governor allows an Arbiter model to accumulate a log-likelihood ratio of consensus after each round. If the setup is cleanly valid or violently fragile, the monitor crosses the predefined decision boundary early, halting the debate and saving latency and token costs.
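A minimal sketch of such a governor, using Wald's standard SPRT for Bernoulli observations. The framing is an assumption for illustration: each observation is a binary verdict (1 = "fragile", 0 = "valid"), verdicts are treated as independent, and the hypothesised rates `p0` and `p1` and the error rates `alpha` and `beta` are placeholder values rather than figures from any cited framework.

```python
import math


def sprt_governor(verdicts, p0=0.3, p1=0.7, alpha=0.05, beta=0.05):
    """Wald SPRT over binary verdicts (1 = 'fragile', 0 = 'valid').

    H0: P(fragile verdict) = p0, H1: P(fragile verdict) = p1.
    Returns (decision, verdicts_used); decision is 'continue' when
    neither boundary has been crossed.
    """
    upper = math.log((1 - beta) / alpha)  # crossing accepts H1 (fragile)
    lower = math.log(beta / (1 - alpha))  # crossing accepts H0 (valid)
    llr = 0.0
    for n, verdict in enumerate(verdicts, start=1):
        if verdict:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "fragile", n
        if llr <= lower:
            return "valid", n
    return "continue", len(verdicts)
```

With these placeholder parameters, four consecutive identical verdicts are enough to stop, while alternating verdicts keep the log-likelihood ratio between the boundaries and the debate continues.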

3. Uncertain Elements

The efficacy of complex, multi-round cross-chatter remains highly uncertain. The marginal value of debate drops precipitously after the first or second round of critique. Extended debates often devolve into sycophancy, repetitive semantic loops, or context-window degradation. The literature remains mixed on whether debate depths beyond two rounds provide any measurable alpha in reasoning tasks, with many studies suggesting that early divergence is a sufficient signal.

Similarly, allowing human-like "persuasion" mechanics is fraught with risk. Empowering agents to actively attempt to persuade one another often leads to the eloquence override failure mode, where a highly articulate model forces consensus on a mathematically flawed premise. Structured, isolated critique is far safer and more mathematically sound than conversational persuasion.

4. Probably Unnecessary for MVP

End-to-end consensus enforcement is actively detrimental and unnecessary for a Minimal Viable Protocol. Forcing the Arbiter to output a binary "Valid / Invalid" based on agent consensus discards the granular probability distributions required for quantitative risk management. Furthermore, massive agent swarms are unnecessary. There is no empirical evidence supporting the necessity of large councils consisting of seven, ten, or twenty agents. State-of-the-art results in both forecasting accuracy and uncertainty quantification saturate at three to five highly diverse agents.

Minimal Viable Protocol (MVP)

Based on the synthesis of architectural invariants, the following defines the smallest, most credible deliberation protocol worth implementing for a quantitative trading systems architect.

Protocol Objective

The objective of the MVP is to act as a "Fragility and Epistemic Uncertainty Engine." It receives a deterministically identified trading setup (e.g., comprised of technical indicators, order book state metrics, and macroeconomic context variables) and outputs a structured vector of risk features to be used by the execution layer for setup rejection, sizing penalty, or ranking.

Agent Topology and Roles

The MVP utilizes a strictly constrained "3+1 Architecture": Three diverse reasoning agents operating in parallel, overseen by one Arbiter/Extractor. To ensure the invariant of model heterogeneity, the agents must be instantiated from distinct foundation model families.

Agent Role	Recommended Model Class	Primary Directive
Agent 1: Thesis Builder	Claude 3.5 Sonnet / 3.7	Construct the strongest possible bullish or bearish argument for the proposed setup based on the data. Identify primary drivers and supporting historical analogues.
Agent 2: Adversarial Breaker	GPT-4o / o1	Act strictly as a Devil's Advocate. Assume the setup is a trap. Identify hidden correlations, base-rate fallacies, mean-reversion risks, and contradictory evidence.
Agent 3: Systematic Verifier	Gemini 1.5 Pro	Ignore the market narrative entirely. Focus purely on structural mechanics, historical base rates, liquidity conditions, and tail-risk exposure.
Agent 4: The Arbiter	Llama-3-70B (or fine-tuned local)	Synthesize outputs into a structured quantitative feature vector based on DiscoUQ metrics. Does not make a trading decision.

Execution Flow and Independence Requirements

The protocol operates in three distinct, sequential phases to preserve independence and prevent informational cascades.

Phase 1: Isolated First-Pass (Zero Knowledge)

The raw trading setup data is passed independently to Agents 1, 2, and 3. They are entirely unaware of each other's existence or outputs. Each agent generates a structured response containing:

Their primary analytical thesis.
The specific key evidence points utilized from the raw data.
An internal confidence score (0.0 to 1.0).
A self-critique statement outlining the weakest link or most likely failure mode of their own thesis.
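The isolated first pass can be sketched as follows. The four response fields mirror the list above; everything else is hypothetical, including the `FirstPass` and `run_first_pass` names and `stub_agent`, which stands in for a real model call. Each agent receives its own deep copy of the raw setup, so nothing one agent does can reach another.

```python
from concurrent.futures import ThreadPoolExecutor
from copy import deepcopy
from dataclasses import dataclass


@dataclass
class FirstPass:
    agent: str
    thesis: str
    evidence: list
    confidence: float  # 0.0 to 1.0
    self_critique: str


def stub_agent(name, setup):
    # Placeholder: a real agent would prompt its own foundation model here.
    return FirstPass(
        agent=name,
        thesis=f"{name} reading of the setup",
        evidence=sorted(setup),
        confidence=0.5,
        self_critique="not yet stress-tested",
    )


def run_first_pass(setup, agents):
    """Call every agent in parallel on a private copy of the raw setup."""
    def call(item):
        name, agent_fn = item
        return agent_fn(name, deepcopy(setup))  # no shared mutable state

    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        return list(pool.map(call, agents.items()))
```

Results come back in the order the agents were registered, and no agent function is ever handed another agent's output.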

Phase 2: Adversarial Cross-Critique (One Round Only)

To prevent sycophancy loops, the debate is limited to a single round of cross-examination. Agent 1's thesis and evidence are passed to Agent 2, while Agent 2's thesis and evidence are passed to Agent 1. Agent 3 reviews both. Crucially, the prompt engineering for this phase must be highly asymmetric. Agents are specifically instructed to identify logical flaws, missing evidence, or overconfidence in their peers' reasoning. They are explicitly forbidden from simply agreeing or summarizing.

Phase 3: Feature Extraction (The Arbiter)

The Arbiter ingests the entire contextual trace of Phase 1 and Phase 2. Instead of writing a prose summary or attempting to force a consensus, the Arbiter executes a prompt designed to extract structured disagreement metrics based on the DiscoUQ and iMAD frameworks.

Council Outputs and the Deterministic Boundary

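The Arbiter produces a standardized JSON payload containing the features below: three continuous scores, one categorical field, and one Boolean flag derived from those scores by a fixed threshold. The council's job ends strictly at the generation of this payload.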

Feature Name	Value Range	Description & Utility
`evidence_overlap_score`	0.0 - 1.0	The degree to which agents relied on the same underlying data. Low overlap indicates a fragile, highly ambiguous setup.
`divergence_depth`	Categorical (Early, Mid, Late)	Whether agents disagreed on the fundamental market regime (Early) or merely on the execution timing (Late).
`hesitation_index`	0.0 - 1.0	Aggregate score of linguistic hesitation cues (hedging, internal contradictions) extracted from the self-critiques.
`epistemic_uncertainty_score`	0.0 - 1.0	The calculated magnitude of cross-model semantic disagreement, combining geometric embedding distances and linguistic divergence.
`abstain_flag`	Boolean (True/False)	Flagged TRUE if the `epistemic_uncertainty_score` or `hesitation_index` exceeds a pre-calibrated threshold.

The deterministic boundary ensures that portfolio risk constraints, volatility scaling, and the actual execution router remain entirely outside the LLM architecture. If the council outputs an epistemic_uncertainty_score of 0.85, the downstream algorithmic layer deterministically applies a mathematical penalty to the setup's expected value, potentially crossing the threshold into a hard REJECT or ABSTAIN state.
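To make the boundary concrete, here is a sketch of the downstream side: it validates the Arbiter's JSON payload and applies a deterministic rule. The field names come from the table above; the linear penalty, the `abstain_at`, `penalty_weight`, and `hurdle` parameters, and the choice to recompute the abstain condition in the risk engine rather than trust the payload's `abstain_flag` are all assumptions for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CouncilFeatures:
    evidence_overlap_score: float
    divergence_depth: str
    hesitation_index: float
    epistemic_uncertainty_score: float

    def __post_init__(self):
        for name in ("evidence_overlap_score", "hesitation_index",
                     "epistemic_uncertainty_score"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} out of range: {value}")
        if self.divergence_depth not in ("early", "mid", "late"):
            raise ValueError(f"bad divergence_depth: {self.divergence_depth}")


def parse_payload(raw):
    """Parse and validate the Arbiter's JSON payload."""
    data = json.loads(raw)
    return CouncilFeatures(
        evidence_overlap_score=float(data["evidence_overlap_score"]),
        divergence_depth=str(data["divergence_depth"]).lower(),
        hesitation_index=float(data["hesitation_index"]),
        epistemic_uncertainty_score=float(data["epistemic_uncertainty_score"]),
    )


def decide(features, expected_value, hurdle, abstain_at=0.8, penalty_weight=0.5):
    """Deterministic rule: returns (decision, penalized_expected_value)."""
    worst = max(features.epistemic_uncertainty_score, features.hesitation_index)
    if worst >= abstain_at:
        return "ABSTAIN", 0.0
    penalized = expected_value * (
        1.0 - penalty_weight * features.epistemic_uncertainty_score
    )
    return ("ACCEPT" if penalized >= hurdle else "REJECT"), penalized
```

With the placeholder `abstain_at` of 0.8, a payload carrying an `epistemic_uncertainty_score` of 0.85 resolves to ABSTAIN regardless of the setup's raw expected value. In practice the thresholds would be calibrated on historical data.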

Failure Modes and Mitigations

Deploying multi-agent systems in financial markets introduces unique, compounding failure modes that differ significantly from standard reasoning benchmarks. The following is a ranked list of the most critical vulnerabilities, along with architectural mitigations.

1. Collapse to Agreement (Sycophancy & Majority Pressure)

The most pervasive failure mode is the collapse of independent thought. LLMs are heavily fine-tuned via RLHF, which intrinsically rewards polite, agreeable, and sycophantic behavior. In a debate setting, minority agents often abandon correct, contrarian insights when faced with a confident majority, perfectly simulating a human informational cascade. Controlled logical reasoning studies demonstrate that "majority pressure suppresses independent correction". Mitigation: Enforce absolute isolation during the Phase 1 initial analysis. Utilize asymmetric prompting in Phase 2, explicitly instructing the Adversarial Breaker that its objective function is maximized only by finding flaws, and that agreement constitutes a failure state. Utilizing highly diverse foundation models minimizes shared RLHF behavioral collapse.

2. Correlated Errors (Model Homogeneity)

If a council relies on multiple instances of the same model family (e.g., a swarm of Llama-3 models), they will share the exact same latent space representations and training data biases. If the foundational model fundamentally misunderstands a specific volatility regime, adding more instances of that model will simply result in a high-confidence, unanimous wrong answer, creating a false sense of robustness. The Metaculus forecasting study demonstrated that homogeneous groups provided zero benefit over a single model and occasionally degraded performance. Mitigation: Enforce the architectural invariant of strict model heterogeneity. The council must cross corporate and architectural lines, mixing dense models with Mixture-of-Experts (MoE) models, to ensure uncorrelated error distributions.

3. Eloquence Override (Verbosity and Style Bias)

In unstructured debates, LLM Judges frequently conflate linguistic fluency, extreme confidence, and verbosity with analytical accuracy. An agent that writes a beautifully formatted, highly confident, but mathematically disastrous thesis can easily "persuade" the Arbiter. Studies highlight that "eloquent but incorrect arguments prevail over sound reasoning," as judges suffer heavily from verbosity and sycophancy biases. Mitigation: Eliminate conversational persuasion entirely. Force the Arbiter to evaluate outputs using a strict rubric of structured matrices rather than holistic prose reading. The DiscoUQ methodology of analyzing the geometry of the embeddings (via cosine similarity and cluster distances) rather than just the raw text bypasses stylistic bias entirely.
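One way to picture the geometric side of this mitigation is a dispersion score over the agents' response embeddings. The sketch below assumes the embeddings have already been produced by some embedding model (not shown) and uses mean pairwise cosine distance. This is an illustrative proxy, not the exact DiscoUQ computation.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("zero-length embedding")
    return dot / norm


def embedding_dispersion(vectors):
    """Mean pairwise cosine distance (1 - similarity) across agents."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # one agent cannot disagree with itself
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)
```

Because the score depends only on where the responses sit in embedding space, a longer or more confident-sounding answer gains no extra weight simply by being verbose.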

4. Hidden Information Leakage

When the reviewing agent (e.g., the Devil's Advocate) is only given the Thesis Builder's summary of the setup rather than the raw data itself, the reviewer is artificially constrained by the primary agent's framing. The Aris autonomous research framework defines "reviewer independence" as the capacity for the reviewer to form an assessment directly from raw artifacts; summaries inherently carry confirmation bias and silently inherit the executor's blind spots. Mitigation: All agents must be grounded on the identical, raw deterministic data feed. During the critique phase, the reviewer evaluates the raw data in conjunction with the opponent's thesis, never the thesis in a vacuum.

5. Erroneous Memory Cascades

In multi-round debates, if a factual hallucination or logic error goes unchecked in Round 1, it enters the context window and is treated as established ground truth in Round 2, so the error compounds with each subsequent round. LLMs are highly vulnerable to these erroneous memories, posing a severe threat to debate performance. Mitigation: Implement Memory Masking (MAD-M²). At the beginning of a new critique round, a specialized sub-routine polishes the context, discarding claims that lack grounding in the original prompt. Furthermore, limiting the system to a maximum of one or two critique rounds naturally caps the cascade potential.
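A deliberately crude sketch of the grounding filter described in the mitigation. It is not the MAD-M² algorithm: it assumes each claim is a dictionary with a hypothetical `cites` list naming the raw setup fields it relies on, and it keeps a claim only if every cited field actually exists in the original setup.

```python
def mask_ungrounded(claims, setup):
    """Drop claims that cite nothing, or cite fields absent from the raw setup."""
    kept = []
    for claim in claims:
        cited = claim.get("cites", [])
        if cited and all(field in setup for field in cited):
            kept.append(claim)
    return kept
```

A real implementation would also need to check that the claim's content matches the cited values. This sketch only checks that the citation points at something that exists.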

Experimental Hypotheses

To validate the MVP within a quantitative research environment, the following concrete hypotheses should guide the first phase of backtesting. These hypotheses are designed strictly to measure epistemic value and risk calibration, isolating the council's performance from downstream portfolio execution variables.

Hypothesis 1: Structured Disagreement Outperforms Vote Counting in Calibration

Extracting disagreement structure features (e.g., divergence depth, evidence overlap, geometric cluster dispersion) via a diverse multi-agent council will yield a statistically significant improvement in Expected Calibration Error (ECE) compared to both a single-LLM baseline and a simple multi-agent majority voting mechanism. The prediction aligns with the DiscoUQ findings, hypothesizing that mapping the internal structure of disagreement more accurately tracks true epistemic uncertainty.

Hypothesis 2: Epistemic Uncertainty Improves Rejection Precision

Utilizing the council's derived epistemic_uncertainty_score as a direct filter on a baseline deterministic trading strategy will materially improve the rejection precision of the system. Specifically, the subset of trades flagged for rejection or abstention by the council will demonstrate a historically higher proportion of false positives (loss-making trades) than the unfiltered baseline, thereby increasing the gross Sharpe ratio of the remaining accepted signal.

Hypothesis 3: The Heterogeneity Premium Under Regime Shift

A deliberative council composed of three distinct foundation models (e.g., OpenAI, Anthropic, Google) will exhibit significantly lower correlated failure rates and maintain superior calibration during high-volatility regime shifts compared to a homogeneous council of three instances of the single most performant model. This tests the core invariant that diversity prevents catastrophic consensus failure.

Hypothesis 4: Self-Critique Hesitation as an Abstention Metric

The hesitation_index—derived purely from the single-agent self-critique phase before any cross-debate occurs—will exhibit a higher mutual information score with subsequent severe drawdowns than traditional technical indicators of market noise. This posits that an LLM's internal representation of ambiguity contains latent predictive value regarding market fragility.
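To test this hypothesis, the mutual information can be estimated from discretized samples. The sketch below uses the plug-in estimator in bits; the bucket edges and the encoding of "severe drawdown" as a binary label are assumptions for this example, and the same estimator would be applied to each comparison indicator.

```python
import math
from collections import Counter


def to_bucket(value, edges):
    """Index of the first edge the value falls below; len(edges) if none."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (count / n) * math.log2(count * n / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )
```

The plug-in estimator is biased upward on small samples, so comparisons between indicators should use the same sample size and bucket count.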

Evaluation Design

The evaluation protocol must rigorously avoid utilizing Profit and Loss (PnL) as the primary optimization metric. PnL inherently conflates the epistemic quality of the deliberation with the mechanical quality of the downstream execution, slippage, and portfolio sizing. The experiment must focus entirely on pre-PnL metrics derived from historical out-of-sample data.

1. Rejection and Accept Precision

Establish a baseline deterministic trading strategy (e.g., a mean-reverting statistical arbitrage model or a momentum breakout system) that generates a continuous stream of setup signals. Label historical setups retroactively as True Positive (profitable after execution costs) or False Positive (loss-making). The primary evaluation metric is the council's ability to selectively prune False Positives (Rejection Precision) without inadvertently discarding True Positives (Accept Precision). A successful council increases the predictive validity of the signal before it ever reaches the portfolio construction phase.
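The two precision metrics can be computed directly from labelled history. The definitions used here are one reasonable reading of the text, stated as an assumption: rejection precision is the share of rejected setups that were loss-making, and accept precision is the share of accepted setups that were profitable.

```python
def precision_metrics(rejected, profitable):
    """rejected[i]: council rejected setup i; profitable[i]: setup i made
    money after execution costs. Returns a dict of rates (NaN when undefined)."""
    if len(rejected) != len(profitable):
        raise ValueError("inputs must have equal length")
    rejected_outcomes = [p for r, p in zip(rejected, profitable) if r]
    accepted_outcomes = [p for r, p in zip(rejected, profitable) if not r]

    def rate(hits, total):
        return hits / total if total else float("nan")

    return {
        # Share of rejected setups that really were loss-making.
        "rejection_precision": rate(
            sum(1 for p in rejected_outcomes if not p), len(rejected_outcomes)
        ),
        # Share of accepted setups that were profitable.
        "accept_precision": rate(
            sum(1 for p in accepted_outcomes if p), len(accepted_outcomes)
        ),
        # Win rate of the unfiltered strategy, for comparison.
        "baseline_precision": rate(sum(1 for p in profitable if p), len(profitable)),
    }
```

The council adds value on this measure when `accept_precision` exceeds `baseline_precision` while `rejection_precision` stays high.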

2. Calibration via Expected Calibration Error (ECE)

The most vital function of the council is uncertainty quantification. If the council outputs an aggregated confidence score of 80% for a given subset of setups, those setups should historically succeed roughly 80% of the time. To measure this, group the council's continuous confidence outputs into deciles. Calculate the mean accuracy within each bin and compute the Expected Calibration Error (ECE), the sample-weighted average gap between mean confidence and accuracy across bins. The multi-agent DiscoUQ framework reports that measuring disagreement structure substantially lowers ECE compared to single-agent baselines, which gives a useful reference point for this evaluation.
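A minimal ECE sketch. It uses ten equal-width confidence bins, which is the common convention and an assumption here; equal-mass (true decile) binning is a straightforward variation.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE: sample-weighted mean |confidence - accuracy| over equal-width bins.

    confidences are in [0, 1]; outcomes are 1 (success) or 0 (failure).
    """
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, outcome))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A cohort scored at 0.8 that succeeds four times out of five contributes essentially zero to the error, while a cohort scored at 0.9 that never succeeds contributes 0.9.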

3. Abstain Quality and Ranking Monotonicity

A robust epistemic system knows when it does not know. The council should identify setups where the data is fundamentally ambiguous, resulting in high epistemic uncertainty. To measure Abstain Quality, compare the variance and maximum drawdown of the cohort of setups where the council recommended "ABSTAIN" versus those it recommended to "ACCEPT." The abstain bucket should mathematically correspond to the highest noise, lowest expectancy trades.
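A sketch of the cohort comparison. It assumes per-setup returns are additive (for example, in basis points or R-multiples) and ordered in time, and it measures drawdown on the cumulative sum starting from zero; the function names are illustrative.

```python
from statistics import pvariance


def max_drawdown(returns):
    """Largest peak-to-trough drop of the cumulative return path."""
    peak = cumulative = worst = 0.0
    for r in returns:
        cumulative += r
        peak = max(peak, cumulative)
        worst = max(worst, peak - cumulative)
    return worst


def cohort_summary(returns):
    return {"variance": pvariance(returns), "max_drawdown": max_drawdown(returns)}


def abstain_quality(abstain_returns, accept_returns):
    """Compare the ABSTAIN cohort against the ACCEPT cohort."""
    abstain = cohort_summary(abstain_returns)
    accept = cohort_summary(accept_returns)
    return {
        "abstain": abstain,
        "accept": accept,
        "abstain_is_noisier": abstain["variance"] > accept["variance"],
    }
```

If the council abstains well, the ABSTAIN cohort should show the higher variance and the deeper drawdown of the two.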

Furthermore, evaluate Ranking Monotonicity by ranking all accepted setups by the council's aggregated confidence score. Calculate the Spearman rank correlation between this confidence score and the subsequent realized returns. A correlation close to +1 indicates that the council is consistently grading setup quality.
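For reference, a dependency-free sketch of the Spearman calculation: rank both series (ties receive their average rank), then take the Pearson correlation of the ranks. In practice a statistics library would normally be used instead.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation; NaN if either series is constant."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    if sd_x == 0.0 or sd_y == 0.0:
        return float("nan")
    return cov / (sd_x * sd_y)
```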

4. Stability Across Market Regimes

Financial markets are famously non-stationary. A council architecture that appears highly calibrated during a low-volatility bull market may collapse entirely during a sudden liquidity crisis or macro shock. Segment the backtest data into distinct regimes (e.g., trending, mean-reverting, high-volatility, distressed), for example with a hidden Markov model (HMM). Measure the council's ECE and Rejection Precision within each regime independently. True architectural invariants will maintain calibration stability regardless of the underlying market mechanics.

Recommended First Experiment

The smallest high-signal experiment worth running to validate this architecture is an offline, out-of-sample evaluation over 2,000 historical, deterministically generated trading setups that cover at least two distinct market regimes.

Run the 2,000 setups through a single state-of-the-art model (e.g., GPT-4o) using standard chain-of-thought prompting to establish a baseline confidence score and baseline Expected Calibration Error.
Run the identical setups through the 3+1 MVP Architecture defined above, ensuring strict model heterogeneity (e.g., Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro) and absolute isolation during the first pass.
Utilize the Arbiter model strictly to execute the DiscoUQ feature extraction methodology, outputting the structured vector containing the epistemic_uncertainty_score and hesitation_index.
Compare the Area Under the Receiver Operating Characteristic (AUROC) curve of the baseline single-model confidence against the council's structured disagreement vector in predicting trade success.
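The final comparison step can be sketched with a rank-based AUROC (the Mann-Whitney formulation), which needs no external libraries. The quadratic pairwise loop is an implementation shortcut that is adequate at the scale of 2,000 setups. How the council's feature vector is collapsed into a single score (for example, `1 - epistemic_uncertainty_score`) is left open and would be an assumption of the experiment.

```python
def auroc(scores, labels):
    """Probability that a random success outranks a random failure.

    scores: higher means 'more likely to succeed'; labels: 1 = success,
    0 = failure. Ties between a success and a failure count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Running it once on the single-model confidence scores and once on the council-derived score, against the same trade outcomes, yields the two numbers to compare.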

If the council's AUROC and calibration significantly exceed the single-model baseline, the architecture has proven its foundational premise: structured, adversarial multi-agent debate reliably generates predictive risk and fragility features that a single agent cannot perceive. This validates the integration of the deliberation layer as a critical epistemic filter within the broader quantitative trading engine.