May 25, 2026

The Council Doesn't Vote: A Fourth Axis for LLM Trading Decisions

Karpathy built an LLM Council. The honest negative result he reported is the most important thing in that repo. Here's what I learned digging into it, and the minimal experiment I'm running on my own trading engine.

A fourth axis for LLM trading decisions, after Karpathy's experiment, and the contrast with people who deploy LLMs through prayer

I run a trading engine where an LLM is in the decision loop. For about eighteen months I've been improving it along three axes that every serious LLM-in-production builder eventually finds.

The first is brains: swap the foundation model for a better one. Easy gain when the frontier moves, zero invention required, runs out of road the moment everyone else does the same thing on the same day.

The second is data quality: clean the inputs, add the market-structure features that a generic model can't compute on its own, kill the noise that makes the model hallucinate from nothing. Bigger gain than swapping brains, but heavy, slow, and bounded by how much signal actually exists in the underlying market.

The third is role focus: stop asking one generalist agent to do everything. Carve the decision into narrow sub-agents — one validates the setup, one scores the risk, one reasons about geometry — each with a tight prompt and a single job. This is where most of my real gains have come from in the last year.

All three of these axes are about making one LLM, somewhere in the pipeline, do its single job better.

There is a fourth axis. It's the one I want to talk about. It's about not relying on one LLM at all.


The Karpathy Moment

When Andrej Karpathy dropped llm-council in late 2025, my reaction was simultaneously "obviously" and "wait, his result is more interesting than the framing suggests".

The setup is clean: a user query goes out in parallel to a panel of frontier models, in his case GPT-5.1, Claude Sonnet 4.5, Gemini 3 Pro, and Grok 4. Each model answers independently. Then the responses are passed back to every model anonymously, and each one ranks the others. Finally, a designated chairman model synthesizes everything into a single final answer.

Three stages: poll, anonymized peer review, chairman synthesis. A "board of directors" of frontier LLMs.

The honest part of Karpathy's report — the part that should be the headline, not the footnote — is this. The chairman's synthesized answer was often worse than the best individual response. And the models, even when stripped of identifying signatures, rated each other more generously than they rated themselves. Sycophancy isn't an intra-model bug. It's a cultural attractor across the entire frontier, because they were all trained on overlapping human feedback that rewarded the same agreeable, polite, hedged register.

For a toy app, that's a fun result to ship and tweet. For a system where the output rejects or accepts a setup with real money on the other side, it's a red flag the size of a flagpole.

But the question Karpathy raised — does a deliberating council of LLMs produce something a single LLM can't? — is the right question for my fourth axis. The answer is yes, but only if you build the council to do something other than the thing it naturally wants to do.


What Goes Wrong When LLMs Try to Agree

The chairman-worse-than-best-individual result isn't a quirk. It's the predictable output of four overlapping failure modes that any honest multi-agent design has to engineer against.

Sycophancy. RLHF rewards agreement. Frontier models trained on overlapping human feedback share this reflex. Put them in a room and they'll converge on the most palatable middle position, not the most accurate one.

Eloquence override. When you ask one model to evaluate another's reasoning, it confuses fluency and confidence with correctness. The articulate-but-wrong answer beats the awkward-but-right one. This is the same bug that makes human committees pick the smoothest presenter.

Correlated errors. Two GPT instances make the same mistake on the same prompt. Three of them make it with higher confidence. The illusion of consensus is the most dangerous output of a homogeneous council, because it feels like evidence.

Informational cascades. The moment model B sees model A's answer before forming its own, anchoring bias takes the wheel. The diversity you paid for collapses into a single narrative with three signatures.
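To put a number on the correlated-errors point, here is a toy calculation, not a model of any real LLM. It assumes each council member is wrong with probability `p`, and that with probability `rho` all members share one error while otherwise they err independently. The function names and the mixing parameter `rho` are illustrative assumptions.

```python
from math import comb


def majority_error_independent(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is wrong."""
    k_min = n // 2 + 1  # smallest number of wrong voters that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def majority_error_mixed(p: float, n: int, rho: float) -> float:
    """Toy mixture: with probability rho every voter shares the same error
    (so the majority is wrong with probability p); otherwise voters are independent."""
    return rho * p + (1 - rho) * majority_error_independent(p, n)


if __name__ == "__main__":
    p = 0.3
    for rho in (0.0, 0.5, 1.0):
        print(f"rho={rho:.1f}  majority-of-3 error={majority_error_mixed(p, 3, rho):.3f}")
```

With `p = 0.3`, three independent voters cut the majority error to 0.216, while three fully correlated voters stay at 0.3 and merely look unanimous. Under this toy assumption, the benefit of adding voters shrinks linearly as `rho` grows.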

Humans figured this out in the twentieth century. Investment committees, intelligence analysis, medical diagnostics — every domain that has to make consequential decisions under uncertainty rediscovered the same fix. Adversarial collaboration. Devil's advocate roles. Structured dissent. The point isn't to add more brains. The point is to engineer against the social attractors that destroy whatever diversity the brains were supposed to provide.

A council that collapses into consensus throws away the most expensive thing it produced: its disagreement.


Disagreement as Feature, Not Bug

This is the flip the rest of the architecture hinges on.

If you treat a multi-LLM panel as a voting machine, you've spent three times the tokens to get a slightly noisier version of a single model's answer. The vote count is the cheapest possible compression of what just happened. It throws away the shape of the disagreement, which is the only thing the council can give you that a single model cannot.

What does "shape of disagreement" mean concretely? A few things, all measurable:

  • Evidence overlap. Did the models reach similar conclusions from the same pieces of input data, or from completely different ones? Two agents agreeing on different evidence is a fragility signal, not a strength signal — it means the right answer depends on which lens you happen to use.
  • Divergence depth. Did the models disagree on the fundamental premise (is this even the right kind of market right now?) or only on tactical execution (the premise is fine, the timing is debatable)? Early divergence and late divergence are completely different risk regimes.
  • Hesitation cues. Forced self-critique surfaces the linguistic tells of a model that doesn't actually believe its own thesis — hedging, contradictions, shallow chains of reasoning. A confident-looking output with five hesitation cues underneath is a setup the system should refuse.

None of this requires the council to produce a verdict. It requires the council to produce a structured vector describing the geometry of its own internal disagreement. That vector is the new feature you couldn't compute before. It's the entire point of the fourth axis.
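One way these three measurements could be computed from agent traces is sketched below. Everything here is an assumption made for illustration: the `AgentTrace` fields, the use of mean pairwise Jaccard similarity for evidence overlap, the three depth labels, and the crude hedge-word list are not taken from any existing system.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet, Sequence


@dataclass(frozen=True)
class AgentTrace:
    stance: str               # hypothetical labels: "accept" or "reject"
    premise_ok: bool          # does the agent accept the basic market premise?
    evidence: FrozenSet[str]  # names of the input fields the agent cited
    text: str                 # the agent's self-critique, as free text


# Illustrative hedge words only; a real list would need tuning.
HEDGES = ("might", "possibly", "unclear", "however", "not sure")


def evidence_overlap(traces: Sequence[AgentTrace]) -> float:
    """Mean pairwise Jaccard similarity of the cited evidence sets (0..1)."""
    scores = []
    for a, b in combinations(traces, 2):
        union = a.evidence | b.evidence
        scores.append(len(a.evidence & b.evidence) / len(union) if union else 1.0)
    return sum(scores) / len(scores) if scores else 1.0


def divergence_depth(traces: Sequence[AgentTrace]) -> str:
    """'premise' if agents split on the premise, 'tactical' if only on stance."""
    if len({t.premise_ok for t in traces}) > 1:
        return "premise"
    if len({t.stance for t in traces}) > 1:
        return "tactical"
    return "none"


def hesitation_index(traces: Sequence[AgentTrace]) -> float:
    """Average number of hedge-word occurrences per agent (substring match)."""
    total = sum(t.text.lower().count(h) for t in traces for h in HEDGES)
    return total / len(traces)
```

A keyword count is a blunt stand-in for real hesitation detection. The sketch is only meant to show that each dimension can be reduced to a number or a category without asking any model for a verdict.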


Four Architectural Invariants

If you want this to actually work — and not just become an expensive way to triple your API bill while degrading your accuracy — there are four things that have to be true. I treat them as invariants because every empirical study I've read and every honest negative result (including Karpathy's) traces back to violating at least one of them.

1. Heterogeneity beats count. Three models from different foundation families (Anthropic, OpenAI, Google, ideally a mix of dense and MoE architectures) beat five instances of the same model every time. Multiple copies of one model don't deliberate; they coordinate on shared blind spots. The diversity that matters is across training data, RLHF philosophy, and architectural lineage, not across prompts to the same backend.

2. Isolation before interaction. Every agent forms its first thesis on the raw input alone, without seeing any other agent's output. The moment anchoring leaks in, the council collapses into a chorus. Phase one is independent. Phase two is structured critique on the now-finalized first-pass outputs. Phase three never re-opens phase one.

3. Structured output, not prose. The arbiter does not write a paragraph. It emits a fixed JSON shape: evidence overlap score, divergence depth category, hesitation index, abstain flag. Prose summaries inherit eloquence override. Structured numeric output strips the rhetoric off and leaves the geometry.

4. Deterministic boundary. The council never decides anything that matters. It produces features. A deterministic risk layer downstream reads those features and applies hard rules — accept, reject, abstain, size-down. The execution stays in code you can read line by line. The LLM stays in the role it can actually do well: opinion-generation under uncertainty.

If you violate any of these, you don't have a council. You have a slightly more expensive single model.
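Invariants 3 and 4 can be sketched together: a strict parser for the arbiter's JSON, and a plain rule function downstream of it. The field names mirror the four fields listed above, but the exact key spellings, the depth labels, and every threshold are hypothetical values chosen for illustration.

```python
import json
from dataclasses import dataclass
from enum import Enum

DEPTHS = {"none", "tactical", "premise"}  # assumed category labels


class Action(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    ABSTAIN = "abstain"
    SIZE_DOWN = "size_down"


@dataclass(frozen=True)
class CouncilVector:
    evidence_overlap: float  # 0..1
    divergence_depth: str    # one of DEPTHS
    hesitation_index: float  # >= 0
    abstain: bool


def parse_vector(raw: str) -> CouncilVector:
    """Parse the arbiter's output; anything that is not the fixed shape raises."""
    vec = CouncilVector(**json.loads(raw))  # TypeError on missing or extra keys
    if not isinstance(vec.abstain, bool):
        raise ValueError("abstain must be a boolean")
    if vec.divergence_depth not in DEPTHS:
        raise ValueError(f"unknown divergence_depth: {vec.divergence_depth!r}")
    if not 0.0 <= float(vec.evidence_overlap) <= 1.0:
        raise ValueError("evidence_overlap must be in [0, 1]")
    if float(vec.hesitation_index) < 0.0:
        raise ValueError("hesitation_index must be non-negative")
    return vec


def decide(vec: CouncilVector, max_hesitation: float = 2.0, min_overlap: float = 0.3) -> Action:
    """Deterministic risk rules; the thresholds are placeholders, not tuned values."""
    if vec.abstain:
        return Action.ABSTAIN
    if vec.divergence_depth == "premise" or vec.hesitation_index > max_hesitation:
        return Action.REJECT
    if vec.divergence_depth == "tactical" or vec.evidence_overlap < min_overlap:
        return Action.SIZE_DOWN
    return Action.ACCEPT
```

A prose reply from the arbiter fails inside `json.loads` instead of being interpreted, which is the point: the only path from the council to an order runs through `decide`, and `decide` contains no model call.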


The Minimum Viable Experiment

Here is the smallest version of this I'm actually going to run on my trading engine in the next few weeks.

Topology. Three reasoning agents and one arbiter. Three distinct foundation families. Roles are asymmetric on purpose:

  • Thesis Builder. Construct the strongest possible case for the setup based on the raw input.
  • Adversarial Breaker. Assume the setup is a trap. Find the failure mode. The prompt explicitly states that the breaker's objective is maximized only by finding flaws — agreement is a failure state for this role.
  • Systematic Verifier. Ignore narrative entirely. Reason from base rates, liquidity conditions, and structural mechanics.
  • Arbiter. Read all three traces. Emit the structured vector. No prose, no verdict.

Flow. Phase one: raw setup data goes to all three reasoning agents in parallel, zero knowledge of each other. Each produces a thesis, the evidence it leaned on, an internal confidence score, and a self-critique naming its own weakest link. Phase two: one — exactly one — round of cross-critique. The thesis-builder's output goes to the breaker and vice versa; the verifier reviews both. Asymmetric prompts forbid agreement-for-its-own-sake. Phase three: the arbiter ingests the full trace and emits the vector.
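The three-phase flow can be sketched as a small orchestration function. This is a minimal sketch under stated assumptions: agents are modeled as generic async callables from prompt to text, with no real provider API involved, and the role keys, prompt wording, and `make_stub` helper are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# An agent is any async callable mapping a prompt to a response string.
Agent = Callable[[str], Awaitable[str]]

# Who critiques whom in phase two: builder and breaker swap, verifier reads both.
CRITIQUE_ROUTING = {
    "builder": ["breaker"],
    "breaker": ["builder"],
    "verifier": ["builder", "breaker"],
}


async def run_council(setup: str, agents: Dict[str, Agent], arbiter: Agent) -> str:
    names = list(CRITIQUE_ROUTING)

    # Phase one: every agent sees only the raw setup, in parallel.
    first_pass = await asyncio.gather(*(agents[n](setup) for n in names))
    theses = dict(zip(names, first_pass))  # never modified after this point

    # Phase two: exactly one round of cross-critique on the finished theses.
    async def critique(name: str) -> str:
        targets = "\n".join(f"[{t}] {theses[t]}" for t in CRITIQUE_ROUTING[name])
        return await agents[name](f"CRITIQUE ONLY, DO NOT REVISE YOUR THESIS:\n{targets}")

    second_pass = await asyncio.gather(*(critique(n) for n in names))
    critiques = dict(zip(names, second_pass))

    # Phase three: the arbiter receives the full trace and nothing else.
    trace = "\n".join(
        f"[{n}] thesis: {theses[n]}\n[{n}] critique: {critiques[n]}" for n in names
    )
    return await arbiter(trace)


def make_stub(name: str, log: List[Tuple[str, str]]) -> Agent:
    """Stand-in agent that records every prompt it receives."""
    async def agent(prompt: str) -> str:
        log.append((name, prompt))
        return f"{name} output"
    return agent
```

Because the phase-one theses are stored once and only read afterwards, there is no code path by which a critique can rewrite a first-pass thesis. Stub agents like `make_stub` make it cheap to check that no phase-one prompt contains another agent's output before any real model is wired in.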

Evaluation. This is where most "AI agent" projects quietly cheat. I'm not optimizing for PnL on this experiment, because PnL conflates the council's epistemic quality with execution slippage and sizing — and one of them will mask the other regardless of which way it goes.

I'm optimizing for three pre-PnL metrics on a few thousand historical setups, out of sample, across at least two distinct market regimes.

  • Calibration. If the council says 80% confidence on a bucket of setups, those setups should historically resolve favorably about 80% of the time. I measure this as Expected Calibration Error across confidence deciles. This is the part I care about most.
  • Rejection precision. Of the setups the council flags for rejection or abstention, how many were genuine false signals that would have lost money? And how many winning setups did it throw out by mistake?
  • Ranking monotonicity. Sort accepted setups by the council's aggregated confidence. The realized return distribution should monotonically improve as confidence climbs. If it doesn't, the score isn't a score, it's a vibe.
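A rough sketch of the three metrics in plain Python follows. The assumptions: ten equal-width confidence bins stand in for "deciles", outcomes are coded as 1 for a favorable resolution and 0 otherwise, and monotonicity is checked on bucket means of realized return. Function names and the bucket count are illustrative.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(conf: Sequence[float], won: Sequence[int], n_bins: int = 10) -> float:
    """Weighted mean gap between stated confidence and realized hit rate per bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, won):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    total = len(conf)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            hit_rate = sum(y for _, y in b) / len(b)
            ece += len(b) / total * abs(avg_conf - hit_rate)
    return ece


def rejection_stats(rejected: Sequence[int], won: Sequence[int]) -> Tuple[float, float]:
    """Returns (share of rejections that were losers, share of winners rejected)."""
    n_rejected = sum(rejected)
    n_winners = sum(won)
    correct = sum(1 for r, w in zip(rejected, won) if r and not w)
    lost_winners = sum(1 for r, w in zip(rejected, won) if r and w)
    precision = correct / n_rejected if n_rejected else float("nan")
    winner_loss = lost_winners / n_winners if n_winners else float("nan")
    return precision, winner_loss


def is_rank_monotone(conf: Sequence[float], returns: Sequence[float], n_buckets: int = 3) -> bool:
    """True if mean realized return never decreases from low to high confidence buckets."""
    ranked = [r for _, r in sorted(zip(conf, returns), key=lambda pair: pair[0])]
    size = len(ranked) // n_buckets
    means = []
    for i in range(n_buckets):
        # The last bucket absorbs any remainder.
        chunk = ranked[i * size:] if i == n_buckets - 1 else ranked[i * size:(i + 1) * size]
        means.append(sum(chunk) / len(chunk))
    return all(a <= b for a, b in zip(means, means[1:]))
```

None of these functions touches PnL sizing or execution, which keeps the evaluation on the council's epistemic output alone. `is_rank_monotone` assumes at least `n_buckets` setups; with a few thousand historical setups that is not a practical constraint.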

If the council beats a single best-in-class model on calibration and rejection precision — and I think it will, because the shape-of-disagreement signal is genuinely new information a single agent can't produce — then the fourth axis is real. If it doesn't, I've learned something cheap and important, and I move on.

That's the whole experiment. It's small. It's measurable. It has a defined failure case. It costs maybe one weekend of engineering and a small API budget.


The Contrast

There are two ways to use LLMs in production right now.

One is the way the loud half of the timeline uses them. You take the latest frontier model, hand it whatever data you have, write a prompt that sounds confident, ship it, post a thread about how AI changed everything. When it fails — and it fails, repeatedly, in expensive ways that don't show up in screenshots — you blame the model, or you wait for the next version, or you add another prompt on top, or you call it an "agent" and add another LLM that calls the first one. The system is held together by the assumption that the next model release will fix the holes the current one leaves. It scales exactly as far as your runway, and not one inch further.

The other way is to ask, for every place an LLM sits in your pipeline, three boring questions. Where exactly does this help? What exactly fails when it doesn't? What does the surrounding system have to look like to amplify the help and absorb the failure? The model becomes a component, not a savior. Its outputs become features, not verdicts. Its limits become design constraints, not embarrassments.

Karpathy's council, even with its underwhelming chairman synthesis, lives in the second camp. The negative result is the most valuable line in the readme — here is the thing that didn't work the way I expected, and here is honestly why. That's why the project is worth taking seriously, and why most "autonomous AI agent" demos are not.


Closing

I am not running this experiment because I expect a council of frontier models to be substantially smarter, on raw accuracy, than the best single model on the day I run it.

I'm running it because I expect it to give me a better-calibrated uncertainty signal. And in trading — in any decision domain with asymmetric losses and a non-stationary distribution underneath — calibration is what separates a system from a slot machine. A model that knows when it doesn't know is worth more than a model that's slightly more accurate but always confident.

If the experiment works, the gain isn't "the council is smarter than the single model". The gain is "the council can tell me, with a measurable structured signal, when not to trust the council itself".

That's the only kind of intelligence improvement that actually deploys.


kaido.team — one operator, a fleet of agents, under one flag.
