May 3, 2026

AI Code Review: Four Pipelines Worth Stealing

A compact map of what Codex, Claude Code, Gemini CLI, and Cursor-shaped CI each optimize for, plus two copy-paste review prompts you can drop into agents or repos.

If you squint at the ecosystem long enough, four distinct “species” of automated review emerge. They are not interchangeable: each one bakes in a different theory of what a good review is.

Codex optimizes for a strict rubric and output shape: severity bands, line-grounded claims, and a final verdict tied to the diff. Claude Code pushes toward a fleet of specialized reviewers plus an explicit false-positive filter. Gemini’s code-review path hard-enforces changed-line discipline so noise does not spill across untouched code. Cursor-flavored CI tends toward a practical contract: fixed artifacts (review.md, JSON comments, verdict files) that humans and scripts can consume.

The point is not brand loyalty. The point is to compose: steal the skeleton that matches your failure mode, then wire it into whatever agent or Action you already run.

Codex: the baseline rubric

OpenAI’s Codex ships a review_prompt.md that is unusually explicit about scope control. It asks reviewers to flag only issues that materially affect correctness, performance, security, or maintainability; only what the diff introduces; no speculation and no pre-existing noise. It also insists you do not stop at the first finding—a direct counter to the lazy “one nit and done” failure mode.

The output shape is worth copying even if you never run Codex: P0–P3, short titles, tight line ranges that overlap the diff, and a closing verdict (patch is correct / incorrect) with confidence. That structure forces the model to argue like an engineer, not like a chatbot.
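
If you want the line-grounding rule as an actual gate rather than a vibe, it is cheap to approximate. The sketch below is a rough illustration, not Codex's implementation: the finding fields (file, start, end) are invented for the example, and the hunk parsing is deliberately coarse.

```python
import re

HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def changed_ranges(diff_text: str) -> dict[str, list[tuple[int, int]]]:
    """Map each file in a unified diff to its hunk spans on the new-file side."""
    spans: dict[str, list[tuple[int, int]]] = {}
    path = None
    for line in diff_text.splitlines():
        if line.startswith("+++ b/"):
            path = line[6:]
        elif path and (m := HUNK_RE.match(line)):
            start = int(m.group(1))
            count = int(m.group(2) or "1")
            # Coarse on purpose: hunk spans include context lines too.
            spans.setdefault(path, []).append((start, start + count - 1))
    return spans

def overlaps_diff(finding: dict, diff_text: str) -> bool:
    """Drop any finding whose cited range never touches a changed hunk."""
    hunks = changed_ranges(diff_text).get(finding["file"], [])
    return any(finding["start"] <= hi and finding["end"] >= lo for lo, hi in hunks)
```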

Source: openai/codex, codex-rs/core/review_prompt.md

Claude Code: multi-agent review with validation

Claude’s documented code-review flow behaves like a fleet: multiple passes over the diff and broader codebase, then a separate verification step aimed at stripping false positives, deduplicating findings, and ranking what remains. In the open plugin material, the pattern is even clearer—parallel agents checking repo instructions versus hunting bugs, then subagents that try to invalidate each candidate finding. False positives are treated as actively harmful, which is the right incentive gradient for automation at scale.

If your reviews feel “smart but noisy,” this is the half to steal: generate candidates generously, then run a dedicated disproof pass before anything touches a human.
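
A minimal sketch of that two-stage shape, assuming a call_model wrapper standing in for whatever agent or API you actually run; the JSON schemas are likewise invented for the example.

```python
import json

def call_model(prompt: str) -> str:
    """Stand-in for your real agent/API client. Not a real implementation."""
    raise NotImplementedError

def review_with_disproof(diff: str, threshold: float = 0.75) -> list[dict]:
    # Stage 1: generate candidates generously; optimize for recall.
    raw = call_model(
        "List every plausible defect in this diff as a JSON array of "
        '{"file": ..., "line": ..., "claim": ...} objects:\n' + diff
    )
    candidates = json.loads(raw)

    # Stage 2: a dedicated disproof pass per candidate; optimize for precision.
    survivors = []
    for c in candidates:
        verdict = json.loads(call_model(
            "Try to DISPROVE this finding against the diff. Reply as JSON "
            '{"holds": true|false, "confidence": 0.0-1.0}.\n'
            f"Finding: {json.dumps(c)}\nDiff:\n{diff}"
        ))
        if verdict["holds"] and verdict["confidence"] >= threshold:
            survivors.append(c)
    return survivors
```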

Sources: Code Review — Claude Code Docs, Anthropic plugins/code-review command

Gemini CLI: changed-line guardrails

The Gemini code-review extension exposes /code-review and /pr-code-review. The PR-oriented path pulls metadata plus the diff, activates the shared code-review-commons skill, can add inline comments, and submits the result as COMMENT rather than an automatic approve or request-changes, which keeps the human gate intact.

The shared skill is the sharp bit: understand intent first, read changed files and the diff, go deeper on application logic when needed, but only attach comments to lines that actually changed (+ / -). That single rule removes a staggering amount of drive-by commentary.
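
The rule is also cheap to enforce mechanically. Here is a rough sketch, with invented comment fields (path, line), that tracks new-file line numbers and admits comments only on added lines; deleted lines would need the same walk on the old side.

```python
import re

HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")

def added_lines(diff_text: str) -> dict[str, set[int]]:
    """Exact new-file line numbers introduced by '+' lines in a unified diff."""
    added: dict[str, set[int]] = {}
    path, lineno = None, 0
    for raw in diff_text.splitlines():
        if raw.startswith("+++ b/"):
            path = raw[6:]
            added.setdefault(path, set())
        elif (m := HUNK.match(raw)):
            lineno = int(m.group(1))
        elif path and raw.startswith("+") and not raw.startswith("+++"):
            added[path].add(lineno)
            lineno += 1
        elif path and raw.startswith(" "):
            lineno += 1  # context lines advance the new-file counter
        # '-' lines exist only on the old side and do not advance lineno
    return added

def allowed(comment: dict, diff_text: str) -> bool:
    """Reject any inline comment that does not sit on an added line."""
    return comment["line"] in added_lines(diff_text).get(comment["path"], set())
```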

Sources: gemini-cli-extensions/code-review README, code-review-commons skill

Cursor and CI: artifacts as the contract

There is no single “canonical Cursor review prompt” in the same sense as the files above, but public workflows show a repeatable shape. Streamlit’s workflow drives cursor-agent with PR context from gh, asks for inline_comments.json, review.md, and verdict.txt, then publishes those artifacts. Separating “narrative review” from “machine-ingestible comments” from “merge verdict” is boring engineering—and therefore robust.
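
As a toy illustration of why fixed artifacts pay off, here is a hypothetical CI gate consuming them. The file names match the workflow described above; the gating logic is my own assumption, not Streamlit's.

```python
import json
import pathlib
import sys

verdict = pathlib.Path("verdict.txt").read_text().strip()
comments = json.loads(pathlib.Path("inline_comments.json").read_text())

print(pathlib.Path("review.md").read_text())  # narrative, for humans
print(f"{len(comments)} inline comment(s); verdict: {verdict}")

# Scripts key off the verdict file alone; the narrative never gates the build.
sys.exit(0 if verdict == "correct" else 1)
```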

AutoAgent pushes the same idea further: one Action can orchestrate Cursor, Claude, Gemini, Codex, and others, with prompt templates such as base.prompt, code-review.prompt, and comment.prompt checked into the repo. Your review becomes versioned infrastructure, not a one-off chat.
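
A sketch of what the render step might look like, assuming $-style placeholders and a prompts/ directory, neither of which AutoAgent mandates:

```python
import pathlib
import string
import subprocess

def gh(*args: str) -> str:
    """Thin wrapper over the GitHub CLI."""
    return subprocess.run(["gh", *args], capture_output=True, text=True,
                          check=True).stdout

# Template file name from the repo; placeholder names are this sketch's invention.
template = string.Template(pathlib.Path("prompts/code-review.prompt").read_text())
prompt = template.substitute(
    pr_title=gh("pr", "view", "--json", "title", "-q", ".title").strip(),
    diff=gh("pr", "diff"),
)
# `prompt` now goes to whichever agent the workflow orchestrates.
```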

Sources: Streamlit ai-pr-review.yml, erans/autoagent-action


Ready-to-use prompt (save as review.prompt.md)

Use this as the long-form default. It is strict about evidence, forbids polite summarization as a substitute for analysis, and requires multiple passes before the final answer.

You are a senior code reviewer reviewing a proposed code change.

Your job is NOT to summarize politely.
Your job is to find real, actionable defects introduced by this change.

## Review target

Review the current change against the intended base branch.

If this is a PR:
- Read the PR title/body/metadata if available.
- Inspect the changed files and unified diff.
- Use `gh pr diff`, `gh pr view`, and `git diff` as needed.

If this is local work:
- Inspect staged, unstaged, and untracked changes.
- Use:
  - `git status --porcelain`
  - `git diff`
  - `git diff --staged`
  - `git ls-files --others --exclude-standard`

Before judging, identify:
1. The apparent intent of the change.
2. The changed files.
3. The contracts touched: APIs, schemas, DTOs, storage, async flows, validation, metrics, gates, configs, migrations, and external behavior.

## Mandatory context pass

Read only the context needed to validate the changed behavior:
- Repository instructions: `AGENTS.md`, `CLAUDE.md`, `GEMINI.md`, `REVIEW.md`, or equivalent if present.
- Files directly changed.
- Files imported by or structurally adjacent to changed files when needed to validate behavior.
- Relevant type definitions, config registries, tests, schemas, and call sites.

Do not do broad unrelated codebase auditing.

## What to flag

Flag an issue only if ALL are true:

1. It was introduced or made materially worse by this change.
2. It has a provable impact on correctness, runtime behavior, data integrity, security, performance, operational safety, API/schema contract, or long-term maintainability.
3. It is discrete and actionable.
4. The author would likely fix it if they understood it.
5. You can cite the exact file and minimal line range.
6. The finding does not rely on unstated assumptions about intent.

Do not flag:
- Pre-existing issues.
- Generic style opinions.
- Formatting/lint/type issues that CI or the compiler would catch, unless the breakage is central to the review.
- Unused code, unused variables, extra touched files, or scope noise unless they cause a real behavioral/logic problem.
- Missing tests as a standalone issue unless the change creates an unvalidated contract or critical regression risk.
- “Consider”, “ensure”, “verify”, or vague advice without a concrete defect.
- Anything you cannot validate with code evidence.

## Deep analysis requirements

Perform these passes before producing the final answer:

### Pass 1 — Intent and surface area
Understand what the change is trying to do and which behavior should remain invariant.

### Pass 2 — Correctness
Trace the changed logic through real call paths.
Look for:
- inverted conditions
- missing gates
- wrong defaults
- stale assumptions
- off-by-one errors
- bad null/undefined handling
- incorrect fallback semantics
- silent degradation that hides broken data
- async/race/order-of-operations bugs
- broken enum/string discriminants
- schema/API/contract mismatch
- metrics or observability lying about state

### Pass 3 — Integration/contracts
Check whether producers and consumers still agree:
- DTOs / Zod / OpenAPI / protobuf / DB schema
- config keys / registry dispatch
- event names / metric labels
- feature flags / defaults
- backwards compatibility
- migrations and rollback safety

### Pass 4 — Operational risk
Look for bugs that create:
- false positives / false negatives
- silent skips
- misleading “success”
- bad fallbacks
- degraded alerting
- production-only failure modes
- expensive loops / N+1 / memory pressure
- security or permission regression

### Pass 5 — Finding validation
For every candidate finding, try to disprove it.
Keep it only if it survives validation.
If confidence is below 0.75, drop it; keep it as a non-blocking question only if it is genuinely important.

Do not stop after the first issue.
Return every qualifying issue.

## Severity

Use:

- [P0] Release/ops blocking. Universal severe failure.
- [P1] Urgent. Likely production or correctness regression.
- [P2] Normal. Real defect worth fixing.
- [P3] Low. Minor but concrete issue.

Do not inflate severity.
False positives are worse than silence.

## Output format

Return only the final review. No hidden reasoning transcript.

Use this structure:

# Review scope
- Base/target inspected:
- Changed files reviewed:
- Extra context read:

# Findings

If no qualifying findings:
`No qualifying findings. Patch appears correct under the reviewed scope.`

Otherwise, for each finding:

## [P?] Short imperative title

**Location:** `path/to/file.ts:Lx-Ly`
**Confidence:** 0.xx
**Why this is a problem:** One concise paragraph explaining the concrete failure mode.
**Scenario:** The specific input/state/path where it breaks.
**Recommended fix:** Concrete action. Include a tiny code suggestion only if it fully fixes the issue.

# Human reviewer callouts (non-blocking)

Include only applicable items:
- Database migration / irreversible operation
- Dependency or lockfile change
- Auth/permission behavior change
- Public API/schema/contract change
- Observability/metrics semantics change
- Backwards compatibility risk
- Operational rollout risk

If none:
- (none)

# Verdict

`correct` if there are no blocking correctness/security/contract issues.
`needs attention` if at least one P0/P1/P2 finding should be fixed before merge.

Add a 1–3 sentence explanation.

Shorter “emergency paste” version

When you need speed and zero fluff:

Review the current diff as a senior engineer.

Find only real defects introduced by this change. Do not summarize politely. Do not comment on style, unused code, extra touched files, missing tests, or pre-existing problems unless they create a concrete behavioral/logic/security/contract issue.

Process:
1. Inspect `git status`, `git diff`, `git diff --staged`, and relevant changed files.
2. Read nearby/imported files only when needed to validate behavior.
3. Identify the intent of the change.
4. Trace correctness, contracts, async/error paths, fallbacks, schemas, config keys, metrics, and operational risks.
5. For every candidate issue, try to disprove it. Keep only validated findings.
6. Do not stop after the first issue.

Flag only issues that are:
- introduced by this diff
- provably impactful
- discrete and actionable
- likely to be fixed by the author
- supported by exact file/line evidence

Output:
- Findings with `[P0-P3]`, file:line, confidence, failure scenario, and concrete fix.
- Human reviewer callouts for migrations/deps/auth/API/metrics/compatibility/destructive operations.
- Final verdict: `correct` or `needs attention`.

If there are no qualifying findings, say exactly:
`No qualifying findings. Patch appears correct under the reviewed scope.`

How I would deploy this

Keep the long prompt as review.prompt.md in the repo (or next to your agent config) so it versions with your standards. Keep the short block in a snippet expander or internal runbook for the “we are on fire but still need a real review” moments. The long one has structure; the short one still has claws—it refuses the common failure modes (performative politeness, scope creep, first-bug-stop).

If you adopt nothing else, adopt this triad: line-grounded findings, explicit false-positive control, and artifacts humans can grep.